Before

"Auditly seems stronger on enterprise audit prep — should I switch from Compliantly?"

After

"I'm evaluating Auditly and Compliantly — how do they compare on enterprise audit prep?"

Rebuilding prompt generation after a customer called us out

Kevin Kho · Co-founder, KnitKnot · 6 min read

The 30 seconds that triggered the rebuild

Two weeks ago we walked Drata’s head of growth through a benchmark report we’d generated for them. Drata is one of the leaders in the compliance automation category — they have a real perspective on how they get represented online, and a real opinion about what fair benchmarking looks like.

About a third of the way in, he stopped me.

“It looks like you engineered Vanta to win this.”

He wasn’t being hostile. He was being correct. We had a prompt that read, roughly: “Vanta seems stronger on enterprise audit prep — should I switch from Drata?” And another: “As a CISO committed to Vanta, why might I still consider Drata?”

In isolation, each of those sounds like a buyer might actually ask it. Together, across an entire benchmark, the pattern was clear: our prompts kept positioning the competitor as the incumbent and our customer as the challenger. We were leading the witness — the AI — toward an answer.

If your job is to publish “AI Presence Scores” that companies hand to their boards and post on their websites, you cannot afford for one prospect to read your output and say “this looks rigged.” You can’t recover from that. So we tore the prompt generator down.

What was actually wrong

We ran an audit script over every prompt we’d ever generated, looking for three patterns:

  1. Pre-stated strengths. Phrases like “X seems stronger on Y” or “X is known for Z” baked the answer into the question.
  2. Asymmetric loading. Phrases like “as a CISO committed to X” or “having used X for two years” only ever appeared on one side of a comparison — the competitor’s side.
  3. Product-name doubling. A prompt comparing “Drata’s Adaptive Automation” to “Vanta” was giving Drata one extra word of airtime per occurrence, and the AI noticed.

Across our prompt corpus, the share of comparison prompts carrying at least one of those markers ranged from 7 to 65 percent depending on the benchmark. The variance was the worst part: it meant some customers got cleaner benchmarks than others, by luck of the draw.
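
For the curious, here is roughly what that audit pass looks like. It's a minimal sketch: the marker phrases, the field names, and the doubling heuristic are illustrative stand-ins rather than the exact production rules.

```python
import re
from dataclasses import dataclass

# Illustrative marker phrases; the production lists are longer and category-specific.
PRE_STATED_STRENGTH = [r"\bseems stronger\b", r"\bis known for\b", r"\bis better at\b"]
ASYMMETRIC_LOADING = [r"\bcommitted to\b", r"\bhaving used\b"]

@dataclass
class ComparisonPrompt:
    text: str
    customer: str      # the brand the benchmark is for, e.g. "Drata"
    competitor: str    # the brand it is compared against, e.g. "Vanta"

def audit(prompt: ComparisonPrompt) -> list[str]:
    """Return the bias markers found in a single comparison prompt."""
    flags = []
    text = prompt.text.lower()

    def loaded(brand: str) -> bool:
        # Commitment language directly attached to this brand.
        return any(re.search(p + r"\s+" + re.escape(brand.lower()) + r"\b", text)
                   for p in ASYMMETRIC_LOADING)

    def has_product_phrase(brand: str) -> bool:
        # Possessive plus a trailing product phrase, e.g. "Drata's Adaptive Automation".
        return re.search(re.escape(brand.lower()) + r"['’]s\s+\w+", text) is not None

    # 1. Pre-stated strengths: the question asserts the answer before asking it.
    if any(re.search(p, text) for p in PRE_STATED_STRENGTH):
        flags.append("pre_stated_strength")

    # 2. Asymmetric loading: commitment language on one side of the comparison only.
    if loaded(prompt.customer) != loaded(prompt.competitor):
        flags.append("asymmetric_loading")

    # 3. Product-name doubling: one brand gets extra branded airtime, the other is bare.
    if has_product_phrase(prompt.customer) != has_product_phrase(prompt.competitor):
        flags.append("product_name_doubling")

    return flags
```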

The rule we landed on

Every comparison prompt now has to pass a one-line test:

Swap the two vendor names. Does the question read identically?

If yes, ship it. If no, the prompt is biased — kill it and regenerate.

This is shockingly powerful as a rule, because it makes bias visible to a non-engineer. A growth marketer can read a prompt aloud, mentally swap the names, and instantly tell whether the framing is fair. Our prompts now read like genuinely conflicted buyer questions — “I’m evaluating Drata and Vanta — how do they compare on enterprise audit prep?” — instead of like leading questions in a deposition.
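
The mechanical half of the test automates in a few lines. Here is a sketch of the swap itself; the judgment call about whether the result "reads identically" still belongs to whoever is reviewing the prompt.

```python
import re

def swap_vendor_names(prompt: str, vendor_a: str, vendor_b: str) -> str:
    """Exchange every mention of the two vendors so a reviewer can read both
    versions side by side and judge whether the framing is fair."""
    token = "\x00SWAP\x00"  # temporary marker so the two substitutions don't collide
    swapped = re.sub(re.escape(vendor_a), token, prompt, flags=re.IGNORECASE)
    swapped = re.sub(re.escape(vendor_b), vendor_a, swapped, flags=re.IGNORECASE)
    return swapped.replace(token, vendor_b)

biased = "Vanta seems stronger on enterprise audit prep -- should I switch from Drata?"
fair = "I'm evaluating Drata and Vanta -- how do they compare on enterprise audit prep?"

print(swap_vendor_names(biased, "Drata", "Vanta"))
# -> "Drata seems stronger on enterprise audit prep -- should I switch from Vanta?"
#    A different question. Kill it and regenerate.
print(swap_vendor_names(fair, "Drata", "Vanta"))
# -> "I'm evaluating Vanta and Drata -- how do they compare on enterprise audit prep?"
#    Same question either way. Ship it.
```

Seeing the two versions side by side makes the asymmetry obvious at a glance, which is the point: the judge can be a growth marketer, not an engineer.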

The new audit numbers: 0% of the new corpus carries any of the three bias markers, against 7-65% before.

But honest prompts aren’t enough

Here’s what the bias fix didn’t solve. Even a perfectly symmetric prompt is only as defensible as the question itself. If we generate “Compare Drata and Vanta on documentation aesthetics” — a question no buyer has ever typed into ChatGPT — we are still measuring something that doesn’t matter.

So while we were down at the foundation, we asked a bigger question: what if every prompt we generate had to be tied to a real Google search that real buyers are actually running?

Grounding prompts in real searches

The new pipeline works like this. Before generating any prompt, we go to Google’s keyword data — the same source SEO teams have been using for fifteen years — and pull the actual queries buyers in your category run, along with their monthly search volume.

For Drata, that surfaces queries like:

  • “drata vs vanta” — 260 searches/month
  • “compliance automation tools” — 880 searches/month
  • “soc 2 audit software” — 720 searches/month
  • “vanta alternatives” — 1,300 searches/month

We feed those real queries to the prompt generator and constrain it to mirror them — same intent, same hypotheticals, but expressed as a buyer would phrase it to ChatGPT. Each prompt carries the source query and its volume with it through the rest of the pipeline.
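
Mechanically, the grounding is just metadata that never gets dropped. Here is a minimal sketch of the shape, where `rewrite_as_buyer_question` is a stand-in for the actual prompt generator (an LLM call constrained to preserve the query's intent):

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class SeedQuery:
    query: str            # e.g. "drata vs vanta"
    monthly_volume: int   # e.g. 260

@dataclass
class GroundedPrompt:
    text: str             # the buyer-style question we put to the AI
    source_query: str     # the real Google query it mirrors
    monthly_volume: int   # that query's monthly search volume

def generate_grounded_prompts(
    seeds: Iterable[SeedQuery],
    rewrite_as_buyer_question: Callable[[str], str],
) -> list[GroundedPrompt]:
    """Every generated prompt carries its source query and volume downstream."""
    return [
        GroundedPrompt(
            text=rewrite_as_buyer_question(seed.query),
            source_query=seed.query,
            monthly_volume=seed.monthly_volume,
        )
        for seed in seeds
    ]

# Seeds matching the examples above; in production they come from keyword data.
seeds = [
    SeedQuery("vanta alternatives", 1300),
    SeedQuery("compliance automation tools", 880),
    SeedQuery("soc 2 audit software", 720),
    SeedQuery("drata vs vanta", 260),
]
```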

What it looks like in the product

We added a Volume column to the prompts table. It’s sortable. Click the header and your highest-attention prompts float to the top. The prompts that mirror “vanta alternatives” — a 1,300/month query — sort above prompts mirroring lower-volume searches.

Hover a number and the tooltip reads in plain English: “Mirrors the Google query ‘drata vs vanta’ which gets 260 searches/month. Buyers are asking AI this question right now.”

A new Sources column shows the source query as a chip; click it to open the Google search and verify for yourself.
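
Both the sort and the tooltip are straight reads off that carried-along metadata. A small sketch, reusing the GroundedPrompt shape from the previous snippet (the function names are ours; the tooltip copy is the string quoted above):

```python
def sort_by_volume(prompts: list[GroundedPrompt]) -> list[GroundedPrompt]:
    """Highest-attention prompts float to the top of the table."""
    return sorted(prompts, key=lambda p: p.monthly_volume, reverse=True)

def volume_tooltip(prompt: GroundedPrompt) -> str:
    """Plain-English hover text for the Volume column."""
    return (
        f"Mirrors the Google query '{prompt.source_query}' which gets "
        f"{prompt.monthly_volume} searches/month. "
        "Buyers are asking AI this question right now."
    )
```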

What about new companies with no search volume?

The first version had a flaw: a brand-new company with no measurable search volume yet would get an empty benchmark. That’s a bad first experience.

So we built a tiered fallback:

  1. Direct grounding — your brand’s own queries (best signal).
  2. Category grounding — your category’s queries when your brand doesn’t have measurable volume yet (still better than nothing).
  3. Synthesized — clearly labeled as such in the UI when neither tier has data, so customers know exactly what they’re looking at.

The honesty here matters more than the cleverness. When a prompt is synthesized, we say so — we don’t quietly borrow a competitor’s search source and pretend it applies. A KnitKnot benchmark is now something a customer can show to a skeptical buyer and say “every question in here is grounded in a real search query — here’s the volume.”
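
A sketch of that decision logic, with the tier label traveling alongside the queries so the UI can mark synthesized prompts; the names here are ours:

```python
from enum import Enum

class GroundingTier(Enum):
    DIRECT = "direct"            # your brand's own queries
    CATEGORY = "category"        # your category's queries
    SYNTHESIZED = "synthesized"  # no search data behind it; labeled as such in the UI

def pick_grounding(brand_queries: list, category_queries: list):
    """Best available signal wins; an empty list means no measurable volume
    at that tier. The tier is returned alongside the queries so every
    downstream prompt can say how it was grounded."""
    if brand_queries:
        return GroundingTier.DIRECT, brand_queries
    if category_queries:
        return GroundingTier.CATEGORY, category_queries
    return GroundingTier.SYNTHESIZED, []
```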

The deeper shift

This rebuild changed how we think about KnitKnot’s job. We used to think we were measuring “how AI talks about you.” That’s still true, but it’s incomplete. The complete framing is: measuring how AI answers the questions buyers are actually asking. The “actually asking” part is load-bearing. Without grounding, you’re measuring AI behavior on questions no one cares about.

We expect this to be the methodology line that gets cited in every sales conversation from now on: “Every prompt is symmetric and grounded in a real Google query with monthly volume. Read it yourself.”

That’s what good benchmarking looks like. Thanks to the Drata team for the kick.

What’s next

The current grounding source is Google search volume via DataForSEO. The next layer we’re working on is Reddit search behavior — what buyers ask each other in unmoderated forums often diverges from what they type into Google. When we add Reddit grounding, we’ll publish a follow-up here.

If you want a benchmark for your category — symmetric, grounded, defensible — book a slot.