Prompt libraries are coverage optimization problems
5 minute read
Max Wiesner
Co-founder, KnitKnot
The library looked full
A prompt count is not coverage. It’s inventory.
The first version of our prompt library had the shape every AI evaluation product eventually drifts toward: generate a lot of questions, group them into categories, run them across models, call the result coverage. It looked serious. Brand perception prompts. Landscape prompts. Head-to-head comparisons. Feature-specific prompts. Persona prompts. If you opened the table, the count was high.
Then we started reading the actual results.
Some competitors showed up in every other prompt. Some features barely appeared at all. Some buyer personas existed in the company research but never made it into the active benchmark. A product with only a few obvious competitors would run out of head-to-head prompts, so the library quietly backfilled with broad landscape questions nobody actually searches for.
The benchmark looked complete because the number was high. But the measurement surface had holes.
Coverage is multi-dimensional
If a buyer is comparing you to Datadog for incident response, that’s one measurement cell. If a VP of Engineering is comparing you to Datadog for audit logging, that’s another. If a security lead is comparing you to Datadog because they need SOC 2 evidence by next quarter, that’s a third.
Those aren’t paraphrases. They’re different buyer situations. The model may recommend you in one and the competitor in another. It may know you support a feature but fail to mention it for the persona who cares most.
Once you see that, “generate 100 good prompts” stops being the right objective. The objective is to cover the buyer-decision space. That space has at least five dimensions:
Competitor. A benchmark that tests only one rival isn’t useful if your sales team hears five names in discovery.
Feature. “Acme vs Widgetly” is too broad. Buyers ask about deployment, governance, pricing, integrations, migration. If the library misses the features where AI is confused, the benchmark misses the reason you lose.
Persona. The same product gets evaluated by a founder, a CISO, a data engineer, and a RevOps lead. They ask different questions because they’re buying different risk reductions.
Demand. Some prompts map to real Google searches with measurable volume. Some are synthesized to test an uncovered edge. Those shouldn’t be treated as equivalent.
Subject scope. If a company sells multiple products, a company-wide prompt library collapses distinct products into one subject. The AI may know the parent brand but confuse the products.
A library can have hundreds of prompts and still be bad. It can be dense in the wrong places.
What we did about it
We flattened the library. No more prompt sets, categories, or template families. A prompt is a row with structured tags: competitor, feature, persona, product, search volume, source. Once it’s a row with dimensions, you can ask better questions: which competitors are under-covered? Which features have no active prompt? Which personas exist in research but never appear in the benchmark? A Library Health view answers these continuously, and rebuilds preview their changes before anything is committed.
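To make the "row with dimensions" idea concrete, here is a minimal sketch of such a row and the gap questions it enables. The field names, the `PromptRow` class, the helper functions, and the competitor "Gizmo" are illustrative assumptions, not KnitKnot's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptRow:
    """One prompt as a flat row. All field names are hypothetical."""
    text: str
    competitor: Optional[str] = None   # None means a landscape prompt
    feature: Optional[str] = None
    persona: Optional[str] = None
    product: Optional[str] = None
    search_volume: int = 0             # 0 means no measured demand
    source: str = "synthesized"        # e.g. "keyword", "manual", "synthesized"
    active: bool = True


def uncovered(rows, dimension, expected):
    """Return expected values of `dimension` that no active prompt carries."""
    seen = {getattr(r, dimension) for r in rows if r.active}
    return sorted(set(expected) - seen)


def coverage_counts(rows, dimension):
    """Count active prompts per value of `dimension` to spot over-dense areas."""
    return Counter(getattr(r, dimension) for r in rows if r.active)


rows = [
    PromptRow("Acme vs Widgetly", competitor="Widgetly"),
    PromptRow("Acme vs Widgetly for audit logging", competitor="Widgetly",
              feature="audit logging", persona="security lead"),
    # Inactive rows exist in the library but do not count as coverage.
    PromptRow("Acme vs Gizmo for pricing", competitor="Gizmo",
              feature="pricing", active=False),
]

missing_personas = uncovered(rows, "persona",
                             ["security lead", "founder", "RevOps lead"])
missing_competitors = uncovered(rows, "competitor", ["Widgetly", "Gizmo"])
```

Here "Gizmo" is reported as uncovered even though a Gizmo prompt exists, because that prompt is not active. That is the kind of hole a raw prompt count hides.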
We also made generation keyword-first. Prompts are built around real Google search data, with measured search volume attached to each prompt, rather than generated freely and volume-checked later. The demand dimension stops being an afterthought and becomes the seed. In practice that lands at roughly 50 prompts per subject, around 300 per workspace, and each product line gets its own subject-scoped library so the parent brand and its products are benchmarked separately.
Then we set an explicit composition target. For B2B AI presence, the most valuable prompts are head-to-head comparisons. That’s where recommendations happen. So the active library is roughly 90% head-to-head and 10% landscape.
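As a rough illustration, a composition target can be written down as a slot budget and checked against what the active set actually contains. The function names are hypothetical, and the budget of 50 simply borrows the per-subject figure mentioned above.

```python
def composition_slots(budget, head_to_head_share=0.9):
    """Split an active-prompt budget into head-to-head and landscape slots."""
    head_to_head = round(budget * head_to_head_share)
    return {"head_to_head": head_to_head, "landscape": budget - head_to_head}


def actual_share(competitor_tags):
    """Fraction of active prompts that name a competitor (None = landscape)."""
    if not competitor_tags:
        return 0.0
    return sum(tag is not None for tag in competitor_tags) / len(competitor_tags)


# Target for one subject with roughly 50 active prompts.
slots = composition_slots(50)

# A drifted active set: two comparisons, three landscape prompts.
drifted = actual_share(["Widgetly", "Widgetly", None, None, None])
```

Comparing `actual_share` with the target share is one simple way to notice when landscape prompts have crowded out comparisons.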
That ratio matters because without it, the library follows the path of least resistance. If broad category prompts are easier to generate, they crowd out comparisons. If one competitor has more search volume, it monopolizes the active set.
We hit this directly. A niche product with only two real competitors would run out of distinct comparisons and backfill with generic landscape prompts. The benchmark looked healthy by count, but it wasn’t testing buyer moments anymore.
The fix was to generate from the full cross-product of dimensions:
Acme vs Widgetly
Acme vs Widgetly for deployment controls
Acme vs Widgetly for a security lead
Acme vs Widgetly for deployment controls as a security lead
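That’s not prompt spam. Even three competitors, five features, and six personas create 90 fully specified buyer situations (3 × 5 × 6), or 126 in total once the feature and persona may each be left open as in the shorter examples above, without repeating the same generic comparison.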
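The cross-product can be sketched with `itertools.product`. This is an illustration under assumptions, not the production generator: the `buyer_situations` function, the competitors "Gizmo" and "Initech", and the exact phrasing rules are made up to mirror the four example prompts above.

```python
from itertools import product


def buyer_situations(subject, competitors, features, personas):
    """Yield one prompt per (competitor, feature, persona) cell.

    None in the feature or persona slot means "left unspecified", which
    reproduces the shorter variants such as plain "Acme vs Widgetly".
    """
    for competitor, feature, persona in product(
        competitors, [None, *features], [None, *personas]
    ):
        text = f"{subject} vs {competitor}"
        if feature:
            text += f" for {feature}"
        if persona:
            # Phrasing follows the example prompts in the text.
            text += f" as a {persona}" if feature else f" for a {persona}"
        yield {"text": text, "competitor": competitor,
               "feature": feature, "persona": persona}


competitors = ["Widgetly", "Gizmo", "Initech"]
features = ["deployment controls", "audit logging", "pricing",
            "integrations", "migration"]
personas = ["security lead", "founder", "CISO", "data engineer",
            "RevOps lead", "VP of Engineering"]

cells = list(buyer_situations("Acme", competitors, features, personas))
fully_specified = [c for c in cells if c["feature"] and c["persona"]]

print(len(cells), len(fully_specified))
```

Each yielded dict keeps its tags, so the generated prompts stay queryable by dimension rather than becoming anonymous strings.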
Finally, we rank before activating. Real search volume beats no volume. Starred manual prompts stay active. Competitors fill evenly instead of letting the highest-volume rival take every slot. Semantic deduplication removes prompts that are different strings but the same buyer question. And every prompt gets an activation reason: starred, competitor, landscape, or over budget. If a customer sees a weird benchmark question, they can tell where it came from.
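The ranking rules above can be sketched as one small function. This is a simplified illustration with assumed names and an assumed ordering of steps: exact-text normalization stands in for semantic deduplication (which would need an embedding model), and the prompt dicts, "Gizmo", and "best deployment tools" are hypothetical.

```python
from collections import defaultdict, deque


def activate(prompts, budget, landscape_slots):
    """Return {prompt text: activation reason} for a list of prompt dicts."""
    # Stand-in for semantic dedup: collapse case and whitespace variants only.
    unique = {}
    for p in prompts:
        unique.setdefault(" ".join(p["text"].lower().split()), p)
    pool = list(unique.values())

    reasons = {}
    # Starred manual prompts always stay active.
    for p in pool:
        if p.get("starred"):
            reasons[p["text"]] = "starred"
    remaining = max(budget - len(reasons), 0)

    # Landscape prompts take a small fixed share, highest search volume first.
    landscape = sorted(
        (p for p in pool if p["competitor"] is None and p["text"] not in reasons),
        key=lambda p: -p["search_volume"],
    )
    for p in landscape[: min(landscape_slots, remaining)]:
        reasons[p["text"]] = "landscape"
    remaining = max(budget - len(reasons), 0)

    # Queue head-to-head prompts per competitor, highest search volume first.
    queues = defaultdict(deque)
    for p in sorted(pool, key=lambda p: -p["search_volume"]):
        if p["competitor"] is not None and p["text"] not in reasons:
            queues[p["competitor"]].append(p)

    # Round-robin so the highest-volume rival cannot take every slot.
    while remaining > 0 and any(queues.values()):
        for name in sorted(queues):
            if remaining > 0 and queues[name]:
                reasons[queues[name].popleft()["text"]] = "competitor"
                remaining -= 1

    # Everything else stays in the library but is not activated.
    for p in pool:
        reasons.setdefault(p["text"], "over budget")
    return reasons


prompts = [
    {"text": "Acme vs Widgetly", "competitor": "Widgetly", "search_volume": 900},
    {"text": "Acme vs Widgetly pricing", "competitor": "Widgetly", "search_volume": 700},
    {"text": "Acme vs Widgetly for a CISO", "competitor": "Widgetly", "search_volume": 500},
    {"text": "Acme vs Gizmo", "competitor": "Gizmo", "search_volume": 40},
    {"text": "acme  vs gizmo", "competitor": "Gizmo", "search_volume": 40},
    {"text": "Acme vs Gizmo for migration", "competitor": "Gizmo",
     "search_volume": 0, "starred": True},
    {"text": "best deployment tools", "competitor": None, "search_volume": 1200},
]

reasons = activate(prompts, budget=4, landscape_slots=1)
```

With a budget of four, Widgetly has far more search volume than Gizmo but still gets only one competitor slot, and the two lower-ranked Widgetly prompts are marked "over budget" rather than silently dropped.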
Why this matters
AI benchmarks are fragile because the prompt is half the measurement. If the prompts are too adversarial, customers reject the benchmark as rigged. If the prompts are too generic, the benchmark never finds the real gaps. If they cluster around one competitor, the score overfits to one sales motion.
The fix isn’t more prompts. It’s better allocation.
We stopped asking “how many prompts do we have?” and started asking “what buyer situations are still uncovered?” That’s the question that makes a benchmark useful.