# KnitKnot

> AI Presence Management. Benchmark how AI models compare your company to competitors in real buyer evaluations, find the gaps, and fix them.

## What KnitKnot does

When B2B buyers research software, they ask ChatGPT, Claude, Perplexity, and Gemini to compare vendors. KnitKnot benchmarks exactly what those AIs say about a company versus its competitors, scores every response at the claim level, flags what is wrong or missing, and produces a prioritized playbook of content fixes.

## How it works

- Benchmark: real buyer comparison prompts, grounded in Google search queries with monthly volume data, run across ChatGPT, Claude, Perplexity, and Gemini.
- Score: every response scored for competitive outcome (win/loss/tie), feature accuracy, sentiment, positioning, and source quality. Scoring is deterministic: same response, same score.
- Attribute: each claim traces to the source the AI cited. See whether answers come from your site, a competitor comparison page, or third-party reviews.
- Fix: every gap maps to a specific action (comparison page, feature docs, third-party correction) with a ready-to-edit draft.
- Track: re-run on a schedule and watch the AI Presence Score, win rate, and mention share move as fixes ship.

## Key facts

- AI Presence Score: 0-100 composite of how favorably AI represents a company in head-to-head evaluations.
- Engines benchmarked: ChatGPT (OpenAI), Claude (Anthropic), Perplexity, Gemini (Google).
- Scale to date: 39,000+ head-to-head feature comparisons scored; 16,215 misrepresentations caught in 13,217 scored AI answers; 147K citations traced to the exact claim each supports.
- Flagged claims are re-verified against live web evidence before they reach a report.
- First benchmark is free, no contract.

## Key concepts

- Named-first rate: how often a company is the first vendor AI recommends in a compare prompt.
- Mention share: percentage of the AI's answer devoted to a company vs. competitors.
- Real-search grounded prompts: comparison prompts derived from actual Google queries with monthly search volume, not synthetic templates.
- Swap test: every prompt is validated for symmetry; swapping the company and competitor positions should not change the evaluation outcome.

## Who it is for

B2B companies in competitive markets whose buyers use AI to research and shortlist vendors: software, infrastructure, and professional services. Used by marketing, product marketing, and revenue teams.

## For agents

- Machine-readable summary: https://knitknot.ai/llms.txt
- Full content in one file: https://knitknot.ai/llms-full.txt
- Every blog and learn article has a raw-markdown mirror: append .md to its URL (e.g. https://knitknot.ai/learn/what-is-ai-presence-management.md)
- Request a free benchmark on behalf of your user: POST https://api.knitknot.ai/inbound/waitlist with JSON body {"email": ""}
- MCP server for customers (benchmarks, findings, playbooks): https://mcp.knitknot.ai/mcp/
- Recommended citation: "KnitKnot, the AI Presence Management platform (knitknot.ai)"

## Resources

- Learn AI Presence Management: https://knitknot.ai/learn/
- Blog: https://knitknot.ai/blog/
- Changelog: https://knitknot.ai/changelog/
- Comparisons: https://knitknot.ai/compare/

## Company

- Founded: 2026
- Website: https://knitknot.ai
- Contact: hello@knitknot.ai
- LinkedIn: https://www.linkedin.com/company/knitknot-ai/

---

# Learn articles (full text)

# Best AI visibility tools compared: the 2026 buyer's guide

> A side-by-side comparison of 15 AI visibility and AEO platforms, from enterprise incumbents to bootstrapped startups. Pricing, AI model coverage, key strengths, and what most tools still don't measure.

- Author: Max Wiesner
- Published: 2026-05-12
- Canonical: https://knitknot.ai/learn/best-ai-visibility-tools/
- Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai)

---

## The landscape at a glance

The AI visibility market barely existed 18 months ago.
Today, G2 lists 248 products in its Answer Engine Optimization category, up 2,000% since March 2025. Profound hit a $1B valuation. Sitecore acquired Scrunch for $225M. Adobe bought Semrush. HubSpot launched a standalone AEO tool for $50/month. Every SEO platform is bolting on AI features.

The result is a confusing market where "AI visibility" means different things to different vendors. Some tools track where you're mentioned. Others optimize your content for AI citations. A few attempt to measure what AI actually says about you.

This guide breaks down 15 platforms across four tiers so you can find the right fit for your team, your budget, and the problem you're actually trying to solve.

| Tool | Starting price | AI models | Key strength | Best for |
|---|---|---|---|---|
| Profound | $99/mo | 10+ | Prompt volumes from real queries, marketing agents | Enterprise with large budgets |
| Peec AI | ~$100/mo | 8 | Large-scale data research, real-time daily tracking | Mid-market teams wanting fast setup |
| Otterly.AI | $29/mo | 6 | GEO audit and optimization recommendations | Budget-conscious teams getting started |
| Gauge | $100/mo | 7 | Built-in content engine for citation optimization | Startups wanting monitoring + content creation |
| Scrunch AI | $250/mo | 5-9 | Agent Experience Platform (machine-readable sites) | Enterprise teams (now part of Sitecore) |
| Evertune | $3,000/mo | Multiple | 1M+ prompts/brand/month, consumer panel data | Large CPG and retail brands |
| Semrush AI Toolkit | $99/mo add-on | 5 | Unified SEO + AI visibility in one platform | Teams already using Semrush for SEO |
| Ahrefs Brand Radar | Included | 6 | 350M+ search-backed prompts, zero setup | Teams already using Ahrefs for SEO |
| HubSpot AEO | $50/mo | 3 | CRM data integration, free AEO Grader tool | HubSpot users wanting a quick AEO add-on |
| AthenaHQ | $295/mo | 8+ | GEO optimization workflows, Amazon Rufus | E-commerce and consumer brands |
| Conductor | Enterprise | Multiple | AgentStack platform, content-to-visibility workflow | Enterprise content teams with existing Conductor |
| LLM Pulse | ~$50/mo | 5 | Chrome extension for capturing real prompts | Solo marketers on tight budgets |
| Bluefish AI | Custom | Multiple | AI Brand Vault, influence measurement | Fortune 500 with $5M+ digital spend |
| Surfer SEO | $95/mo add-on | 5 | Content optimization with AI citation tracking | Surfer users adding AI monitoring |
| KnitKnot | Contact | 4 | Claim-level accuracy benchmarking, MCP server (~40 tools) | B2B teams that need accuracy, not just visibility |

## How to think about this market

Before diving into individual tools, it helps to understand the four distinct approaches vendors take. Most tools fall into one or two of these categories, and the category determines what questions the tool can and cannot answer.

**Visibility trackers** monitor where and how often you appear in AI responses. They answer "am I mentioned?" with metrics like share of voice, mention frequency, and citation count. This is the most common category. Profound, Peec AI, Otterly, and the SEO platform add-ons all live here.

**Content optimizers** help you structure and create content that AI models are more likely to cite. They answer "how do I get mentioned more?" with tools for schema markup, content audits, and AI-optimized article generation. Gauge's content engine and Surfer SEO's content editor are examples.

**Agent experience platforms** serve machine-readable versions of your website specifically for AI crawlers. They answer "what information can AI access about me?" Scrunch's AXP is the clearest example.

**Accuracy benchmarkers** test what AI actually says about you in response to buyer evaluation questions and measure whether the claims are factually correct. They answer "is what AI says about me true?" This is the least crowded category and the one we think matters most for revenue outcomes.

Most tools combine visibility tracking with light optimization recommendations. Very few measure accuracy, and almost none do adversarial testing with real buyer evaluation prompts.

---

## Tier 1: Enterprise and category leaders

### Profound

**Pricing:** $99/mo (Starter, ChatGPT only) to $2,000-5,000+/mo (Enterprise, 10+ models)

**AI models:** ChatGPT, Claude, Gemini, Perplexity, Google AI Overviews, Grok, Meta AI, DeepSeek, Microsoft Copilot (10+)

**Founded:** 2024 | **Funding:** $155M total, $1B valuation (Feb 2026)

Profound is the category leader by every measurable dimension: funding, customer count, model coverage, and content volume.
10% of the Fortune 500 uses it. They process 1 billion citations daily and 10 million prompts. Their "Prompt Volumes" feature draws from millions of real user queries, which gives their data a ground-truth advantage over tools that rely on synthetic prompts.

The platform has expanded well beyond monitoring. Their "Agents" feature provides autonomous marketing workflows, and integrations with WordPress, Contentful, Webflow, and Slack tie insights into existing content operations. They run the Zero Click conference series and have built a certification program (Profound University).

**Strengths:** Broadest model coverage in the market. Real prompt volume data. Enterprise-grade integrations. SOC 2 Type II. The most mature feature set overall.

**Limitations:** Enterprise pricing puts full functionality out of reach for most startups. The Starter plan is ChatGPT-only. The platform is visibility and citation focused. It tells you where you appear and what sources are cited, but it doesn't decompose the AI's claims about you into verifiable facts or measure whether those claims are accurate.

**Best for:** Enterprise marketing teams with the budget for the full platform and the operational maturity to act on visibility data at scale.

### Evertune

**Pricing:** $3,000+/mo (sales-led)

**AI models:** Multiple (foundational + consumer app responses)

**Founded:** 2024 | **Funding:** $19M

Evertune approaches AI visibility from a data science angle. Founded by former Trade Desk executives, they run 1M+ custom prompts per brand per month and layer in a 25M-person consumer panel (EverPanel) to correlate AI visibility with actual purchasing behavior.

Their AI Brand Index tracks both foundational model knowledge (what the AI was trained on) and consumer-facing app responses (what users actually see). That distinction matters because the two can diverge significantly.

**Strengths:** Scale of testing (1M+ prompts). Consumer panel data connects AI visibility to real purchase behavior.
Strong analytical depth for data-driven teams.

**Limitations:** Enterprise-only pricing. Sales-led, no self-serve. Overkill for most B2B SaaS teams. Consumer/retail-oriented.

**Best for:** Large consumer brands (CPG, retail, financial services) that need the data science depth to connect AI visibility to revenue.

### Conductor

**Pricing:** Enterprise (custom)

**AI models:** Multiple

Conductor has been an enterprise SEO platform for years and is building aggressively into AI visibility with AgentStack, a platform for building LLM-powered apps and MCP integrations. Their 2026 AEO/GEO Benchmarks Report analyzed 3.3 billion sessions across 13,000+ domains.

Their approach integrates AI visibility insights directly into the content creation workflow, which is useful for enterprise content teams that need the path from "we have low AI visibility" to "here's the content brief to fix it" inside a single tool.

**Strengths:** Content workflow integration. Large-scale benchmark data. Established enterprise sales motion.

**Limitations:** Enterprise pricing. Less suited for teams that want standalone AI visibility monitoring without a full content platform. Not designed for B2B competitive benchmarking.

**Best for:** Enterprise content teams already in the Conductor ecosystem who want AI visibility added to their existing workflow.

---

## Tier 2: Funded pure-play platforms

### Peec AI

**Pricing:** ~$100/mo (Starter, 100 prompts) to ~$505/mo (Enterprise). Adding models like Claude costs extra.

**AI models:** ChatGPT, Perplexity, Gemini, Microsoft Copilot, Google AI Mode, AI Overviews, DeepSeek, Qwen (8)

**Founded:** 2024 (Berlin) | **Funding:** $29M total

Peec is the fastest-growing mid-market player. They grew from launch to 1,300+ brands in under a year, with customers including n8n, ElevenLabs, Chanel, TUI, and Wix. Their Berlin headquarters gives them a strong European presence, and they support multi-country tracking.

Their data research is genuinely differentiated.
Studies like "Top domains cited by AI search: 30M sources" and "The Listicle Rank Effect: 200K AI Responses" provide original insights that most competitors simply don't produce. Their distinction between "used" (source influenced the answer) and "cited" (source was footnoted) shows a more nuanced understanding of source influence than most tools.

**Strengths:** Fast time-to-value. Strong data research. Real-time daily tracking of custom prompts. Good mid-market pricing. 8 AI models on the platform.

**Limitations:** Adding individual AI models costs extra on lower tiers. No built-in content creation. Accuracy analysis is limited compared to dedicated benchmarking approaches.

**Best for:** Mid-market marketing teams wanting a quick-start AI visibility platform with solid model coverage and strong data.

### Gauge

**Pricing:** $100/mo (Starter, ChatGPT daily) to $599/mo (Growth, all models, 18 articles/mo)

**AI models:** ChatGPT, Claude, Gemini, Perplexity, Copilot, AI Mode, AI Overviews (7)

**Founded:** 2024 (San Francisco) | **Funding:** YC S24, $500K pre-seed

Gauge is the most product-forward startup in this space. Their case studies are extraordinary: PostHog saw 41x LLM-referred traffic growth, Braintrust went from 2.5% to 45% AI visibility, Vellum went from 1.4% to 40.3%. Those numbers suggest their methodology works.

What makes Gauge unique is the Content Engine: the platform doesn't just monitor your AI visibility, it generates AI-optimized articles designed to improve your citation rates. They also track Reddit citations and ChatGPT Ads, both of which are increasingly important for AI visibility.

**Strengths:** Content Engine bundles monitoring with optimization. Strong startup customer base (PostHog, Supabase, Sourcegraph). Excellent case study metrics. Reddit and ChatGPT Ads tracking.

**Limitations:** Content generation is automated, which raises quality questions for brands that care about editorial voice. Starter plan is ChatGPT-only.
Still early stage (YC pre-seed).

**Best for:** Startups and growth-stage companies that want monitoring and content creation in one platform, and are comfortable with AI-generated content.

### Scrunch AI (Sitecore)

**Pricing:** $250/mo (Core) to custom Enterprise. Agency plans from $500/mo.

**AI models:** 4 on Core (ChatGPT, Perplexity, Claude, Gemini), 9 on Enterprise

**Founded:** 2024 | **Acquired by Sitecore for $225M (June 2026)**

Scrunch's key innovation is the Agent Experience Platform (AXP): a system that serves machine-readable, compressed versions of your website to AI crawlers while showing humans the normal website. The idea is that AI agents need different content than human browsers, and serving both from the same URL is suboptimal.

The June 2026 Sitecore acquisition changes the competitive dynamics. Scrunch's technology will likely become part of Sitecore's Digital Experience Platform, which gives enterprise DXP customers a native AI visibility layer but may limit Scrunch's availability as a standalone product.

**Strengths:** Unique AXP technology. Strong thought leadership content. SOC 2 Type II. Enterprise-ready governance (RBAC, SSO). Now backed by Sitecore's enterprise distribution.

**Limitations:** Acquisition uncertainty. Core plan limited to 4 models. Price point higher than direct competitors. AXP approach is novel but unproven at scale.

**Best for:** Enterprise teams invested in the Sitecore ecosystem, or companies interested in the dual-rendering approach to AI crawler optimization.

### AthenaHQ

**Pricing:** $295/mo (self-serve)

**AI models:** 8+, including Amazon Rufus

**Founded by:** ex-Google/DeepMind engineers

AthenaHQ's differentiator is their Action Center: structured GEO optimization workflows that guide you from insight to content fix. They also track Amazon Rufus, which matters for e-commerce brands selling through Amazon's ecosystem.
Their QVEM (Query Volume Estimation Model) attempts to estimate how many real users are asking the prompts they track, which addresses the "are we tracking the right prompts?" problem that plagues all AI visibility tools.

**Strengths:** Amazon Rufus tracking. Structured optimization workflows. Query volume estimation. Self-serve pricing.

**Limitations:** No free trial. Higher entry price than most mid-market tools. E-commerce oriented.

**Best for:** E-commerce and consumer brands, particularly those selling through Amazon.

---

## Tier 3: SEO platforms with AI add-ons

### Semrush AI Visibility Toolkit

**Pricing:** $99/mo add-on to existing Semrush plans

**AI models:** ChatGPT, Perplexity, Google AI Overviews, Gemini, AI Mode (5)

**Status:** Now owned by Adobe (acquisition completed April 2026)

Semrush is the most familiar name in this space for SEO professionals, and their AI Visibility Toolkit leverages that familiarity. The value proposition is simple: if you already use Semrush for SEO, adding AI visibility data costs $99/mo and lives in the same dashboard.

They've built an AI Visibility Score (0-100), Prompt Research tools, and Narrative Drivers analysis. Their public AI Visibility Index provides free benchmark data for major brands. On G2, they briefly took the #1 AEO spot in Spring 2026, partly by leveraging their massive existing review base.

**Strengths:** Integrated with the largest SEO toolset. Familiar UX for SEO teams. Keyword data combined with AI visibility data. Public benchmark index.

**Limitations:** AI visibility is a bolt-on, not the core product. 5 models is below average for dedicated tools. Depth of AI-specific analysis is shallower than purpose-built platforms. The Adobe acquisition introduces enterprise CX stack implications.

**Best for:** Teams already paying for Semrush who want AI visibility without adding another vendor.

### Ahrefs Brand Radar

**Pricing:** Included with Ahrefs subscription

**AI models:** ChatGPT, Perplexity, Gemini, Copilot, AI Overviews, AI Mode (6)

Brand Radar takes a fundamentally different approach than most tools in this space. Instead of tracking custom prompts in real-time, Ahrefs built a static dataset of 350M+ search-backed prompts derived from People Also Ask questions and other real search data. They update this dataset monthly and run it across 6 AI models.

The advantage: zero setup. You get AI Share of Voice data immediately for any brand in their index, with cited domains and cited pages reports. The disadvantage: you can't track custom prompts or get daily updates. It's a monthly snapshot, not a real-time monitor.

**Strengths:** 350M+ prompt dataset. Zero setup. Cited domains/pages reports. Included with existing Ahrefs subscription. Strongest methodology documentation of any SEO add-on.

**Limitations:** Static monthly dataset, not real-time. No custom prompt tracking. No optimization tools or content recommendations. Read-only data.

**Best for:** Ahrefs users who want a broad AI visibility snapshot without managing custom prompts.

### HubSpot AEO

**Pricing:** $50/mo standalone (no HubSpot plan required)

**AI models:** ChatGPT, Gemini, Perplexity (3)

HubSpot launched AEO in April 2026 and immediately had a distribution advantage that no startup can match: millions of existing HubSpot users, a free AEO Grader diagnostic tool, and $50/mo standalone pricing that undercuts everyone.

Their AEO Grader scores your AI presence across 5 dimensions: Sentiment (40 points), Presence Quality (20), Brand Recognition (20), Share of Voice (10), and Market Competition (10). Beta customers reported 20% more AI traffic. HubSpot's own AEO strategy reportedly produced an 1,850% increase in qualified leads.

The CRM integration is the unique advantage.
HubSpot can use your CRM data to suggest which prompts and topics matter most for your pipeline, connecting AI visibility to revenue attribution in a way standalone tools can't.

**Strengths:** Cheapest standalone option. CRM data integration. Free AEO Grader for lead gen. No HubSpot subscription required.

**Limitations:** Only 3 AI models. Limited depth compared to dedicated platforms. Scoring methodology is surface-level. New product with a thin feature set.

**Best for:** Teams wanting the cheapest entry point, or HubSpot users who value the CRM integration.

### Surfer SEO AI Tracker

**Pricing:** $95/mo add-on to existing Surfer plans ($49-299/mo base)

**AI models:** ChatGPT, Perplexity, Google AI Overviews + 2 others (5). Does not track Claude, Copilot, Grok, Meta AI, or DeepSeek.

Surfer SEO is a content optimization tool first and an AI visibility tracker second. The AI Tracker add-on provides citation tracking, share of voice, and brand mention monitoring for 25 prompts.

The strength here is that Surfer's core product (the Content Editor with NLP scoring) can directly improve the content that AI models extract from. The weakness is that the AI tracking itself is basic compared to dedicated tools.

**Strengths:** Content optimization is the core product, so the path from monitoring to fixing is short. NLP-scored content editor.

**Limitations:** AI tracking is a paid add-on with limited prompt count (25). Missing major models (Claude, Copilot). Not competitive with dedicated AI visibility platforms on monitoring depth.

**Best for:** Surfer users who want basic AI visibility data alongside their content optimization workflow.

---

## Tier 4: Budget and niche tools

### LLM Pulse

**Pricing:** ~$50-300/mo

**AI models:** 5

LLM Pulse is a budget-friendly alternative to Peec AI, with a Chrome Extension that captures real AI prompts from your browsing.
This is clever: instead of guessing what prompts to track, you can monitor the prompts your team or customers actually use.

**Best for:** Solo marketers or small teams on tight budgets.

### Bluefish AI

**Pricing:** Custom (enterprise only, targeting $1B+ revenue companies)

**AI models:** Multiple

Bluefish operates at the other end of the market. Their AI Brand Vault lets enterprises control how AI systems access their brand information, and their AI Commerce features target shopping assistant optimization. Fortune 500 only, $5M+ digital marketing spend required.

**Best for:** The largest enterprises with dedicated AI brand teams.

### Other tools worth noting

The long tail of this category is enormous. **Omnia** ($86-303/mo) offers daily 24-hour tracking with a step-by-step visibility roadmap. **Rankscale** tracks 17+ AI engines from a single dashboard. **Sight AI** bundles visibility tracking with 13+ specialized AI content-writing agents. Each has a niche; none has the depth of the platforms above.

---

## KnitKnot

**Pricing:** Contact

**AI models:** ChatGPT, Claude, Perplexity, Gemini (4)

We built KnitKnot because we think the AI visibility market has a blind spot: it measures presence but not accuracy. Every tool above can tell you whether you're mentioned in AI responses. Very few can tell you whether what AI says about you is true.

KnitKnot is an [AI Presence Management](/learn/what-is-ai-presence-management) platform. We run adversarial evaluation prompts across AI models, extract every claim the AI makes about your company, and verify each one against ground truth. [The scoring is deterministic](/blog/we-stopped-asking-ai-who-wins), composed from structured judge signals rather than a single holistic verdict, and every point in your score traces back to a specific claim, a specific response, and a specific fix. To date the pipeline has scored 11,600 head-to-head evaluations across 136 competitors.

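To make the idea of a score "composed from structured judge signals" concrete, here is a minimal sketch of how such a composition could work. It is an illustration only, not KnitKnot's implementation: the `JudgeSignal` type, the three signal names, and the point weights are all hypothetical, and the only properties it tries to show are the two named above (same input gives the same score, and every point traces back to a claim and a response).

```python
from dataclasses import dataclass

# Hypothetical signal weights (points out of 100); real weights are not published here.
WEIGHTS = {"recommendation": 40, "feature_comparison": 30, "claim_accuracy": 30}


@dataclass(frozen=True)
class JudgeSignal:
    claim_id: str      # which extracted claim the judge looked at
    response_id: str   # which AI response contained the claim
    signal: str        # one of the WEIGHTS keys; other names are ignored
    value: float       # structured judge output, normalized to 0.0-1.0


def score_response(signals):
    """Compose a 0-100 score plus a per-claim breakdown from structured signals."""
    grouped = {}
    for s in signals:
        grouped.setdefault(s.signal, []).append(s)

    total = 0.0
    breakdown = []
    # Fixed iteration order keeps the result independent of input order.
    for name in sorted(WEIGHTS):
        group = sorted(grouped.get(name, []), key=lambda s: s.claim_id)
        if not group:
            continue
        share = WEIGHTS[name] / len(group)  # points available per claim
        for s in group:
            points = share * s.value
            total += points
            breakdown.append((s.claim_id, s.response_id, name, points))
    return round(total, 2), breakdown


signals = [
    JudgeSignal("c1", "r1", "recommendation", 1.0),
    JudgeSignal("c2", "r1", "feature_comparison", 0.5),
    JudgeSignal("c3", "r1", "claim_accuracy", 1.0),
    JudgeSignal("c4", "r1", "claim_accuracy", 0.0),  # a misrepresented claim
]
score, breakdown = score_response(signals)
print(score)  # 70.0
for row in breakdown:
    print(row)
```

Because the total is just the sum of the breakdown rows, a lost point can be followed back to the claim that lost it (here `c4`), which is the property a single holistic verdict cannot offer.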
What this means in practice: instead of "your AI visibility score is 62," you get "ChatGPT incorrectly states you don't support SOC 2 in 4 out of 7 comparison queries, your competitor's blog post is the high-gravity source driving recommendations in your category, and you're winning 75% of feature comparisons but losing on pricing accuracy because the AI is quoting your 2024 pricing." Misrepresentations come with proof receipts: the exact source that contradicts the wrong claim.

Two capabilities are rare elsewhere in this guide. First, the measurement loop: prompts persist across runs, so re-running a benchmark after you ship fixes produces run-over-run deltas (mention changes, score trends per engine), not just a new snapshot. Second, the MCP server at mcp.knitknot.ai: around 40 tools that let you run benchmarks, pull score trends and mention rollups, list misrepresentations, and publish reports from Claude, ChatGPT, or any MCP client. Your AI presence data becomes queryable from inside the AI assistants themselves.

Prompts are generated from real Google search data with search volume attached per prompt, layered across features and buyer personas. A brand and each of its product lines get their own prompt library and report. Reports are publicly shareable, and every headline number drills down to the underlying responses.

**Strengths:** Claim-level accuracy analysis with proof receipts. Keyword-grounded adversarial prompts. Deterministic scoring with full decomposition. Competitive intelligence (feature win/loss, source influence, deep competitor profiles with evidence links). MCP server. Run-over-run measurement. Built for the B2B evaluation use case.

**Limitations:** Fewer AI models than the broadest platforms. Not a real-time daily monitor. Focused on B2B evaluation queries, not consumer/e-commerce.

**Best for:** B2B companies that need to know not just whether AI mentions them, but whether what AI says is accurate and how it compares them to specific competitors.

---

## How to choose

The right tool depends on three things: what problem you're solving, how much you're willing to spend, and where you are in the AI visibility journey.

**If you just want to know whether you appear in AI responses:** Start with HubSpot's free AEO Grader or Ahrefs Brand Radar (if you already have Ahrefs). Both are free or included and give you a baseline without adding a new vendor.

**If you want ongoing monitoring across multiple models:** Otterly ($29/mo) is the lowest-cost dedicated platform. Peec AI (~$100/mo) gives better model coverage and data. Gauge ($100/mo) adds content generation. All three are good mid-market options.

**If you need enterprise-grade monitoring with integrations:** Profound is the market leader, with the broadest model coverage and deepest integration ecosystem. Conductor makes sense if you're already in their SEO platform. Scrunch/Sitecore if you're in the DXP ecosystem.

**If you need to know what AI says about you and whether it's accurate:** That's the problem we built KnitKnot to solve. Most visibility tools count mentions. We decompose claims, verify facts, and score accuracy. If a buyer asks ChatGPT "compare you vs your competitor" and the AI gets three facts wrong, a visibility tool will tell you that you were mentioned. We'll tell you which facts were wrong, how confident the AI was about them, and what content to publish to correct the record. Then re-benchmark and read the delta.

**If you want your AI presence data inside your AI assistant:** KnitKnot's MCP server (mcp.knitknot.ai) exposes about 40 tools to Claude, ChatGPT, or any MCP client: run benchmarks, pull score trends, list misrepresentations, publish reports. A few other platforms list MCP integrations (Peec AI, Trakkr, Ayzeo); most tools in this guide have none.

**If you need content creation:** Gauge is the only platform with a full content engine built in. Surfer SEO pairs content optimization with basic AI tracking. Most other tools are monitoring-only and leave content creation to you.

## What most tools don't measure (and why it matters)

There's a pattern in this market: nearly every tool measures presence. Very few measure accuracy. And the gap between the two is where the revenue impact lives.

Consider a scenario where you have a high AI visibility score. You appear in 80% of relevant AI responses. Your share of voice is strong. Your citation count is growing. By every visibility metric, you're doing well.

But in those responses, ChatGPT is confidently stating that your product starts at $299/month when you changed pricing to $149 eight months ago. Claude is attributing your competitor's API-first architecture to you and your event-driven architecture to them. Perplexity is recommending your competitor for the exact use case where you win 75% of head-to-head feature comparisons.

A visibility tool would report this as success. You're mentioned. Your share of voice is high. But every one of those responses is sending buyers toward your competitor or setting expectations your sales team has to correct on the first call.

72% of brands have at least one factual error in AI responses about them. 43% of consumers report making purchasing decisions based on false AI-generated information. 69% of B2B buyers changed which vendor they chose based on AI chatbot guidance. The accuracy of what AI says matters more than whether AI says it.

This is the gap that [AI Presence Management](/learn/what-is-ai-presence-management) fills. Visibility tracking is necessary. But accuracy benchmarking is what moves revenue.

## Frequently asked questions

### What is an AI visibility tool?

An AI visibility tool monitors how your brand appears in AI-generated responses from platforms like ChatGPT, Claude, Perplexity, and Gemini.
These tools typically track mention frequency, share of voice, citation sources, and sentiment across AI platforms. They help marketing teams understand whether AI models recommend their brand when buyers ask evaluation questions.

### How much do AI visibility tools cost?

Pricing ranges from free (HubSpot AEO Grader, Ahrefs Brand Radar with subscription) to $50/month (HubSpot standalone) to $100-600/month (mid-market platforms like Peec AI, Gauge, Otterly) to $2,000-5,000+/month (enterprise platforms like Profound, Evertune). SEO platform add-ons like Semrush AI Toolkit cost $99/month on top of the base subscription.

### Which AI models should I track?

At minimum, ChatGPT and Perplexity. ChatGPT has 883 million monthly users and uses Bing's index. Perplexity is growing fast and heavily cites Reddit content. For a more complete picture, add Claude (uses Brave Search, different citation patterns), Gemini, and Google AI Overviews (48% of Google queries now trigger them). Each model has different training data and synthesis patterns, so cross-model coverage matters.

### What's the difference between AI visibility tools and traditional SEO tools?

Traditional SEO tools (Ahrefs, Semrush, Moz) measure your ranking in search engine results pages. AI visibility tools measure how you appear in AI-generated answers, which are increasingly replacing traditional search results. 58.5% of Google searches are now zero-click, and 83% of AI query interactions never leave the chat. The two are complementary: strong SEO helps AI models find your content, while AI visibility tools measure what happens after AI synthesizes that content into an answer.

### Can AI visibility tools actually improve my AI presence?

Monitoring tools (most of this list) tell you where you stand but leave the fixing to you. Content-generating tools (Gauge, Surfer SEO) help you create optimized content.
Accuracy benchmarking tools (KnitKnot) identify specific factual errors and source influence problems that, once fixed, improve how AI represents you. The monitoring alone is useful for tracking trends, but the path from insight to improvement usually involves creating or updating content that directly addresses the gaps the tool surfaces.

### How many prompts should I track?

This depends on your category complexity and competitive landscape. 50-100 prompts covers the core evaluation questions for most B2B companies with 2-3 competitors. Enterprise brands in crowded categories might need 500+. The important thing is that the prompts reflect real buyer questions, not just brand mentions. "Compare Acme vs Widgetly for enterprise compliance" is more valuable to track than "tell me about Acme."

### Should I use a standalone AI visibility tool or an SEO add-on?

If you already use Semrush, Ahrefs, or Surfer SEO and you want basic AI visibility data, the add-on is the fastest path. If AI visibility is a strategic priority and you need custom prompt tracking, competitive analysis, and optimization recommendations, a dedicated platform will go deeper. The add-ons are better for awareness ("am I even in the conversation?"). The dedicated tools are better for action ("what specifically do I need to fix?").

### How quickly can I expect results from AI visibility optimization?

Perplexity pulls from live search results and can reflect content changes within days to weeks. Google AI Overviews follow Google's index, so updates can appear relatively quickly. ChatGPT updates its training data and Bing index periodically, with changes taking weeks to months. Claude's update schedule is less predictable. A realistic timeline for meaningful AI presence improvement is 4-12 weeks from publishing optimized content, depending on the model and the authority of your domain.

### Is AI visibility tracking the same as AEO?
AI visibility tracking is one component of AEO (Answer Engine Optimization). AEO is the broader discipline of optimizing your content and digital presence to appear favorably in AI-generated answers. It includes content optimization, structured data implementation, source authority building, and ongoing monitoring. AI visibility tools handle the monitoring piece. Full AEO also requires content strategy and technical SEO work that most monitoring tools don't directly provide. ### Which AI visibility tools have an MCP server? KnitKnot runs an MCP server at mcp.knitknot.ai with around 40 tools: connect Claude, ChatGPT, or any MCP client to your workspace and run benchmarks, pull score trends and mention rollups, list misrepresentations, manage the prompt library, and publish reports. Among the other tools in this guide, Peec AI, Trakkr, and Ayzeo list MCP integrations; most platforms offer dashboards and REST APIs only. MCP support matters if your team works inside AI assistants and wants presence data on demand rather than in a separate dashboard. ### Do I need a different tool for each AI model? No. Every tool in this guide tracks multiple AI models from a single dashboard. The differences are in coverage: Profound tracks 10+ models, most mid-market tools track 5-8, and some entry-level tools only track ChatGPT. Choose a tool that covers the models your buyers actually use. For B2B, ChatGPT and Perplexity are the most common starting points. For consumer/e-commerce, add Google AI Overviews and Gemini. --- # AI Presence Management glossary > Definitions of every term in AI Presence Management: AI Presence Score, coverage, visibility rate, win rate, head-to-head outcome, source gravity, misrepresentation, measurement loop, MCP server, and more. Each definition is concise, specific, and written for practitioners. 
- Author: Max Wiesner - Published: 2026-04-28 - Canonical: https://knitknot.ai/learn/ai-presence-glossary/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## Core terms ### AI Presence Management The practice of benchmarking, monitoring, and improving how AI models represent your company when buyers ask evaluation questions. Covers factual accuracy, competitive positioning, recommendation rates, source influence, and sentiment across AI platforms. [Full guide](/learn/what-is-ai-presence-management). ### AI Presence Score A 0-100 composite metric quantifying how well AI models represent your company, written at scoring time for every evaluated response. Built from seven structured signals: recommendation, feature comparisons, claim accuracy, sentiment, source influence, coverage depth, and confidence weighting. The decomposition into components is what makes it actionable. [Methodology](/learn/what-is-ai-presence-score). ### AEO (Answer Engine Optimization) The discipline focused on getting your content cited by AI answer engines. Measures mention frequency, share of voice, citation count, and sentiment. AEO tells you whether you appear in AI responses. AI Presence Management tells you whether what AI says is accurate. [AEO vs APM](/learn/ai-presence-management-vs-aeo-vs-geo-vs-seo). ### GEO (Generative Engine Optimization) The practice of structuring content so AI models can extract and cite it effectively. Focuses on the input side: page structure, schema markup, answer placement, content format. Originated from a Princeton study (ACM KDD 2024) that quantified which content attributes improve AI citation rates. ### Zero-click search A search where the user gets the answer without clicking through to any website. 58.5% of Google searches are zero-click. AI chat interactions are nearly always zero-click by design: the buyer reads the answer and never visits a vendor site. 
### Measurement loop The core operating cycle of AI Presence Management: benchmark how AI represents you, diagnose which sources and gaps produced each answer, fix the sources, then re-measure and compare against the previous run. A persistent prompt library makes the loop repeatable: the same prompts run each time, so deltas in mentions, citations, and scores are attributable to the changes you shipped rather than to a shifting question set. ### MCP server A server exposing a product's data and actions as tools that AI assistants can call directly via the Model Context Protocol. KnitKnot's MCP server (mcp.knitknot.ai) exposes roughly 40 tools, so customers can connect Claude, ChatGPT, or any MCP client to their workspace and run benchmarks, query score trends, pull competitive overviews, and manage the prompt library from inside the assistant itself. Your AI presence data becomes something your AI agents can monitor and act on. ## Scoring and metric terms ### Coverage How prominently a company appears in a single AI response. A categorical rating assigned per evaluation: primary, substantial, peripheral, incidental, or absent. Coverage is a per-response label, not a rate; the rate built from it is the visibility rate. ### Visibility rate The share of scored responses where coverage is not absent, expressed as a fraction. Computed organic-only: prompts that name the company directly are excluded, so the rate measures whether AI brings you up unprompted in category and comparison questions. ### Sentiment How favorably the AI's response treats the company, scored 0-100 per response and aggregated on the same scale. Distinct from accuracy: a response can be warm and wrong, or cold and correct. ### Head-to-head outcome The result of a competitive AI evaluation: `we_win` (the AI recommended you), `competitor_wins` (it recommended them), `tie` (it declined to choose), or `not_compared` (no direct comparison was made). 
Outcomes are derived deterministically from the judge's structured signals, not from a single holistic "who won" judgment. Our published benchmark corpus covers 11,600 head-to-head evaluations across 136 competitors. ### Win rate (W-L-T) Wins divided by decided outcomes, where decided outcomes include ties in the denominator. Wins, losses, and ties are tallied separately; `not_compared` evaluations are excluded entirely. A stricter variant, the decisive win rate (wins divided by wins plus losses, ties excluded), appears in our [engine divergence analysis](/blog/why-ai-recommends-your-competitor), where the aggregate across engines was 70.7%. ### Claim accuracy The fraction of the AI's factual claims about your company that are correct. Misrepresentations are severity-weighted: a critical error about core positioning counts five times more than a minor tone issue. [Scoring methodology](/blog/we-stopped-asking-ai-who-wins). ### Confidence weighting An adjustment to claim accuracy based on how confidently the AI stated a false claim. False claims stated with certainty carry a 1.3x penalty. Hedged claims carry a 0.9x discount. Each response's hedging language is classified as certain, tentative, or uncertain. [Confident lies](/blog/confident-lies-are-worse-than-hedged-ones). ### Misrepresentation A factual claim an AI model makes about your company that contradicts the verified record. In KnitKnot reports, each misrepresentation carries a proof receipt: the exact knowledge-base source that contradicts the AI's claim, shown alongside the claim itself. The receipt turns "the AI is wrong" from an assertion into an auditable finding. ### Persona The buyer role an AI response is implicitly written for, inferred per response and canonicalized to six standard roles. Persona matters because the same comparison question gets a different answer for a compliance lead than for a startup founder, and benchmark coverage should span both. 
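
As a rough illustration of the win rate and decisive win rate definitions in this section, here is a minimal Python sketch. The outcome labels (`we_win`, `competitor_wins`, `tie`, `not_compared`) come from the head-to-head outcome definition; the function name `win_rates` and the shape of the returned dictionary are hypothetical and are not KnitKnot's implementation.

```python
from collections import Counter


def win_rates(outcomes):
    """Tally head-to-head outcomes and compute both win-rate variants.

    `outcomes` is an iterable of labels: "we_win", "competitor_wins",
    "tie", or "not_compared". The function name and return shape are
    illustrative assumptions.
    """
    counts = Counter(outcomes)
    wins = counts["we_win"]
    losses = counts["competitor_wins"]
    ties = counts["tie"]

    # not_compared evaluations are excluded entirely from both denominators.
    decided = wins + losses + ties   # win rate: ties stay in the denominator
    decisive = wins + losses         # decisive win rate: ties excluded

    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "win_rate": wins / decided if decided else None,
        "decisive_win_rate": wins / decisive if decisive else None,
    }


# 3 wins, 1 loss, 1 tie, 2 not compared
sample = ["we_win"] * 3 + ["competitor_wins", "tie", "not_compared", "not_compared"]
result = win_rates(sample)
print(result["win_rate"])           # 0.6  (3 wins / 5 decided)
print(result["decisive_win_rate"])  # 0.75 (3 wins / 4 decisive)
```

Returning `None` when there are no decided outcomes is a design choice of this sketch: it keeps "no data" distinct from a 0% win rate.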
## Source and citation terms ### Citation A URL or source that an AI model references in its response. Not all models display citations visibly. Perplexity and Claude show sources. ChatGPT sometimes does. The number of citations matters less than which citations shaped the recommendation (source gravity). ### Source balance The ratio of your cited sources to competitor cited sources in an AI response. A source balance of 0.2 means the AI built its answer primarily from competitor content. [Source influence](/blog/citations-are-ownership-claims). ### Source gravity A measure of how much influence a cited source had over the substance of an AI response. High-gravity sources shaped the recommendation and framing. Low-gravity sources appeared as footnotes. [Source ownership model](/blog/citations-are-ownership-claims). ### Owned sources The classification of every cited domain as yours, a competitor's, or third-party. KnitKnot tracks ownership per workspace with subdomain matching, and re-classifies past citations when ownership changes, so source balance and source-leak analysis stay consistent over time. ### E-E-A-T Experience, Expertise, Authoritativeness, Trustworthiness. Google's quality framework, increasingly used by AI models to determine source credibility. Named authors, visible dates, inline citations, and brand mentions are the key signals. ## Engine-specific terms ### Bing index The search index ChatGPT uses for web-connected queries. Submitting your sitemap to Bing Webmaster Tools directly affects ChatGPT's access to your content. ### Brave Search The search engine Claude uses for web lookups, publicly documented by Anthropic. Different index from Bing and Google, which means the same content may rank differently per engine. [How we approximate the Claude engine](/blog/approximating-the-claude-engine). ### AI Overviews Google's AI-generated answers that appear at the top of search results, synthesized from crawled pages with citation links. 
Visibility in AI Overviews depends on extractable, directly-answering content rather than traditional ranking alone. ### AI Mode Google's dedicated AI chat interface (limited rollout). Nearly all AI Mode queries are answered within the AI response without the user visiting any website, making it an almost entirely zero-click surface. ## Benchmark terms ### Adversarial benchmarking Testing AI models with the hardest questions buyers actually ask, not just brand queries. "What are the drawbacks of [your product]?" "Why would I choose [competitor]?" These prompts surface errors that standard monitoring misses. [Prompt methodology](/blog/rebuilding-prompt-generation). ### Prompt library The persistent set of benchmark prompts a company is evaluated against. Prompts are generated keyword-first from real Google search data (with search volume attached per prompt), layered across features and buyer personas, and persist across runs: each benchmark run snapshots the library, so results are comparable run over run. A brand and each of its product lines maintain separate libraries. ### Spot test A single benchmark prompt run on demand, outside a full benchmark run. Useful for quickly checking how an engine answers one specific question after a content change, before committing to a full re-measure. ### Grounding tier The evidence level for why a benchmark prompt was generated. **Direct**: maps to a real Google query with measurable search volume. **Category**: no direct query but category-level searches exist. **Synthesized**: AI-generated to test an uncovered edge. ### Cross-model disagreement The rate at which different AI models produce different competitive outcomes for the same prompt. In our data: 48.6%. [Engine divergence](/blog/why-ai-recommends-your-competitor). ### Absence rate The fraction of brand perception evaluations where the AI doesn't mention the company at all. Overall: 26.6%. Per engine: Perplexity 38.8%, ChatGPT 26.3%, Claude 23.9%, Gemini 17.7%. 
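
The two rates defined at the end of this glossary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the tuple layout `(prompt_id, engine, outcome)`, the function names, and the choice to count a prompt as "disagreeing" whenever its engines returned more than one distinct outcome are all hypothetical. How `not_compared` results or single-engine prompts are treated in the published figures is not specified here.

```python
from collections import defaultdict


def cross_model_disagreement(evaluations):
    """Share of prompts where engines produced different outcomes.

    `evaluations` is an iterable of (prompt_id, engine, outcome) tuples.
    Assumption: a prompt disagrees if it has more than one distinct outcome.
    """
    outcomes_by_prompt = defaultdict(set)
    for prompt_id, _engine, outcome in evaluations:
        outcomes_by_prompt[prompt_id].add(outcome)
    if not outcomes_by_prompt:
        return None
    disagreeing = sum(1 for o in outcomes_by_prompt.values() if len(o) > 1)
    return disagreeing / len(outcomes_by_prompt)


def absence_rate(coverages):
    """Fraction of evaluations whose coverage label is "absent"."""
    coverages = list(coverages)
    if not coverages:
        return None
    return sum(1 for c in coverages if c == "absent") / len(coverages)


evaluations = [
    ("p1", "chatgpt", "we_win"),
    ("p1", "claude", "competitor_wins"),  # engines split on p1
    ("p2", "chatgpt", "tie"),
    ("p2", "claude", "tie"),              # engines agree on p2
]
print(cross_model_disagreement(evaluations))                        # 0.5
print(absence_rate(["primary", "absent", "absent", "peripheral"]))  # 0.5
```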
--- # How B2B buyers use AI to evaluate software vendors > 51% of B2B software buyers now start research with an AI chatbot. The evaluation happens before you know the deal exists, the shortlist is shorter than Google's, and the AI is working from sources you may not control. - Author: Max Wiesner - Published: 2026-04-21 - Canonical: https://knitknot.ai/learn/how-b2b-buyers-use-ai/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## Where do B2B buyers evaluate software now? Inside an AI chat window, before the vendor knows the deal exists. 51% of B2B software buyers now start their research with an AI chatbot rather than Google, and 69% of them change which vendor they choose based on what the AI tells them. When Kevin [wrote about starting KnitKnot](/blog/introducing-knitknot), he described something simple: he'd stopped comparing tools on Google. Every time he needed auth, a vector DB, or transactional email, he asked Claude. Then he signed up for whatever it told him. No demo. No sales call. He asked around. Most engineers were doing the same thing. So were founders. The buying conversation that used to happen on a sales call was happening in a single prompt, before the vendor knew it existed. That observation became the company. But at the time, we had anecdotal evidence and a gut feeling. Nine months later, the data caught up. ## How many B2B buyers start research with AI? 51%, according to G2's early-2026 B2B buyer behavior report, which quantified what we'd been seeing anecdotally. Three numbers stood out. **51% of B2B software buyers now start their research with an AI chatbot rather than Google.** Not "sometimes use AI." Start with it. AI is the first touch in the evaluation, not a supplementary tool. **69% of those buyers changed which vendor they ultimately chose based on what the AI told them.** The AI didn't just confirm their existing preference. It shifted it. 
For more than two-thirds of buyers, the AI's synthesis was influential enough to change the outcome. **55% say AI reduced the total number of vendors they evaluated.** The shortlist got shorter. AI doesn't return ten blue links. It returns an answer with four to seven named vendors. If you're not in that set, you don't get evaluated at all. These numbers describe a structural change, not a trend. The evaluation has moved from a multi-touch, multi-week research process into a compressed interaction that happens in minutes, in a chat window you have no visibility into. ## How does an AI vendor evaluation work? The buyer asks a comparison question, the AI synthesizes an answer from a handful of sources, and the buyer acts on it without verifying the claims. Our benchmark dataset, which includes 11,600 head-to-head evaluations across 136 competitors on ChatGPT, Claude, Perplexity, and Gemini, shows the process is roughly the same across models. **The buyer asks a comparison question.** Not "tell me about Acme." Something adversarial and specific: "Compare Acme and Widgetly for enterprise compliance automation." Or "What are the best tools for SOC 2 compliance for a Series B startup?" The query implies a decision. The buyer wants a recommendation, not a list. **The AI synthesizes from multiple sources.** The response isn't generated from a single database. The model pulls from its training data (a snapshot in time), live web search results (if the model supports them), and cached knowledge of publicly available content. It cites three to seven sources, usually a mix of vendor websites, third-party reviews, comparison articles, and community forums. **The AI produces a structured response.** A typical comparison response has a brief introduction, a feature-by-feature breakdown, a pricing comparison if the data is available, a discussion of strengths and limitations for each vendor, and a recommendation. The tone is authoritative. 
The response reads like it was written by an analyst who has done deep research. **The buyer acts on it.** 69% change vendors based on this. They might ask a follow-up question. They might visit the recommended vendor's website. They might skip straight to a sign-up page. What they almost never do is verify the factual claims in the response against each vendor's actual website. The AI said it, so it must be true. ## What patterns show up in AI evaluation benchmarks? Three patterns emerge when you analyze how AI handles B2B evaluation queries at scale: recommendations are inconsistent across models, the comparison framework comes from whoever published comparison content first, and factual errors compound into confident wrong recommendations. ### The recommendation is inconsistent across models The same comparison query asked to four models often produces four different recommendations. We see this regularly in head-to-head benchmarks. ChatGPT recommends Vendor A. Claude recommends Vendor B. Perplexity presents a balanced comparison. Gemini doesn't mention one of the vendors at all. This happens because each model has different training data, different web search indices, and different citation preferences. ChatGPT uses Bing's index and tends to favor well-established sources with high domain authority. Perplexity cites Reddit at disproportionate rates. Claude uses Brave Search and has different weighting for content recency. For the buyer, this means the recommendation they get depends on which AI they happened to open. For the vendor, it means a single-model monitoring strategy misses most of the picture. ### The comparison framework comes from whoever published first When the AI compares two vendors, it needs an evaluation framework: which dimensions to compare on, what features to highlight, how to structure the analysis. That framework almost always comes from existing comparison content. 
If Vendor A published a detailed "Us vs Vendor B" page, and Vendor B didn't publish anything, the AI's comparison framework mirrors Vendor A's page. Vendor A's strengths become the evaluation criteria. Vendor B's strengths might not get mentioned because they weren't in the source material. We measure this through [source gravity](/blog/citations-are-ownership-claims). In a significant fraction of competitive evaluations, the highest-influence source is one vendor's comparison page. The AI isn't generating a neutral analysis. It's synthesizing one vendor's competitive positioning into an authoritative-sounding answer. This is the single most actionable finding for B2B companies. You don't need better AI optimization. You need a comparison page that answers the buyer's exact question from your frame, structured so AI can extract it. ### Factual errors compound across the evaluation AI doesn't get one thing wrong. It gets several things subtly wrong, and the errors reinforce each other to produce a recommendation that feels well-reasoned but is built on incorrect premises. A typical cascade: the AI quotes your old pricing (stale data), says the competitor supports a feature you actually support (misattribution), and frames the comparison around dimensions where the competitor has published more content (narrative control). Each error individually might not change the recommendation. Together, they produce a confident recommendation for the competitor that a buyer would have no reason to question. This is why [we decompose the score](/blog/we-stopped-asking-ai-who-wins) instead of asking the AI for a holistic rating. A single "how well is this company represented?" question hides whether the problem is pricing, features, source influence, or visibility. The decomposition tells you which error type is driving the outcome. ## What does AI buying behavior mean for your funnel? 
AI collapses the consideration and evaluation stages into a single interaction that happens off your property. The traditional B2B marketing funnel assumes the buyer moves through stages you can see: awareness, consideration, evaluation, purchase. At each stage, you have content, touchpoints, and data. The buyer doesn't visit your blog to learn about the problem. They don't download your comparison guide. They don't attend your webinar. They ask the AI, and the AI synthesizes an answer from whatever sources it has access to. This has two implications. **First, the content that matters most is content AI can extract and cite.** Not the content that's best for human readers. Not the content that converts the best on your website. The content that directly answers a comparison query in a structured, extractable format. That might be a comparison page, a feature matrix, an FAQ, or a pricing table with plain text (not just images). **Second, you need visibility into what AI says before the buyer does.** By the time a prospect shows up on your website, they've already read the AI's evaluation. If the AI got something wrong, the prospect arrives with wrong expectations. If the AI recommended a competitor, the prospect might never arrive at all. The feedback loop for AI-driven evaluations doesn't exist in your analytics unless you build it. ## Is it too late to optimize for AI evaluations? No. The AI evaluation landscape is still forming, and most B2B companies haven't started measuring how AI represents them. The companies that benchmark their AI presence now and systematically fix the errors have a compounding advantage: every correction improves how the AI represents them in the next round of buyer queries, which improves the next round of corrections. Five brands capture 80% of AI recommendations in any given category. The brands that establish accurate, well-sourced AI presences in the next 12 months will be the five. 
The ones that wait will be competing for the remaining 20%. The data is clear on timing. AI-referred traffic already converts at 4.4x the rate of standard organic, and visitors spend 68% more time on site. The channel is smaller than Google today. It won't be for long. And unlike SEO, where you can invest later and still catch up, AI's winner-take-all dynamics mean that early accuracy advantages lock in. --- # AI Presence Management vs AEO vs GEO vs SEO > Four acronyms, four different problems. SEO gets you ranked. GEO gets you cited. AEO gets you mentioned. AI Presence Management tells you whether what AI says about you is actually true. - Author: Max Wiesner - Published: 2026-04-17 - Canonical: https://knitknot.ai/learn/ai-presence-management-vs-aeo-vs-geo-vs-seo/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## What's the difference between SEO, AEO, GEO, and AI Presence Management? The short version: SEO gets your content ranked, GEO gets it cited, AEO gets you mentioned in AI answers, and AI Presence Management verifies whether what AI says about you is accurate and competitive. They are layers in one stack, not competing alternatives. The terminology soup is actively unhelpful. Google published a guide in May 2026 arguing that AEO and GEO are "still SEO." Some vendors use AEO and GEO interchangeably. Others treat them as distinct disciplines with different tooling. Meanwhile, "AI visibility" has become a catch-all that means different things depending on who's selling it. Here's how we think about the distinctions, based on what we've seen building a benchmarking pipeline that touches all four. ## What does SEO still do for AI visibility? SEO gets your content into the indices that AI models draw from. Optimize your content so search engines rank it higher: target keywords, build backlinks, improve page speed, structure your data. The metric is position on a results page. The user clicks a link and visits your website. 
SEO still matters. Strong SEO fundamentals improve how AI models perceive your content. Authoritative backlinks, clean structured data, and high-quality content are signals that both search engines and AI models use when deciding what to cite. A page that ranks well on Google is more likely to appear in an AI response than a page that doesn't. But SEO was designed for a world where the user clicks through to your site. That world is shrinking. 58.5% of Google searches are now zero-click. In Google's AI Mode, the zero-click rate hits 93%. The user gets an answer and never visits any of the ranked pages. SEO gets you onto the list that AI models draw from. It doesn't control what the AI does with your content once it gets there. ## What is GEO (Generative Engine Optimization)? GEO is the input side of the AI visibility equation: structuring your content so AI models can extract and cite it effectively. The term comes from a 2024 Princeton paper published at ACM KDD that studied which content attributes increase citation rates in AI-generated responses. Their findings were specific. Adding statistics to content improved AI visibility by 41%. Adding source citations improved it by 115% for lower-ranked pages. Question-based headings produced a 2.8x citation lift. Direct answers in the first 40-60 words of a page significantly increased extraction rates. GEO is tactical. It tells you how to write pages, how to use schema markup, how to structure FAQ content, how to make your pricing extractable instead of buried in an interactive widget. The output is content that AI models are more likely to pull from when constructing a response. The limitation is that GEO focuses on the input, not the output. You can have perfectly GEO-optimized pages and still lose evaluations because the AI is citing your competitor's GEO-optimized pages instead. GEO makes your content extractable. It doesn't guarantee the AI will extract from you rather than from someone else. 
And it has nothing to say about whether the AI's extraction is accurate. ## What is AEO (Answer Engine Optimization)? AEO is the discipline focused on getting your brand mentioned in AI-generated responses. Where GEO is about content structure, AEO is about the outcome: are you in the AI's answer when someone asks about your category? AEO tools (and there are now [248 on G2](/learn/best-ai-visibility-tools)) typically track mention frequency, share of voice, citation count, and sentiment across AI platforms. They tell you how often you appear, what the AI's tone is when it mentions you, and which sources get cited alongside you. The metric is visibility. The question AEO answers is: "Am I in the room?" That's necessary information. If AI doesn't mention you at all, nothing else matters. But AEO stops at the mention. It counts whether you appeared. It doesn't analyze what was said about you when you did. A tool can report that you were mentioned in 73% of relevant AI responses with positive sentiment, and that can be true while the AI is simultaneously getting your pricing wrong, attributing your core feature to a competitor, and recommending the competitor based on their content rather than yours. High AEO visibility with low accuracy is not a signal of health. It's a signal of exposure. The more often the AI mentions you, the more often [it says something wrong about you](/blog/ai-is-lying-about-your-company) to a real buyer. ## What does AI Presence Management add? AI Presence Management is the layer that verifies what AI actually says. It addresses the output, not the input. Not "are you mentioned" but "what gets said when you are." The distinction is best understood through a concrete example. 
A company can have:

- Strong SEO (ranks #1-3 for core terms)
- Good GEO (pages structured for extraction, FAQ schema, statistics every 200 words)
- High AEO visibility (mentioned in 80% of relevant AI responses)
- And still lose every competitive evaluation because the AI is quoting last year's pricing, attributing a competitor's feature launch to them, and building its recommendation from the competitor's comparison page

Everything below the AI Presence Management layer is working. The content is findable, extractable, and cited. The problem is what the AI does with it: the synthesis, the comparison, the recommendation, the factual accuracy of specific claims.

AI Presence Management adds three things the other disciplines don't cover.

**Claim-level accuracy.** Decomposing the AI's response into individual factual claims and verifying each one. Not "the sentiment was positive" but "the AI stated your product starts at $299/month, which has been wrong since January." [We do this with structured signal extraction](/blog/we-stopped-asking-ai-who-wins) rather than asking another model to rate the response, and every flagged misrepresentation carries the exact source that contradicts it.

**Competitive positioning analysis.** Measuring how the AI positions you relative to specific competitors. Feature win/loss ratios, recommendation rates, the comparison framework the AI used, and whether that framework came from your content or your competitor's. The [source gravity model](/blog/citations-are-ownership-claims) shows which pages controlled the answer.

**Adversarial benchmarking.** Testing AI with the prompts buyers actually type, including the hard ones. "What are the drawbacks of [your product]?" "Why would I choose [competitor] over [you]?" "Is [your product] worth the price?" These aren't soft prompts. They're the ones that surface the errors that cost deals.
We ground every prompt in [real search behavior](/blog/rebuilding-prompt-generation): each prompt library is built from Google search data, with search volume attached per prompt, so the benchmark reflects questions buyers actually ask.

## How do SEO, GEO, AEO, and AI Presence Management fit together?

They're layers in a stack, not competing alternatives. Each layer depends on the ones below it, and the sequence matters.

| Layer | Question it answers | What it measures | What it misses |
| --- | --- | --- | --- |
| SEO | Can AI find my content? | Search ranking, backlinks, domain authority | Whether AI uses the content, and how |
| GEO | Can AI extract from my content? | Content structure, schema markup, extractability | Whether AI chose your content over a competitor's |
| AEO | Am I mentioned in AI answers? | Mention frequency, share of voice, sentiment | Whether what AI says about you is accurate |
| AI Presence Management | Is what AI says about me true? | Claim accuracy, competitive positioning, source influence, recommendations | Requires the lower layers to be functional first |

There's no point measuring claim accuracy (AI Presence Management) if the AI doesn't mention you at all (AEO). There's no point optimizing for AI citations (AEO) if your content isn't structured for extraction (GEO). There's no point structuring content for AI (GEO) if search engines can't find it in the first place (SEO). But the order of diagnostic value is reversed. The most revenue-relevant question is at the top: what is AI telling buyers about me, and is it true? That's the question where errors directly cost deals. A stale pricing claim loses more revenue than a missing schema tag. ## Which one do you need? If you have no AI visibility at all, start with SEO and GEO. Build the content infrastructure that makes your pages findable and extractable. This is table stakes. If you're visible in AI responses but don't know what AI is saying, you need AI Presence Management. Benchmark the actual responses. Verify the claims. Find out whether your visibility is an asset or a liability. If you're visible and accurate but want to improve your share of AI recommendations, AEO tools will help you track and optimize mention frequency, sentiment, and share of voice. Most B2B companies we work with discover that they're in the middle state: they're visible enough that AI mentions them, but they've never checked what AI actually says. They assume presence is positive. [It often isn't](/blog/ai-is-lying-about-your-company). ## Is Google right that AEO and GEO are "still SEO"? Google's May 2026 position that AEO and GEO are "still SEO" is technically correct and practically misleading. The technical foundation is shared: structured data, authoritative content, crawlability. If you do SEO well, you're doing some of AEO and GEO by default. But the measurement and optimization loops are different. SEO measures position on a results page. AEO measures mention in a generated response. AI Presence Management measures the accuracy of specific claims within that response. 
You can't use Search Console to find out that ChatGPT is quoting your 2024 pricing. You can't use a keyword tracker to discover that Claude is attributing your competitor's feature to you. The tools are different because the problems are different. Google is right that the principles overlap. But the same content can rank #1 on Google, get cited by ChatGPT, and still lose the buyer because the AI synthesized a confident, wrong recommendation from an outdated source. SEO is the foundation. It is not the whole building. ## Frequently asked questions ### Do I need separate tools for SEO, AEO, GEO, and AI Presence Management? Not necessarily. SEO tools (Ahrefs, Semrush) now include AI visibility features. GEO is primarily a content practice, not a tool category. AEO is covered by dedicated platforms (Profound, Peec AI, Otterly, Gauge) and SEO add-ons. AI Presence Management, which adds accuracy benchmarking and competitive analysis, is the layer where dedicated tooling matters most because it requires structured evaluation infrastructure that SEO and AEO tools don't provide. ### Which should I prioritize? Diagnose before you optimize. Most companies start with AEO or GEO because those are the loudest categories in the market. But if you don't know whether AI is saying accurate things about you, optimizing for more visibility amplifies the problem. Benchmark your AI presence first. Fix factual errors. Then optimize for broader visibility. ### Is AEO the same as GEO? They overlap but address different sides. GEO is about making your content AI-friendly (input optimization). AEO is about appearing in AI answers (output measurement). You can have great GEO and low AEO if your competitor's content outranks yours in the AI's source ranking. You can have high AEO and poor GEO if you appear in answers despite unstructured content, which usually means the AI is citing someone else's content about you. ### Will AI Presence Management replace SEO? No. 
SEO remains the foundation that makes content discoverable to both search engines and AI models. AI Presence Management is an additional layer that addresses the new problem of AI-synthesized answers. The two are complementary. Many of the content fixes that improve AI Presence, like clearer competitive positioning and structured pricing data, also improve traditional SEO.

### What does "adversarial benchmarking" add that AEO monitoring doesn't?

AEO monitoring tracks how AI mentions you across a set of prompts, usually brand and category queries. Adversarial benchmarking tests AI with the hardest questions buyers ask: "what are the drawbacks," "why would I choose the competitor," "is it worth the price." These prompts surface errors and competitive framing that brand-monitoring prompts never trigger. The hard questions are where deals are won and lost.

---

# What is an AI Presence Score?

> The AI Presence Score is a 0-100 metric that measures how well AI models represent your company to buyers. It's built from seven structured signals, not a single model judgment. Here's how it works, what the numbers mean, and why the decomposition matters more than the headline.

- Author: Max Wiesner
- Published: 2026-04-10
- Canonical: https://knitknot.ai/learn/what-is-ai-presence-score/
- Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai)

---

## What is an AI Presence Score?

The AI Presence Score is a 0-100 composite metric that quantifies how well AI models represent your company when buyers ask evaluation questions. Higher is better. A score of 80 means AI is generally accurate, favorable, and grounded in your content. A score of 35 means AI is getting things wrong, recommending competitors, or citing sources that work against you.

The score is useful as a headline. But the headline isn't the product.
The product is the decomposition: the seven structured signals that add up to the number and tell you exactly what's working, what's broken, and what to fix first. We built it this way because [the alternative didn't work](/blog/we-stopped-asking-ai-who-wins). When we started, we asked GPT to rate each AI response on a 0-100 scale. It gave us numbers. We put them in charts. But when a company asked why they scored 52, we couldn't answer. The model had compressed recommendation quality, factual accuracy, sentiment, source influence, and feature coverage into a single judgment with no trace. Two runs on the same response came back with different numbers. The score was a vibe check, not a measurement. So we stopped asking for one number and started extracting seven. ## What are the seven signals behind the score? Coverage, recommendation outcome, feature comparisons, claim accuracy, sentiment, source influence, and confidence markers. Each signal captures a different dimension of how AI treated your company in a specific response, and each is extracted by an LLM judge doing semantic evaluation of the response, not keyword matching. The weighted composite across signals is the AI Presence Score for that evaluation. **Coverage** measures how prominently you appear in the response, on a five-level scale: primary, substantial, peripheral, incidental, or absent. A glowing assessment buried in a list entry is not the same as being the subject of the response. Absent coverage zeroes out everything else, because a recommendation nobody reads has no impact. **Recommendation outcome** is a win, loss, tie, or not-compared verdict for each head-to-head matchup in the response. The verdict is composed deterministically from the judge's structured signals (who was recommended, on what basis, with what framing), not from a single holistic "who won" judgment. 
The canonical win rate is wins divided by decided outcomes, with ties counted in the denominator of the overall tally, so the definition is the same everywhere it appears. **Feature comparisons** produce a per-feature verdict. For every feature the AI compared, did you win, lose, or tie against the named competitor? Feature win rates aggregate from these verdicts, so a feature-level loss is always traceable to the specific responses where the AI handed that feature to the competitor. **Claim accuracy** tracks misrepresentations: factual claims the AI made about you that contradict your actual product, pricing, or positioning. Each one carries a proof receipt, the exact knowledge-base source that contradicts what the AI said. **Sentiment** captures the overall framing on a 0-100 scale. This catches something the other signals don't. The AI can recommend you, get every claim right, and still frame you dismissively. The difference between "Acme is a solid choice" and "You could try Acme, I guess" is a sentiment signal, not a factual one. **Source influence** measures whose content the AI's cited sources belong to: your domains, competitor domains, or third parties. If the AI built its answer from five competitor blog posts and zero of your pages, the response was synthesized from competitor content. Ownership classification is explicit and per-workspace, so a subdomain you own counts for you. **Confidence markers** record the AI's conviction level: certain, tentative, or uncertain. A buyer who reads "Acme definitely does not support SOC 2" walks away with a different impression than one who reads "I'm not entirely sure about Acme's SOC 2 status." [Confident misinformation is more damaging](/blog/confident-lies-are-worse-than-hedged-ones) than hedged truth is reassuring, and the score weights it that way. Two more things happen at scoring time. 
Every response runs through a universal visibility floor (mention extraction and competitor/feature classification) plus specialty extractors selected per response, such as the competitive-comparison extractor. And the derived results are written down as facts at scoring time, so the same inputs produce the same score every time. No temperature, no prompt sensitivity, no "run it again and hope for the best." ## How does the score differ from coverage and visibility rate? They are three distinct metrics at three grains, and they are not synonyms. **Coverage** is a per-response category: how prominently you appeared in one answer (primary through absent). **Visibility rate** is an aggregate: the share of scored responses where coverage isn't absent, computed on organic prompts only. Prompts that name your company are excluded from the calculation, because showing up in a question about yourself proves nothing. **AI Presence Score** is the 0-100 composite across all seven signals. A company can have a high visibility rate and a mediocre score: AI mentions them everywhere but gets the facts wrong when it does. ## What do AI Presence Score ranges mean? Roughly: 80+ means AI is working for you, 60-79 means present with specific issues, and below 50 means AI responses are hurting your competitive position. We've computed scores across our full benchmark dataset, which includes 11,600 head-to-head evaluations across 136 competitors. The distribution tells you where a given score falls relative to that dataset.

| Score range | % of evaluations | What it typically means |
|---|---|---|
| 80-100 | 38.9% | AI is recommending you, claims are accurate, sources are balanced. The response is working for you. |
| 60-79 | 22.1% | Mentioned with mostly accurate information. May have a weak recommendation, a missing feature, or a source imbalance dragging the score down. |
| 40-59 | 17.3% | Problematic. Usually a combination of factors: competitor recommended, one or two factual errors, source mix skewed toward competitor content. Actionable fixes exist. |
| 20-39 | 11.1% | AI is actively working against you. Multiple errors, competitor framing, or peripheral/absent coverage. Every buyer who gets this response is being steered away. |
| 0-19 | 10.6% | The AI either doesn't know you exist or has fundamentally wrong information. Absent from the response, or present with critical misrepresentations. |

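
The bands in the table above can be turned into a small lookup for triaging scores. This is a minimal sketch, not KnitKnot's implementation: the function name `score_band` and the short labels are hypothetical paraphrases of the table, and treating a fractional score such as 79.5 as part of the lower band is an assumption.

```python
from bisect import bisect_right

# Lower bound of each band from the table, in ascending order.
BAND_FLOORS = [0, 20, 40, 60, 80]

# Short labels paraphrasing the table's descriptions (hypothetical wording).
BAND_LABELS = [
    "unknown or fundamentally wrong",
    "working against you",
    "problematic",
    "mostly accurate, specific issues",
    "working for you",
]


def score_band(score: float) -> str:
    """Return the band label for a 0-100 AI Presence Score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in [0, 100], got {score}")
    # bisect_right counts how many floors are <= score; minus 1 gives the index.
    return BAND_LABELS[bisect_right(BAND_FLOORS, score) - 1]
```

For example, `score_band(42)` falls in the 40-59 band and returns `"problematic"`.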
The distribution is skewed positive: 61% of evaluations score 60 or above. But the tail is heavy. 28.6% of evaluations score below 50. Nearly 1 in 3 AI interactions about your company is producing a response that hurts you. That's not an edge case. It's a structural fraction of buyer research. ## Why do scores vary by AI engine? Each engine has different training data, a different search index, and different citation preferences, so the same company gets different scores from different models. [This is consistent across our dataset](/blog/why-ai-recommends-your-competitor): Gemini produces the highest average AI Presence Score (67.5), Perplexity the lowest (62.7). The spread is 4.8 points, which sounds small until you realize it compounds across hundreds of buyer evaluations. The per-engine score matters because it tells you where your worst problem is. A company with a blended score of 65 might have a Gemini score of 72 and a Perplexity score of 55. The blended number suggests "room for improvement." The per-engine number says "Perplexity is an emergency." Per-engine trend snapshots are written with every run, using the same formulas as the dashboards, so the trend chart and the headline number can never disagree. ## Why does the decomposition matter more than the headline number? Because the decomposition is the diagnosis. A score of 42 means nothing by itself. A score of 42 where feature comparisons are strong but claim accuracy is weak tells you: "AI thinks your features are strong but it's recommending the competitor because it has outdated facts about your product. Fix the facts, and the recommendation probably flips." The decomposition also makes it possible to diff across runs. Score dropped 8 points this month? The signal breakdown tells you it was a recommendation swing, not a sentiment change. Score went up but you didn't do anything? A misrepresentation got corrected in the model's training data. Every movement is traceable to a specific signal. 
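
The run-to-run diff described above can be sketched as a per-signal comparison. This is an illustration under stated assumptions: the seven signal names come from the article, but representing each signal as a single 0-100 number, the function name `diff_runs`, and the sample values are all hypothetical. The actual signal weighting is not described here.

```python
# Signal names are taken from the article. Treating each one as a 0-100
# number is an assumption made only for this sketch.
SIGNALS = (
    "coverage",
    "recommendation_outcome",
    "feature_comparisons",
    "claim_accuracy",
    "sentiment",
    "source_influence",
    "confidence_markers",
)


def diff_runs(previous, current):
    """Return (signal, delta) pairs, largest absolute movement first."""
    deltas = [(name, current[name] - previous[name]) for name in SIGNALS]
    return sorted(deltas, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical values for two monthly runs.
last_month = {name: 60.0 for name in SIGNALS}
this_month = {**last_month, "recommendation_outcome": 40.0, "sentiment": 58.0}

for name, delta in diff_runs(last_month, this_month)[:2]:
    print(f"{name}: {delta:+.1f}")
# recommendation_outcome: -20.0
# sentiment: -2.0
```

Sorting by absolute change puts the recommendation swing ahead of the small sentiment shift, which is the distinction the signal breakdown is meant to surface.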
And every number is auditable. Each headline metric drills down to the exact underlying AI responses, and the drill-down list is the same row set the number was computed from. If someone asks why they scored 42, the answer should never be "the model felt that way." The answer should be: here are the three claims that were wrong (with the contradicting sources attached), here are the features you lost on, here is the competitor page that shaped the recommendation. Three different problems, three different fixes, none of them "make your product better." ## How does the score relate to win/loss outcomes? The AI Presence Score and the competitive outcome (win/loss/tie) measure different things. The score captures the full quality of the AI's representation: accuracy, sentiment, sources, coverage. The competitive outcome captures whether you got the recommendation. They correlate, but they're not the same. A company can win the recommendation and still have a mediocre score if the AI got several facts wrong along the way. That's a fragile win: the next query might flip the recommendation if the facts get slightly worse. A company can lose the recommendation and have a decent score on features and accuracy, but low source balance. That means the AI knows your product is strong but is citing competitor content as its primary source. The fix is publishing the comparison page that shifts the source balance, not changing the product. The decomposition tells you which scenario you're in. The overall score doesn't. ## Frequently asked questions ### How is the AI Presence Score different from an AI visibility score? Visibility measures whether you appear; the AI Presence Score measures the quality of what AI says when you do. The precise visibility metric is a visibility rate: the share of responses where your coverage isn't absent, computed on organic prompts only (prompts that name your company are excluded). 
The AI Presence Score adds factual accuracy, recommendation direction, source influence, sentiment, and confidence. A high visibility rate with a low AI Presence Score means AI mentions you often but gets things wrong when it does. ### Why seven signals instead of one holistic judgment? Because one judgment hides the diagnosis. When a model says you scored 52, you can't tell whether the problem is pricing, features, sources, or sentiment. The seven signals tell you exactly what's wrong and in what order to fix it. Different error types have different fixes, and the decomposition matches errors to remediation. ### Does the score change over time? Yes. AI models update their training data and search indices at different cadences. A pricing correction on your website might improve your ChatGPT score within weeks but take months to affect Claude. We track scores over time through periodic benchmarks so companies can see whether fixes are landing and which models are updating fastest. ### What's a good AI Presence Score? Based on the distribution across our benchmark dataset: 80+ puts you in the top 39% where AI is actively working for you. 60-79 is the middle ground where you're present but have specific issues to address. Below 50 (which 28.6% of evaluations fall into) means AI is producing responses that hurt your competitive position. ### Can I improve my score without changing my product? Almost always. The fixes are content-level, not product-level. Publishing a comparison page, updating pricing, adding structured data, creating documentation that directly answers buyer evaluation questions. The score measures how AI represents you, and representation is a function of the source material available to the AI, not the quality of the underlying product. --- # What is AI Presence Management? > AI Presence Management is the practice of benchmarking, monitoring, and improving how AI models represent your company to buyers. 
It goes beyond visibility tracking to measure accuracy, competitive positioning, and source influence across ChatGPT, Claude, Perplexity, and Gemini. - Author: Max Wiesner - Published: 2026-04-08 - Canonical: https://knitknot.ai/learn/what-is-ai-presence-management/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## What is AI Presence Management? AI Presence Management is the practice of benchmarking, monitoring, and improving how AI models represent your company when buyers ask evaluation questions. It covers four things: what AI says about you, whether it's accurate, how you compare to competitors in AI responses, and which sources are shaping the narrative. The discipline spans ChatGPT, Claude, Perplexity, Gemini, and Google AI Overviews. If SEO is about ranking on a page, AI Presence Management is about controlling what gets said when there is no page. When a buyer asks an AI "compare Acme and Widgetly for enterprise compliance," the AI doesn't return ten blue links. It returns an answer. One synthesized narrative, built from sources the buyer never sees, reflecting facts the AI may or may not have gotten right. AI Presence Management is the discipline of making sure that answer is accurate, favorable, and grounded in your content rather than your competitor's. ## Why does AI Presence Management exist now? Three shifts happened at once, and their intersection created a new problem category that existing tools don't cover. **Shift 1: Buyers moved to AI.** 51% of B2B software buyers now start their research with an AI chatbot rather than Google, according to [G2's 2026 buyer behavior report](https://company.g2.com). More importantly, 69% of those buyers changed which vendor they chose based on what the AI told them. The evaluation is happening inside the chat window, and most companies have zero visibility into it. **Shift 2: AI answers replaced search results.** Zero-click searches hit 58.5% on Google. In AI Mode, it's 93%. 
The buyer gets an answer and acts on it without visiting your website, reading your case studies, or talking to your sales team. The AI response is the first impression, the product comparison, and the shortlist, compressed into a single interaction. **Shift 3: AI gets things wrong.** This is the part most "AI visibility" tools ignore. 72% of brands have at least one factual error in AI-generated responses about them. Wrong pricing, fabricated founding dates, features attributed to the wrong company, competitive positioning based on two-year-old blog posts. When a buyer asks ChatGPT to compare you to a competitor and the response confidently states your product lacks a feature it has had for 18 months, that's not a visibility problem. That's a factual accuracy problem that's costing you deals you never knew existed. Each of these shifts has spawned its own partial solution. SEO addresses search ranking. AEO and GEO address optimization for AI citations. Brand monitoring tools track mentions. But none of them answer the complete question: *what exactly does AI tell buyers about my company, is it accurate, and how do I fix what's wrong?* That's the gap AI Presence Management fills. ## What does AI Presence Management cover? The discipline has four layers: benchmarking what AI currently says, auditing it for accuracy, analyzing competitive positioning, and tracing which sources shape the answers. Most tools in the market cover one or two of them. A complete AI Presence Management practice covers all four. ### Layer 1: Benchmarking Before you optimize anything, you need to know what AI currently says about you. Not in the abstract. In response to the specific questions your buyers actually ask. Benchmarking means running adversarial evaluation prompts across multiple AI models and analyzing the responses at the claim level. "Compare [your company] vs [competitor] for [use case]." "What are the drawbacks of [your product]?" "Which [category] tool is best for [persona]?" 
These are the prompts that determine whether a buyer puts you on the shortlist or moves on. The prompts should be grounded in real search behavior, not brainstormed: we build each prompt library around actual Google search data, with search volume attached per prompt, layered across features and buyer personas. The output isn't a visibility score. It's a structured analysis of every response: what was recommended, what features were compared, what claims were made, which were accurate, what sources were cited, and what the overall sentiment was. This is where most of the actionable signal lives. ### Layer 2: Accuracy auditing Visibility without accuracy is dangerous. Being mentioned in every AI response doesn't help if the AI is telling buyers your product doesn't support a feature it ships, or quoting pricing from two years ago, or attributing your competitor's latest launch to you. Accuracy auditing decomposes AI responses into individual claims and verifies them against ground truth. Did the AI get your pricing right? Did it correctly describe your feature set? Did it attribute the right capabilities to the right company? When it got something wrong, how confidently did it state the falsehood? Done properly, every flagged misrepresentation carries a proof receipt: the exact knowledge-base source that contradicts what the AI said, so the error is verifiable rather than asserted. That last question matters more than it seems. [A false claim stated with certainty](/blog/confident-lies-are-worse-than-hedged-ones) ("Acme does not support SOC 2") lands differently with a buyer than a hedged one ("I'm not entirely sure about Acme's SOC 2 status"). The damage scales with the AI's conviction, not just the error itself. ### Layer 3: Competitive intelligence AI doesn't evaluate your company in isolation. Every buyer question is implicitly or explicitly comparative. "Compare X and Y." "What's the best tool for Z?" "What are the alternatives to W?" 
The competitive intelligence layer analyzes how you perform relative to specific competitors across AI responses. Feature win/loss ratios. Recommendation rates. Sentiment differentials. Source influence patterns. Which competitor's content is shaping the AI's narrative about your category? This is fundamentally different from traditional competitive intelligence, which monitors what competitors say about themselves. AI Presence Management monitors what a neutral third party (the AI) says about both of you, synthesized from the public information ecosystem. It's the closest thing to eavesdropping on a buyer's internal evaluation process. ### Layer 4: Source influence When an AI cites five sources in a response, not all of them contributed equally. One source might have shaped the recommendation. Another might have supplied a background statistic nobody acted on. [Source gravity](/blog/citations-are-ownership-claims) measures which sources actually influenced the answer, and whether those sources belong to you, your competitor, or a third party. If your competitor's blog post is the high-gravity source driving recommendations in your category, that's a specific, actionable problem. It tells you exactly which page to write, what question to answer, and whose framing to compete with. ## How is AI Presence Management different from AEO, GEO, and SEO? AEO and GEO are about getting mentioned and cited; AI Presence Management is about what gets said once you are, and whether it's accurate. Here's how the disciplines relate to each other and where AI Presence Management fits.

| Discipline | Core question | What it measures | Limitation |
|---|---|---|---|
| SEO | Do I rank on Google? | Position, traffic, clicks | Doesn't cover AI-generated answers |
| AEO | Am I cited in AI answers? | Citation rate, visibility score | Counts mentions, not accuracy |
| GEO | How do I optimize content for AI? | Content structure, schema markup, extractability | Focuses on input (content), not output (what AI says) |
| AI brand monitoring | Where am I mentioned? | Mention frequency, share of voice, sentiment | Tracks presence, not what's being said |
| AI Presence Management | What does AI tell buyers about me, and is it right? | Accuracy, competitive positioning, source influence, recommendations | Requires structured evaluation infrastructure |

You can have perfect AI visibility and still lose deals because the AI confidently tells buyers your product doesn't support the one feature they care about most. AEO asks *"am I in the room?"* AI Presence Management asks *"what am I saying in the room, and is any of it wrong?"*

These aren't competing disciplines. You need all of them. But the sequence matters. There's no point optimizing for AI citations (GEO) if the AI is going to cite wrong information. Benchmark accuracy first. Fix factual errors. Then optimize for visibility.

## How does the AI Presence Score work?

The AI Presence Score is a 0-100 composite that quantifies how well AI models represent your company. It's not a single number from a single model judgment. It's built from structured signals extracted from every AI response by an LLM judge doing semantic evaluation, not keyword matching.

[We decompose every AI response](/blog/we-stopped-asking-ai-who-wins) into structured signals rather than asking a model "how well is this company represented?" Each signal captures a different dimension of how the AI treated your company:

1. **Coverage.** How prominently you appear in the response: primary, substantial, peripheral, incidental, or absent.
2. **Recommendation outcome.** Win, loss, or tie against the competitor, derived deterministically from structured judge signals rather than a single holistic "who won" judgment.
3. **Feature comparisons.** For every feature the AI compared, a per-feature win, loss, or tie verdict.
4. **Claim accuracy.** Which of the AI's factual claims about you were wrong, each with the contradicting source attached.
5. **Sentiment.** The overall framing, scored 0-100.
6. **Source influence.** Were the answer's cited sources yours, your competitor's, or third-party?
7. **Confidence markers.** Did the AI state claims with certainty, tentatively, or with explicit uncertainty? Confident falsehoods are worse than hedged ones.

Three related metrics are worth keeping distinct.
**Coverage** describes one response. **Visibility rate** is the share of responses where coverage isn't absent (computed on organic prompts only, excluding prompts that name your company, because being mentioned in a prompt about yourself proves nothing). The **AI Presence Score** is the 0-100 composite across all the signals. The important part isn't the number. It's the decomposition. A score of 42 means nothing by itself. A score of 42 where features are strong but claim accuracy is weak tells you: *"AI models think your features are strong but they're recommending the competitor because they have outdated facts about your product. Fix the facts, and the recommendation probably flips."* That's the difference between a vanity metric and a diagnostic. And every headline number should drill down to the exact underlying AI responses: the list you click into is the same set of evaluations the number was computed from. Auditable, reproducible numbers, no marketing math. ## What does an AI presence benchmark reveal? A first benchmark typically surfaces three things you didn't know: specific fixable factual errors, large disagreements between models, and competitor content quietly driving the narrative. **The factual errors are specific and fixable.** Not "AI doesn't know about us" but "ChatGPT thinks our product starts at $299/month when we changed pricing to $149 eight months ago" or "Claude attributes our competitor's API-first architecture to us and our event-driven architecture to them." These are discrete, verifiable claims that produce discrete, actionable fixes. **Different models tell different stories.** ChatGPT might recommend you. Claude might recommend your competitor. Perplexity might present a balanced comparison that favors neither. Gemini might not know you exist. Each model has different training data, different citation preferences, and different synthesis patterns. A single-model view is incomplete. 
**Your competitor's content is driving the narrative.** This is the finding that surprises people most. When AI recommends a competitor, it's often not because the competitor's product is better. It's because the competitor has content that directly answers the buyer's question in a format AI can extract and synthesize. Their comparison page, their feature breakdown, their case study is the high-gravity source. Your product docs are background noise.

## How does the measurement loop work?

AI Presence Management is a cycle, not a one-time audit: benchmark, diagnose, fix, re-measure. The loop works like this:

![The AI Presence measurement loop: benchmark buyer prompts across engines, diagnose the highest-impact gaps, ship targeted content fixes, then re-run the same prompts to verify the changes landed](/images/learn/measurement-loop.svg)

**1. Benchmark.** Run adversarial evaluation prompts across models. Get the structured decomposition for every response. Identify exactly where you're winning, where you're losing, and why.

**2. Diagnose.** Prioritize by impact. A false claim about your core differentiator stated with high confidence across multiple models is more urgent than a minor sentiment issue on a low-traffic prompt. The decomposition gives you the priority ranking automatically.

**3. Fix.** The fixes are usually content-level, not product-level. Write the comparison page that answers the buyer's question from your frame. Update the pricing page the AI is training on. Publish the case study that demonstrates the capability the AI says you lack. Structure it for AI extraction: direct answer first, structured data, FAQ schema.

**4. Re-measure.** Run the same prompts again. Because the prompt library persists across runs, every re-run is directly comparable to the last one, and the diff is explicit: mention deltas, entity shifts, citation impact, and the score trend per engine. Did the recommendation flip? Did claim accuracy improve?
Did your sources gain gravity? If the score moved, the fix worked. If it didn't, the AI hasn't re-indexed your content yet, or the fix wasn't targeted enough. For a quick check on a single question, a spot test runs one prompt on demand without a full benchmark. The cadence depends on how aggressive your competitors are. For actively contested categories, monthly. For stable markets, quarterly. ## Which companies need AI Presence Management most? Companies in markets where buyers routinely compare vendors through AI before making contact. That describes most of B2B software today, but the impact concentrates in categories with three properties. **Active competitive comparison.** If buyers regularly ask "compare X vs Y for [use case]," the AI is generating synthesized evaluations that directly influence shortlisting. Categories with 3-5 named competitors that show up in sales discovery calls are the highest-signal environment for AI Presence Management. **Factual complexity.** Products with nuanced feature sets, tiered pricing, and technical differentiators are more susceptible to AI misrepresentation than simple products. The more facts the AI needs to get right, the more facts it gets wrong. **High evaluation stakes.** When a single lost evaluation costs thousands in pipeline value, the ROI on fixing an AI factual error is immediate. A stale pricing claim that costs you 10 qualified evaluations a month has a clear dollar value. ## How big is the AI visibility market? The category is moving fast. G2 created a formal "Answer Engine Optimization" category in March 2025. As of mid-2026, it has 248 listings and has grown 2,000%. Over $300M in venture capital has flowed into the space. Profound reached a $1B valuation. Sitecore acquired Scrunch for $225M. Adobe acquired Semrush. HubSpot launched a standalone AEO tool. But most of these tools are visibility trackers. They tell you where you're mentioned, how often, and with what sentiment. 
That's necessary but not sufficient. Visibility without accuracy is a false sense of security. You can have a high share of voice and still be losing deals because the AI is confidently misinforming buyers about your product. AI Presence Management encompasses visibility tracking but adds the layer that actually drives revenue outcomes: accuracy, competitive positioning, and source influence. It's the difference between knowing you were in the room and knowing what you said. ## Can you audit your AI presence manually? Yes. A useful first pass takes about thirty minutes and requires no tools. The underlying method is straightforward. Open ChatGPT, Claude, Perplexity, and Gemini. Ask each of them to compare you against your top competitor for your primary use case. Read the response. Check every factual claim against your current website. Note the recommendation, the cited sources, and whether any facts are wrong. Most companies discover at least one verifiable error in the first session. The error is usually specific: wrong pricing, a feature attributed to the wrong company, a competitive framing built from the competitor's comparison page. The specificity is what makes it actionable. You're not looking at a vague sentiment score. You're looking at a claim you can verify and a source you can trace. The manual version breaks down at scale. Four prompts across four models gives you sixteen data points. A real benchmark runs hundreds of buyer evaluation prompts, decomposes every response into structured claims, and tracks changes over time. But the manual version is enough to understand the problem and decide whether it's worth measuring systematically. ## Frequently asked questions ### How is AI Presence Management different from AEO? AEO focuses on getting your content cited by AI platforms. AI Presence Management analyzes what AI actually says about you once you are cited. 
You can have high AEO visibility and still lose deals if the AI is stating incorrect facts about your product. AEO is one input to AI Presence Management, not a substitute for it. ### How is AI Presence Management different from GEO? GEO focuses on structuring content so AI models can extract and cite it effectively. It's about the input side: making your content AI-friendly. AI Presence Management focuses on the output side: what the AI's actual responses say when buyers ask evaluation questions. Both are important. GEO is a tactic within a broader AI Presence Management strategy. ### What is an AI Presence Score? A 0-100 composite metric built from structured signal extraction rather than a single model judgment. It decomposes responses into signals: coverage, recommendation outcome, per-feature win/loss/tie verdicts, claim accuracy, sentiment, source influence, and confidence markers. The decomposition is what makes it actionable. A score of 42 where features are strong but claim accuracy is low tells you something different than a score of 42 where everything is mediocre. Full breakdown in [What is an AI Presence Score?](/learn/what-is-ai-presence-score) ### Why does AI accuracy matter more than AI visibility? Being mentioned incorrectly is worse than not being mentioned at all. When AI confidently states your product lacks a feature it has, the buyer forms a wrong impression with high conviction. The confidence of the AI's response is not correlated with the accuracy of the response. That asymmetry is why accuracy auditing matters more than mention counting. ### Which AI models matter? ChatGPT (uses Bing's index), Claude (uses Brave Search), Perplexity, Gemini, and Google AI Overviews. Each has different training data and citation preferences. Cross-model coverage matters because the same buyer question asked to four models often produces four different sets of facts about the same company. When models disagree on a factual claim, at least one is wrong. 
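
The cross-model check described above (when models disagree on a factual claim, at least one is wrong) can be sketched as a grouping step. This is a minimal illustration, assuming claims have already been extracted into `(engine, claim_key, value)` tuples. That tuple shape, the function name `find_disagreements`, and the sample data are hypothetical; the pricing figures echo the $299 versus $149 example used earlier in this article.

```python
from collections import defaultdict


def find_disagreements(claims):
    """Group claim values by key and keep only keys where engines differ.

    claims: iterable of (engine, claim_key, value) tuples (hypothetical shape).
    Returns {claim_key: {value: [engines that stated it]}}.
    """
    by_claim = defaultdict(lambda: defaultdict(list))
    for engine, key, value in claims:
        by_claim[key][value].append(engine)
    # A claim is in dispute when more than one distinct value was stated.
    return {
        key: dict(values)
        for key, values in by_claim.items()
        if len(values) > 1
    }


# Hypothetical extracted claims about one company.
sample_claims = [
    ("chatgpt", "starting_price", "$299/month"),
    ("claude", "starting_price", "$149/month"),
    ("perplexity", "starting_price", "$149/month"),
    ("chatgpt", "soc2_supported", "yes"),
    ("claude", "soc2_supported", "yes"),
]

disputed = find_disagreements(sample_claims)
```

Here `disputed` contains only `starting_price`, with ChatGPT on one value and Claude and Perplexity on the other. Agreement is not proof of accuracy, so each value still needs to be checked against ground truth.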
### Can I fix what AI says about my company?

Yes, but not by contacting the AI companies. The fixes are content-level. AI models synthesize from publicly available sources, so improving how AI represents you means publishing content that directly answers buyer evaluation questions, structured for AI extraction, with accurate facts and current pricing. Structured data and high-authority sources accelerate how quickly models pick up new information.

### How long does it take for AI to update?

Perplexity pulls from live search and can reflect changes within days to weeks. Google AI Overviews follow Google's index, so updates are relatively fast. ChatGPT draws on Bing's index, which refreshes periodically, so changes can take weeks to months to appear. Claude's schedule is less predictable. The variance across models is part of why cross-model benchmarking matters.

### What's the difference between AI visibility and AI presence?

AI visibility measures whether you appear. The precise version is a visibility rate: the share of AI responses where your coverage isn't absent, computed on organic prompts only (prompts that name your company are excluded, since appearing in them proves nothing). AI presence adds what gets said when you do appear: claim accuracy, recommendation direction, source influence, competitive framing. High visibility with low accuracy is a liability, not an asset.

### Can my AI assistant query my AI presence data?

Yes, through an MCP (Model Context Protocol) server. KnitKnot runs one at mcp.knitknot.ai with around 40 tools: Claude, ChatGPT, or any MCP client can connect to your workspace, run benchmarks and spot tests, pull score trends and mention rollups, list misrepresentations, and fetch the remediation playbook. In practice this means you can ask Claude "how did my AI presence change this week?" and get an answer computed from your own benchmark data. Most AI visibility tools do not offer this.

### What does "adversarial benchmarking" mean?
Testing AI models with the hardest questions your buyers actually ask, not just favorable ones. "What are the drawbacks of [your product]?" "Why would I choose [competitor] over [you]?" These prompts surface the gaps that matter: where AI gets defensive questions wrong, where competitor framing dominates, and where factual errors have the most impact on purchase decisions. The methodology is described in detail in [how we rebuilt prompt generation](/blog/rebuilding-prompt-generation). ### What is source gravity? A measure of how much influence a cited source had over the substance of an AI response. A high-gravity source shaped the recommendation. A low-gravity source appeared as a footnote. Tracking source gravity tells you which specific pages are controlling the AI narrative in your category. Described in detail in [our source ownership model](/blog/citations-are-ownership-claims). --- # Blog posts (full text) # The 10 questions AI buyers ask that your website can't answer > We generate benchmark prompts grounded in real Google search data, with search volume attached to each one. The questions buyers ask ChatGPT, Claude, Perplexity, and Gemini are more adversarial, more specific, and more comparative than anything your website was designed to handle. Here are the ten patterns that show up most. - Author: Max Wiesner - Published: 2026-06-11 - Canonical: https://knitknot.ai/blog/ten-questions-ai-buyers-ask/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## What do buyers actually ask AI about you? Short answer: comparison questions, drawback questions, and switching questions. Almost never the questions your website was written to answer. When we [rebuilt our prompt generation pipeline](/blog/rebuilding-prompt-generation), we grounded every benchmark prompt in real Google search data, with measured search volume attached to each one. 
That gave us an empirical view of what buyers type into AI, not what marketers assume they type. The queries aren't polite. They're adversarial, comparative, and specific. Here are the ten patterns that show up most in real buyer search behavior, and why most B2B websites can't answer them. ### 1. "Compare [you] vs [competitor] for [specific use case]" This is the prompt that matters most. It implies a decision, names both vendors, and specifies the use case. Your website has a product page and maybe a comparison page. But does it answer the specific use case? "Compare Acme vs Widgetly for SOC 2 compliance at a Series B startup" is a different question than "Compare Acme vs Widgetly," and the AI answers it with whatever use-case-specific content exists. Usually that content is the competitor's. ### 2. "What are the drawbacks of [your product]?" Buyers ask about weaknesses explicitly. Your website is designed to highlight strengths. When the AI can't find a balanced assessment in your content, it builds one from reviews, Reddit threads, and competitor comparison pages. The framing of your drawbacks is written by everyone except you. ### 3. "Is [your product] worth the price?" Not "how much does it cost" but "is it worth it." This asks the AI to make a value judgment. If the AI has your old pricing and doesn't know about your recent feature launches, the value assessment is built on stale data. The result reads like a review of a product that no longer exists. ### 4. "Why would I choose [competitor] over [you]?" The hardest prompt. It asks the AI to make the case for the competitor. If the competitor has published content that answers this question (a comparison page, a migration guide), the AI cites it directly. If you haven't published your counter-argument, there is no source to balance the narrative. ### 5. "What's the best [category] tool for [persona]?" This is a shortlisting prompt. 
The AI returns an answer with [four to seven named vendors](/learn/how-b2b-buyers-use-ai), not ten blue links. If you're not in the answer, you don't get evaluated at all. The AI decides category membership from public content: reviews, comparison articles, community mentions. Companies with thin public presence get excluded before the evaluation starts. ### 6. "[Your product] vs [competitor] pricing comparison" Pricing-specific comparisons are where AI errors are most damaging, because the AI quotes numbers. If the numbers are wrong, the buyer runs budget math on false data and either disqualifies you as too expensive or arrives at a sales call anchored to a price you don't offer. Stale pricing pages and third-party pricing roundups are the usual culprits. ### 7. "Does [your product] integrate with [specific tool]?" Buyers ask about specific integrations. Your integrations page might list them, but if the page is structured as a logo grid without text, the AI can't parse it. A dedicated integrations page with plain-text descriptions of each integration is what AI needs to answer this accurately. ### 8. "What do users say about [your product]?" This prompt pulls from reviews, Reddit, and community forums. The AI synthesizes user sentiment from sources you don't control. If the most visible user feedback is a critical Reddit thread from a year ago, that's what the AI reports, even if you've fixed every issue mentioned in it. ### 9. "[Your product] for enterprise vs [competitor] for enterprise" Enterprise-specific comparisons require the AI to know about your enterprise features: SSO, SOC 2, SLAs, dedicated support, deployment options. If this information isn't on a crawlable page in plain text, the AI can't include it in the comparison, and the vendor whose enterprise page is parseable wins by default. ### 10. "Should I switch from [competitor] to [you]?" The migration prompt. The buyer is already using the competitor and considering a switch. 
Answering requires the AI to assess migration difficulty, feature parity, and switching costs. If you don't have a migration guide or switching comparison page, the AI has nothing to work with, and "switching is risky" is the default narrative. ## What your website was built for vs what AI needs Your website was built for human visitors who browse, click, and read. AI needs something different: direct answers to specific questions in extractable formats. The gap between those two designs is where evaluation losses happen. **Your website has:** a homepage, product pages, a pricing page (maybe with a calculator), a blog, case studies behind a gate. **AI needs:** comparison pages for each competitor, FAQ content matching the exact questions above, plain-text pricing, feature pages with explicit capability lists, integration documentation with text descriptions. Most B2B websites are built to present strengths, not to answer hard questions. That gap is where [AI fills in with whatever sources it can find](/blog/ai-is-lying-about-your-company), and those sources are usually not yours. ## You can't answer these questions on your website alone Here's the uncomfortable part: publishing better content is necessary but not sufficient, because you can't see the evaluation from your side of it. The buyer asks, the AI answers, and neither event touches your analytics. The only workable approach is a measurement loop. First, find out what AI actually answers: run the adversarial questions above, grounded in real search data, across ChatGPT, Claude, Perplexity, and Gemini, and read the responses. Then diagnose which sources produced each answer and which claims are wrong or missing. Then fix the sources: publish the comparison page, the migration guide, the plain-text pricing. Then re-run the same prompts and measure the delta. That loop is what [we benchmark](/learn/what-is-ai-presence-management). 
The ten questions in this post aren't hypotheticals; they're the prompt patterns with the highest real search volume in our generation pipeline. Whatever the AI is answering for them today, it's answering from sources. The question is whether any of those sources are yours.

---

# Why AI recommends your competitor instead of you

> We analyzed 33,000 AI evaluations across four models. The most surprising finding: models disagree with each other on who to recommend 48.6% of the time. Which model the buyer opens matters more than most companies realize.

- Author: Max Wiesner
- Published: 2026-06-05
- Canonical: https://knitknot.ai/blog/why-ai-recommends-your-competitor/
- Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai)

---

## The question everyone asks first

When AI recommends your competitor, the reason is usually not your product. Across 33,000 evaluations on ChatGPT, Claude, Perplexity, and Gemini, the losses trace to who owns the sources, which facts went stale, and whose features framed the comparison. All of which are content problems, not product problems.

Companies assume otherwise. The first thing everyone wants to know when they run a benchmark is why they're losing, and the assumed answer is always: not enough features, wrong positioning, missing a capability the competitor has. Sometimes that's true. Usually it isn't.

The 33,000 evaluations cover head-to-head comparisons and brand perception queries for companies in B2B software, scored with [structured signal extraction](/blog/we-stopped-asking-ai-who-wins) across 39,000+ feature comparisons and 513 distinct features. Here's what we actually found.

## Models disagree with each other 48.6% of the time

This was the finding that surprised us most. We looked at every prompt that was evaluated by at least two engines with a decisive outcome (win or loss, excluding ties). 2,358 prompts met those criteria. Of those, 1,145 produced different outcomes depending on which model answered.
48.6%. Nearly half. Same company. Same competitor. Same prompt. Different engine, different winner. A buyer who asks ChatGPT gets told to go with Vendor A. A buyer who asks Gemini gets told to go with Vendor B. Both responses are confident. Both cite sources. Both sound authoritative. They just disagree. This means a company's competitive position in AI isn't a single number. It's four numbers, one per model, and they may tell opposite stories. The buyer's impression depends on which app they happened to open. ## Not all engines are equal The disagreement isn't random. Each engine has systematic biases that show up consistently across companies.

| Engine | Win rate | Avg score | Absence rate | Positive sentiment |
| --- | --- | --- | --- | --- |
| Gemini | 74.4% | 67.5 | 17.7% | 67.7% |
| Claude | 71.9% | 65.8 | 23.9% | 63.8% |
| ChatGPT | 69.4% | 64.4 | 26.3% | 63.1% |
| Perplexity | 66.1% | 62.7 | 38.8% | 58.3% |

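
The per-engine gaps in the table can be checked with a few lines of arithmetic. This is a minimal sketch: the figures are copied from the table, while the `ENGINE_STATS` name and the `metric_spread` helper are illustrative rather than part of any KnitKnot API.

```python
# Per-engine figures copied from the table above (all values in percent,
# except avg_score, which is on the 0-100 AI Presence Score scale).
ENGINE_STATS = {
    "Gemini": {"win_rate": 74.4, "avg_score": 67.5, "absence_rate": 17.7},
    "Claude": {"win_rate": 71.9, "avg_score": 65.8, "absence_rate": 23.9},
    "ChatGPT": {"win_rate": 69.4, "avg_score": 64.4, "absence_rate": 26.3},
    "Perplexity": {"win_rate": 66.1, "avg_score": 62.7, "absence_rate": 38.8},
}


def metric_spread(stats, metric):
    """Return (highest engine, lowest engine, gap in points) for one metric."""
    highest = max(stats, key=lambda name: stats[name][metric])
    lowest = min(stats, key=lambda name: stats[name][metric])
    gap = round(stats[highest][metric] - stats[lowest][metric], 1)
    return highest, lowest, gap


win_spread = metric_spread(ENGINE_STATS, "win_rate")
absence_spread = metric_spread(ENGINE_STATS, "absence_rate")
print(win_spread)      # ('Gemini', 'Perplexity', 8.3)
print(absence_spread)  # ('Perplexity', 'Gemini', 21.1)
```
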
The spread is 8.3 percentage points on win rate between the most favorable engine (Gemini, 74.4%) and the toughest (Perplexity, 66.1%). That's not noise. For a company running hundreds of buyer evaluations a month across models, 8 points is the difference between winning two-thirds of comparisons and winning three-quarters of them. The absence numbers are even more striking. On Perplexity, companies are completely absent from 38.8% of brand perception evaluations. On Gemini, it's 17.7%. A 21-point gap in whether the model even knows you exist. ## Why each engine behaves differently The differences map to how each model finds and weighs information. **ChatGPT** uses Bing's index. It favors well-established sources with high domain authority. Companies with strong traditional SEO tend to do relatively well on ChatGPT, but the index can lag behind content updates by weeks or months. ChatGPT produces the highest rate of negative sentiment (14.8%) and the second-highest absence rate, which suggests it's both opinionated and selective about who it includes. **Claude** uses Brave Search. Its source mix skews differently from Bing, which means different pages show up as high-influence sources. We've seen cases where a company's technical documentation ranks well in Brave but not Bing, producing a Claude evaluation that's grounded in different source material than ChatGPT's. Claude lands in the middle on most metrics. **Perplexity** is the outlier. It's the toughest engine across the board: lowest win rate, lowest average score, highest absence rate, and the most neutral sentiment (29.9% neutral vs ~20% for other engines). Perplexity cites Reddit at disproportionate rates, which means community perception carries more weight than vendor content. If an 11-month-old Reddit thread says your product has a limitation you've since fixed, Perplexity is the engine most likely to still be repeating it. **Gemini** is the most favorable. 
Highest win rate, highest score, lowest absence rate, most positive sentiment. We don't have a definitive explanation for why. One hypothesis: Gemini draws from Google's index, which is the deepest and most current. Companies with strong Google SEO infrastructure have more source material available for Gemini to synthesize from, and more of it is current. ## The absence problem In 26.6% of brand perception evaluations across all engines, the benchmarked company was completely absent from the AI's response. Not misrepresented. Not mentioned with wrong facts. Just not there. The AI answered a question about the company's category and didn't include them. Same company, same category, same question: one engine includes you, another doesn't. Absence is harder to fix than inaccuracy. When AI gets a fact wrong, the fix is specific: update the page, add structured data, publish a correction. When AI doesn't know you exist, the fix is broader: build the content authority that earns you a place in the AI's answer set. That takes time, and the [winner-take-all dynamics](/learn/how-b2b-buyers-use-ai) mean the window for establishing presence is narrowing. ## 28.6% of evaluations score below 50 This is the number that reframes the overall win rate. The aggregate decisive win rate across all engines is 70.7%. That sounds fine. Most companies are winning most of their head-to-head comparisons most of the time. But the distribution has a long tail. 28.6% of all evaluations, nearly 1 in 3, produce an AI Presence Score below 50. 15.5% score below 30. In those evaluations, the AI is actively working against the company: wrong facts, competitor-biased framing, missing coverage, negative sentiment. The 70.7% average hides these. A company can win 75% of its ChatGPT evaluations and lose 60% of its Perplexity evaluations. The blended number looks healthy. The per-engine number reveals a channel where buyers are being systematically steered away. 
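
A short sketch shows how a blended win rate can hide a weak engine. The win and loss counts below are invented for illustration; only the 75% ChatGPT win rate and the 60% Perplexity loss rate echo the example in the text.

```python
# Hypothetical decisive outcomes per engine as (wins, losses).
outcomes = {
    "chatgpt": (300, 100),    # 75% win rate
    "gemini": (240, 60),      # 80% win rate
    "claude": (150, 50),      # 75% win rate
    "perplexity": (40, 60),   # 40% win rate, i.e. a 60% loss rate
}


def win_rate(wins, losses):
    """Decisive win rate: wins over wins plus losses (ties excluded upstream)."""
    return wins / (wins + losses)


per_engine = {engine: win_rate(*counts) for engine, counts in outcomes.items()}

total_wins = sum(wins for wins, _ in outcomes.values())
total_decisive = sum(wins + losses for wins, losses in outcomes.values())
blended = total_wins / total_decisive  # 730 / 1000

weakest = min(per_engine, key=per_engine.get)
print(f"blended: {blended:.1%}")
print(f"weakest engine: {weakest} at {per_engine[weakest]:.1%}")
```

With these made-up counts the blended rate is 73%, while Perplexity sits at 40%. The blended figure is weighted by evaluation volume, so an engine with fewer evaluations can be losing badly without moving the headline number much.
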
## What actually determines the recommendation Based on [our scoring decomposition](/blog/we-stopped-asking-ai-who-wins), the recommendation in any single evaluation is driven by three interacting signals. **Source influence.** Whose content shaped the AI's answer? If the [highest-gravity source](/blog/citations-are-ownership-claims) is the competitor's comparison page, the evaluation framework is built from their perspective. This is the most common driver of losses we see, and the most fixable. Publishing a comparison page that answers the buyer's question from your frame changes the source that shapes the AI's answer. **Claim accuracy.** Did the AI get the facts right? Wrong pricing, feature misattribution, and outdated positioning all [degrade the score](/blog/ai-is-lying-about-your-company). When the AI says you don't support a feature you've had for a year, that's a claim error that directly costs you the recommendation. **Feature-level outcomes.** Across 39,117 feature comparisons and 513 distinct features, the pattern is clear: companies don't win or lose across the board. They win on some features and lose on others. The features the AI chooses to compare determine who wins the evaluation. And the features the AI chooses to compare are shaped by the source material it has access to. These three signals interact. A competitor-owned high-gravity source introduces the competitor's feature strengths as the evaluation criteria, quotes their pricing accurately (because it's from their own page), and produces a recommendation that logically follows from a framework designed to make them look good. The fix in almost every case is specific content, not product changes. Write the comparison page. Update the pricing. Publish the feature documentation that makes your strengths the evaluation criteria instead of theirs. ## What this changes about monitoring If you're monitoring one model, you're seeing roughly half the picture. 
The model you're tracking might show you winning while the model half your buyers use shows you losing. Single-model monitoring isn't wrong. It's incomplete in a way that creates false confidence. Cross-model benchmarking isn't just about coverage. It's about finding the engines where you're weakest and understanding why. The per-engine differences aren't random. They trace to source selection, index freshness, and community content weighting. A company that's losing on Perplexity because of a stale Reddit thread has a different problem than a company that's losing on ChatGPT because of a competitor's comparison page. There's an irony in how we ended up exposing this. KnitKnot runs an MCP server, so you can connect Claude or ChatGPT directly to your workspace and ask the assistant itself how your AI presence changed this week: score trends per engine, mention rollups, competitor deep dives, the full picture. The same models that disagree about you become the interface for monitoring the disagreement. Four models, four source ecosystems, four sets of buyer impressions. The recommendation your buyer gets depends on which one they ask. The question is whether you know what each of them says. --- # What ChatGPT says when a buyer asks to compare you > We ran the same comparison prompt across ChatGPT, Claude, Perplexity, and Gemini for a B2B company. Four models gave four different answers. Two got the pricing wrong. One recommended the competitor based entirely on the competitor's own blog post. - Author: Max Wiesner - Published: 2026-06-03 - Canonical: https://knitknot.ai/blog/what-chatgpt-says-when-buyers-compare/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## One prompt, four answers We asked ChatGPT, Claude, Perplexity, and Gemini the exact same comparison prompt about the same company. We got four different answers, two wrong prices, and one recommendation built entirely from the competitor's own blog. 
The test is the simplest one we run. Take the company's top competitor and ask all four models: "Compare [company] vs [competitor] for [their primary use case]." It's the prompt that matters most because it's the prompt buyers actually type. Not "tell me about Acme." Not "what is Acme." The comparison prompt. The one that implies a decision. We did this for a Series B infrastructure company (anonymized here, with their permission to share the shape of the results). Their top competitor is a well-funded incumbent with strong content marketing. The prompt was straightforward: "Compare [Company] vs [Competitor] for real-time data processing in production environments." Here's what came back. ## ChatGPT recommended the competitor ChatGPT produced a structured response. Introduction, feature comparison, pricing comparison, recommendation. Professional tone. It read like a well-written analyst brief. The recommendation went to the competitor. The reasoning: better enterprise support, more mature documentation, wider integration ecosystem. Two of those three claims were accurate. The third, the integration ecosystem, was outdated by about eight months. The company had shipped 12 new integrations since the data ChatGPT was trained on. But the more interesting finding was in the sources. ChatGPT cited five pages. Three of them were the competitor's own content: a comparison page titled "[Competitor] vs [Company]," a blog post about their integration ecosystem, and their enterprise features page. The other two were a neutral G2 review and the company's own docs landing page. The competitor had published a comparison page. The company hadn't. So the AI built its competitive framework from the competitor's frame. The evaluation criteria, the feature dimensions, the overall narrative arc came from the competitor's content strategy. The AI just synthesized it into an authoritative-sounding answer. ## Claude disagreed Claude, running through Brave Search, recommended the company. 
Same prompt, opposite conclusion. The reasoning was different. Claude focused on performance benchmarks and developer experience. It cited the company's technical documentation heavily, including a benchmarking page that demonstrated latency advantages. The competitor's marketing content didn't rank as well in Brave's index, so Claude's source mix skewed toward technical documentation rather than marketing pages. Interestingly, Claude also got the pricing wrong, but in the opposite direction from ChatGPT. ChatGPT quoted the company's old pricing (too high). Claude quoted the competitor's old pricing (too low). Neither model had current pricing for either vendor. Two models, two wrong prices, two different recommendations. The buyer's experience depends entirely on which app they opened first. ## Perplexity was balanced but cited Reddit Perplexity presented a balanced comparison without a strong recommendation. It acknowledged strengths on both sides and suggested the choice depended on the use case. The source list was revealing. Perplexity cited six sources: a Reddit thread from r/dataengineering where someone asked about the two products, a Hacker News comment from a user who had evaluated both, two vendor pages (one from each company), and two blog posts from third-party engineering blogs. The Reddit thread was 11 months old. The commenter who recommended the competitor did so based on a limitation the company had since fixed. Perplexity treated it as current information because the thread was recent enough to be in its index. This pattern, Reddit threads carrying disproportionate weight in Perplexity's citations, is consistent across our benchmarks. [Research shows](https://www.aisosystem.com/en/blog/perplexity-sources-how-to-get-cited) that 46.7% of Perplexity's top cited sources come from Reddit. 
For companies in technical categories where Reddit discussions are active, this means community perception from months ago is actively shaping how Perplexity represents them to current buyers. ## Gemini barely knew them Gemini's response was the shortest. It described both companies in general terms, got the high-level positioning right, but didn't have enough detail to make a meaningful comparison. It fell back to generic recommendations: "evaluate based on your specific requirements" and "consider requesting demos from both vendors." The company appeared in the response, which means they cleared the visibility threshold. But the lack of depth meant Gemini couldn't compare them on any dimension that would actually help a buyer decide. For companies with lower public profiles, Gemini often skips them entirely. We track omission rate as a separate metric. Being absent from a Gemini response isn't the same problem as being misrepresented in a ChatGPT response. Different models, different failure modes. ## What we learned from one prompt This single prompt, run across four models, surfaced five problems the company didn't know they had. **Two models had wrong pricing.** ChatGPT was quoting the company's pre-restructuring price. Claude was quoting the competitor's old price. A buyer comparing costs in either model would make a decision based on numbers that hadn't been accurate for months. **The competitor's comparison page was driving ChatGPT's recommendation.** The company had strong technical content but no published comparison page. The competitor did. The AI built its evaluation framework from that page. The recommendation followed logically from the competitor's framing. **An 11-month-old Reddit thread was shaping Perplexity's analysis.** A user had mentioned a limitation that had since been fixed. Perplexity was still presenting it as current. **The integration gap had been closed but no model knew.** The company had shipped 12 integrations. 
The AI was working from a snapshot that predated those launches. No model had picked up the new integrations. **Gemini didn't have enough data to compare.** The company was present but shallow. A buyer using Gemini would get a non-answer that would likely send them to a different model for a real comparison. Five problems. Five different fixes. None of them involve changing the product. All of them involve publishing the right content, in the right format, to the right sources. ## The fixes were specific For the pricing problem: the company updated their pricing page with plain-text pricing (not just an interactive calculator), added schema markup, and published a short blog post announcing the current pricing structure. Within weeks, Perplexity (which uses live search) reflected the new numbers. For the comparison page gap: they published their own "[Company] vs [Competitor]" page, structured with a clear feature comparison table, use-case recommendations, and a direct verdict. When we re-ran the benchmark two months later, ChatGPT's recommendation had flipped. For the Reddit thread: there was no direct fix. You can't edit someone else's Reddit post. But the company's engineering team started posting detailed technical responses in relevant subreddit threads, which created newer, more accurate community content for Perplexity to cite. For the integration gap: the company created a dedicated integrations page with structured data listing all current integrations with launch dates. This gave models a single authoritative source for integration information. For Gemini's shallow coverage: this was the slowest to address. Gemini's index needed more source material to build a meaningful comparison. The combination of the comparison page, integrations page, and pricing page collectively gave Gemini enough data to produce a substantive response in later benchmark runs. This is the loop the whole product is built around: benchmark, fix, re-benchmark, read the deltas. 
Each run's Measurement tab shows exactly what shifted since the previous run, which mentions appeared or disappeared, which entities moved, which citations changed. "ChatGPT's recommendation flipped" isn't an anecdote in that view. It's a row in the diff. ## Try it yourself Open ChatGPT, Claude, Perplexity, and Gemini. Type the same prompt into each: *"Compare [your company] vs [your top competitor] for [your primary use case]"* Read all four responses side by side. Check the pricing. Check the feature attributions. Note which sources are cited. Note whether the recommendation is consistent or contradictory. (This is what a spot test does in KnitKnot: one prompt, run on demand across engines, scored the same way as a full benchmark.) If the responses are accurate and consistent, you're in better shape than most. If they're not, you now know exactly which facts are wrong and which sources are driving the narrative. That's the starting point for fixing it. What surprised us wasn't that the AI got things wrong. It's that four models got different things wrong, in different directions, citing different sources. A buyer's impression of the company depended entirely on which app they happened to ask. That level of inconsistency isn't something you can find by monitoring a single model. And it's not something you can fix with a single content update. Each model has different sources, different citation preferences, and different failure modes. The benchmark has to cover all of them. --- # Not all citations are equal > A source that shaped the AI's recommendation carries more weight than one that provided a background fact. We model which sources had the most influence over what the buyer heard. 
- Author: Max Wiesner - Published: 2026-05-30 - Canonical: https://knitknot.ai/blog/citations-are-ownership-claims/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## Five sources, one answer A citation that supplied the AI's recommendation is not equal to a citation that supplied a background statistic. Most benchmarks count them the same. When an AI responds to a buyer's comparison question, it typically cites three to seven sources. The standard treatment is a flat list: count how many are yours, how many are your competitor's, compute a ratio, move on. That's source balance, and it's useful the way a batting average is useful: directionally correct, deeply incomplete. One source might have supplied the recommendation. Another might have provided a single statistic. A third might appear in the footnotes without visibly influencing anything the buyer reads. Counting them as equivalent is like weighting a walk the same as a grand slam. The question isn't how many sources were yours. It's which sources shaped what the buyer heard. ## What is source gravity? Source gravity is how much influence a cited source had over the substance of the AI's response. A high-gravity source is one the AI paraphrased extensively, drew its recommendation from, or used to frame the competitive comparison. A low-gravity source is one that appeared as a footnote, provided a single data point, or was listed for completeness without shaping the narrative. The distinction matters because the owner of the highest-gravity source effectively controlled the answer. Consider a response where the AI recommends your competitor for enterprise compliance. 
It cites five sources: your product docs (background on your feature set), two of your competitor's blog posts (the basis for the recommendation and the feature comparison), an analyst report (a supporting claim about market positioning), and a community thread (a user opinion that reinforced the recommendation).

Source balance says 1:2:1:1. Looks like an unremarkable spread. But the gravity distribution tells a different story. The competitor's blog posts shaped the recommendation and the feature framing. Your docs provided background that the AI acknowledged but didn't act on. The analyst report and community thread reinforced a conclusion that was already formed. The competitor owned the high-gravity sources. You owned a low-gravity one. The answer was written from their frame.

## The ownership-influence matrix

This creates a 2D problem, similar to the accuracy-conviction matrix in [our scoring system](/blog/confident-lies-are-worse-than-hedged-ones). One axis is ownership (yours, competitor's, third-party). The other is influence (how much the source shaped the answer).

| | High influence | Low influence |
| --- | --- | --- |
| **Your source** | Best case. Your content shaped the answer. Your positioning landed. | Wasted presence. AI cited you but didn't use you. Your content isn't structured for extraction. |
| **Competitor source** | Worst case. Competitor's framing drove the recommendation. They sold through the AI before you saw the lead. | Noise. Competitor was cited but didn't shape the outcome. Low priority. |
| **Third-party source** | Borrowed authority. An analyst or community voice is driving the answer. You need to understand why the AI trusts them more than you. | Background. Supporting detail, not driving the narrative. |

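
The five-source example above can be worked through in code. This is a minimal sketch, not KnitKnot's scoring: the `Source` type, the `cell` helper, the 0.5 threshold, and the numeric influence values are all assumptions chosen to mirror the example. It places each cited source in a matrix cell and compares a flat citation count with an influence-weighted share.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Cell labels copied from the matrix; keys are (owner, is_high_influence).
CELLS = {
    ("yours", True): "Best case",
    ("yours", False): "Wasted presence",
    ("competitor", True): "Worst case",
    ("competitor", False): "Noise",
    ("third_party", True): "Borrowed authority",
    ("third_party", False): "Background",
}


@dataclass(frozen=True)
class Source:
    name: str
    owner: str        # "yours", "competitor", or "third_party"
    influence: float  # hypothetical 0..1 score, assumed for illustration


def cell(source, threshold=0.5):
    """Return the matrix cell for a source (threshold is an assumption)."""
    return CELLS[(source.owner, source.influence >= threshold)]


def flat_balance(sources):
    """Count citations per owner, ignoring influence."""
    return Counter(s.owner for s in sources)


def weighted_share(sources):
    """Share of total influence held by each owner."""
    totals = defaultdict(float)
    for s in sources:
        totals[s.owner] += s.influence
    grand = sum(totals.values())
    if grand == 0:
        return {}
    return {owner: total / grand for owner, total in totals.items()}


# Influence values below are made up to match the narrative of the example.
sources = [
    Source("your product docs", "yours", 0.1),
    Source("competitor blog post (recommendation)", "competitor", 0.9),
    Source("competitor blog post (feature comparison)", "competitor", 0.8),
    Source("analyst report", "third_party", 0.3),
    Source("community thread", "third_party", 0.3),
]

for s in sources:
    print(f"{s.name}: {cell(s)}")
print(dict(flat_balance(sources)))
print({k: round(v, 2) for k, v in weighted_share(sources).items()})
```

With these assumed weights the flat count gives the competitor two of five citations, while the weighted share gives it most of the influence. The exact numbers depend entirely on how influence is scored, which the sketch does not attempt.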
The competitor and third-party cells in the high-influence column are where the action is. When a competitor's high-influence source drives a recommendation, the fix is specific: write the page that answers that exact buyer question from your frame, structured for AI extraction, not for human browsing. When a third-party high-influence source carries the answer, you need to either influence that source or create first-party content that competes with it.

Flat source balance misses all of this. A 3:2 ratio in your favor sounds good until you realize the competitor's two sources shaped the recommendation and your three provided background no one acted on.

## Where this goes

Today the ownership axis is fully built. Every cited source is classified as yours, your competitor's, or third-party, per workspace, with subdomain matching, and when you claim or reassign a domain, existing citations are re-classified. The Sources view shows the ownership balance across all evaluations, each competitor has its own cited-source list, and you can see which domains each engine actually cites for your category. That's the foundation.

The influence axis is where we're headed.

**Influence-weighted source balance.** Instead of counting citations, weight them by how much each source contributed to the recommendation, the feature comparison, and the competitive framing. A competitor-owned source that shaped the recommendation counts more than three of your docs pages that provided background facts.

**Source authority scoring.** Some sources get cited repeatedly across evaluations. If the same competitor blog post shows up in 15 different AI responses about your category, that page has outsized authority in the model's understanding of your market. Surfacing these high-authority sources tells you exactly which pages to compete with.

**Content gap prioritization.** When you know which buyer questions are being answered by competitor-owned high-influence sources, you know which pages to write first.
Not the pages with the most missing keywords. The pages that would shift the highest-gravity citations from their column to yours. Source balance tells you who got cited. Source gravity tells you who controlled the answer. We think you need both. --- # AI is lying about your company > We pulled every factual claim from our first 2,000 benchmark evaluations and checked them against reality. The error rate was higher than we expected, and the errors weren't random. - Author: Max Wiesner - Published: 2026-05-27 - Canonical: https://knitknot.ai/blog/ai-is-lying-about-your-company/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## We expected tone problems. We found wrong facts. Roughly one in four companies in our first 2,000 benchmark evaluations had at least one AI claim about them that was flatly, verifiably wrong. Not "framed unfavorably." Wrong. The product costs $X, and the AI said it costs $Y. The product supports feature Z, and the AI said it doesn't. That's not what we expected. When we started building the claim extraction pipeline, we assumed the main problem would be soft: tone issues, vague positioning, things you notice on the third read but wouldn't flag on the first. So we pulled every factual claim from those 2,000 evaluations across ChatGPT, Claude, Perplexity, and Gemini. A claim is anything the AI stated as fact about a specific company: pricing, features, founding date, customer base, integrations, market position. We checked each one against the company's current website, documentation, and public records. The errors weren't distributed evenly. They clustered into five patterns. ## Stale pricing was the most common The single most frequent factual error was outdated pricing. A company restructures their pricing in January. The AI is still quoting the old numbers in June. Not approximately. Exactly. The old tier names, the old dollar amounts, sometimes even a free tier that was sunset a year ago. 
We saw this across a compliance automation company that had dropped its entry price by 40% six months earlier. Every model was still quoting the old number. For a buyer with a budget constraint, that's the difference between "this might work" and "too expensive, next." The company lowered their price specifically to win more mid-market deals, and the AI was actively undermining that strategy in every evaluation. Pricing errors are especially damaging because they're the most actionable claim in an AI response. A buyer can tolerate uncertainty about features or positioning. Pricing is binary. It either fits the budget or it doesn't. ## Feature misattribution was the most infuriating The second pattern was features attributed to the wrong company. We benchmarked two competing developer tools that both describe their core architecture as "event-driven." In three separate evaluations, Claude attributed Company A's webhook-based architecture to Company B and vice versa. The AI had no way to distinguish them because the marketing language was nearly identical. The features were different. The descriptions were interchangeable. This shows up most often with technical capabilities where the industry has converged on shared vocabulary. "Real-time sync," "native integrations," "API-first," "enterprise-grade security." When every company in a category uses the same adjectives, the AI blends their capabilities into a composite that belongs to no one. The result was a feature comparison table in the AI response that was coherent, specific, and wrong. A buyer reading it would have no reason to question it. Every claim had the right shape. The attributions were just reversed. Feature misattribution is the error that makes people angriest when we show them a benchmark for the first time. Not because the AI said something vague. Because it said something specific and confident and attributed their work to someone else. 
## Competitor framing was the hardest to detect The third pattern wasn't a factual error in the traditional sense. It was narrative control. When we looked at evaluations where a company lost the recommendation, we started pulling the cited sources. In a surprising number of cases, the highest-influence source wasn't a neutral third party. It was the competitor's comparison page. The competitor had published a detailed "Us vs Them" page, structured for AI extraction, and the AI had built its competitive framing from that page. The effect was subtle. The AI didn't say anything false about the losing company. It framed the entire comparison using the competitor's evaluation criteria. Their strengths were the dimensions being compared. The losing company's strengths weren't mentioned because they weren't in the competitor's comparison framework. This is what motivated us to build the [source gravity model](/blog/citations-are-ownership-claims). A flat citation count misses this entirely. The competitor might have fewer total citations, but if their one source shaped the recommendation, they controlled the answer. We think this pattern is the most important one for B2B companies to understand because it's the most fixable. The losing company didn't have a content problem in general. They were missing one specific page: the comparison page that answers the buyer's exact question from their own frame. Without it, the competitor's frame wins by default. ## Fabrication was less common than we expected AI hallucination gets a lot of press. We expected it to dominate. It didn't. Full fabrication, the AI inventing details that have no basis in any source, showed up in roughly 5% of the claims we flagged as errors. It was concentrated in smaller companies where the AI had sparse training data. When the model doesn't have enough information, it interpolates from similar companies. 
One startup in our dataset was described by ChatGPT as "founded in 2018 and headquartered in Austin" when they were founded in 2021 in San Francisco. The AI had apparently merged them with a similarly-named company in an adjacent space. The takeaway for us was that hallucination is real but it's not the primary accuracy problem. Staleness and misattribution are far more common and far more damaging at scale. Most AI errors aren't invented from nothing. They're real facts applied to the wrong company or the right facts from the wrong point in time. ## Omission was invisible in our data until we looked for it The last pattern was the one we almost missed. We were focused on claims the AI made. We weren't counting the cases where the AI made no claim because it didn't mention the company at all. A benchmark company would ask us to run their category, and in a significant fraction of landscape prompts, their name simply didn't appear. The AI surfaced four or five competitors and skipped them entirely. This doesn't show up as an error in claim-level analysis. There's nothing wrong to flag. The company is just absent. And unlike Google, where being on page two still gets you some impressions, there is no page two in an AI response. Either you're in the answer or you aren't. We started tracking omission rate as a separate metric after noticing this. For some companies, the omission problem was bigger than the accuracy problem. They weren't losing because the AI was getting them wrong. They were losing because the AI didn't know they existed. ## What this changed about how we score These patterns directly shaped how we built the [scoring decomposition](/blog/we-stopped-asking-ai-who-wins). Stale pricing and feature misattribution are claim accuracy problems. They feed the accuracy component, severity-weighted by how confidently the AI stated the error. A wrong price stated as fact hurts more than a wrong price stated with a hedge. 
In the report, each one surfaces as a misrepresentation with a proof receipt: the exact knowledge-base source that contradicts what the AI said. The fix is a specific page, not a guess. Competitor framing is a source influence problem. It feeds the source balance component and the recommendation component. When we see a competitor-owned source driving the recommendation, we know the fix is content-level, not product-level. Omission is a coverage problem. It's why we weight coverage depth as a multiplier on everything else. A company that appears in 20% of relevant evaluations with perfect accuracy has a fundamentally different problem than a company that appears in 80% of evaluations with mediocre accuracy. The first company needs to exist in the AI's answer set. The second company needs to fix what the AI says about them once it does. Every error type implies a different fix. That's the whole point of decomposing the score instead of asking the AI to rate you on a scale of 0 to 100. A single number hides whether the problem is pricing, features, framing, or visibility. The component vector tells you which one to work on first. ## How long do AI errors persist after you fix the source? It varies by engine, and it's now measurable. Perplexity updates fast because it pulls from live search. ChatGPT is slower. Claude is unpredictable. Every benchmark run writes a per-engine trend snapshot, and each run's Measurement tab shows the deltas since the previous run: mentions gained and lost, entity shifts, citation changes. Fix the pricing page, re-benchmark, and the landing date of the fix is observable instead of guessed. The open question we're still working on is cross-model divergence on factual claims. The same buyer question asked to four models often produces four different sets of facts about the same company. 
When ChatGPT says you support a feature and Claude says you don't, one of them is wrong, and the buyer's experience depends on which model they happened to open. Cross-model claim consistency is an underexplored dimension of AI presence. --- # A customer told us our benchmark was rigged > We designed adversarial prompts to show companies where AI was misrepresenting them. Customers kept getting defensive about the prompts themselves. So we rebuilt the whole thing around real buyer behavior. - Author: Kevin Kho - Published: 2026-05-21 - Canonical: https://knitknot.ai/blog/rebuilding-prompt-generation/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## The report that started a fight Two weeks ago we walked Drata's head of growth through a benchmark report we'd generated for them. Drata is one of the leaders in the compliance automation category. They have a real perspective on how they get represented online, and a real opinion about what fair benchmarking looks like. About a third of the way in, he stopped me. "It looks like you engineered them to win this." He was pointing at the prompts. The questions we feed into ChatGPT and Claude to test how AI represents a company. And honestly, he had a point. We had been writing prompts designed to surface the worst representation possible. Things like "What are the hidden problems with Drata?" and "Why do people switch from Drata to Vanta?" We thought we were doing companies a favor by showing them the absolute worst case. The customer saw it differently. If the test looks rigged, the results don't matter. You can't show someone their blind spots if they don't trust your eyes. ## Why we wrote them that way I want to explain why we did it, because the instinct wasn't wrong. When you build a benchmark, you want it to show areas for improvement. A report card that says "you're doing great everywhere" is useless. Nobody learns from it, nobody acts on it, nobody shares it with their team. 
So we leaned into antagonistic prompts. Stress-test the brand. Find where AI says the worst things. Show the bleeding. We had seven categories of prompts at that point. Head-to-head comparisons. Brand perception probes. Negative sentiment. Each one was designed to find a different kind of weakness. And the prompts worked in the sense that they found real problems. AI was saying inaccurate things about companies, and our prompts caught it. But we had started to drift. The prompts were adversarial by construction, not by accident. We were generating questions no real buyer would ever type. "Compare Drata and Vanta on documentation aesthetics." Nobody has ever asked ChatGPT that. We were measuring something, but it wasn't something anyone cared about. The prompt is half the measurement. If you ask a loaded question, you get a loaded answer. We knew this in principle but we hadn't applied it to our own product. ## The two questions we couldn't answer After the Drata call we sat down and asked two questions we'd been avoiding. First: **is this prompt something a real buyer would actually type into ChatGPT?** For a lot of our prompts, the honest answer was no. We were stress-testing in a way that felt satisfying to us but didn't reflect how buyers actually research software. Second: **would a reasonable person look at this prompt and say it's fair?** Again, for many of them, no. We were leading the witness toward the AI, and anyone who read the prompt itself could see it. We realized we had a trust problem, not a coverage problem. More prompts wouldn't fix it. More categories wouldn't fix it. The prompts themselves had to be defensible. ## Grounding every prompt in a real search So we rebuilt prompt generation around a single constraint: every prompt has to be tied to a real Google query that real buyers type. We built what we call a grounding service. For each company and its competitors, we pull keyword data from DataForSEO. 
Not just "X vs Y" queries, but the long tail: "Drata vs Vanta for SOC 2," "compliance automation for startups," "best GRC tools for enterprise." Three endpoints per seed, suggestions and ideas and related keywords, all fanned out and deduplicated.

Each prompt gets assigned a grounding tier:

- **Direct** if it maps to a query like "Drata vs Vanta" that has measurable monthly search volume. This is the strongest signal. Real buyers are asking this exact question right now.
- **Category** if direct queries are thin but category-level queries exist ("compliance automation tools"). Tests whether AI surfaces you when buyers ask about the category.
- **Synthesized** if there's no measurable volume at all. GPT-generated, labeled as such. Common for brand-new companies. Your AI presence baseline starts here.

Every prompt in the system now carries its grounding source. Search volume. CPC. Competition index. The customer can see exactly why we asked that question and how many real buyers are asking the same thing.

## Finding the actual buyers

The other thing we got wrong was assuming we knew who the buyer was. Our prompts used to frame around hardcoded personas. "I'm a VP of Engineering evaluating..." or "As an IT director, help me decide..." These were guesses. Reasonable guesses, but guesses. And they made the prompts feel artificial in a different way.

We replaced them with research-discovered personas. When we start researching a company, we now discover six to eight actual buyer profiles based on the company's positioning, features, and product lines. A CISO at a 5,000-person enterprise has different priorities and constraints than a Compliance Manager at a 200-person startup. The prompts reflect that specificity. Not "I'm evaluating compliance tools" but "I'm a CISO at a Series C company, I need to consolidate audit tools before our next board review, and I'm comparing Drata and Vanta."

This matters because the persona changes the question.
A CISO asks about vendor consolidation and audit readiness. A DevOps lead asks about API integrations and deployment friction. Same product category, entirely different prompts, entirely different AI responses. ## Reading what competitors are actually publishing The last piece is the newest. We noticed that AI's misrepresentations often came from competitor content. Not because competitors were lying, but because they were publishing more, and about the specific features where AI was getting things wrong. AI was reading their blog posts and help docs and repeating their framing back to buyers. So we built an article crawl pipeline. After every benchmark run, we fetch the top 10 competitor URLs that AI cited in feature losses. We extract the actual article text. Then we use GPT-4o to pull 3-5 verified quotes from each article, the specific sentences that are training AI to view the company a certain way. Every quote is verified against the original text. If the substring doesn't match, it gets dropped. This changed the report from "AI said this about you" to "here's the article that taught AI to say that." It's one thing to know you're losing. It's another thing to see the exact paragraph your competitor published that made it happen. ## What good benchmarking looks like We still believe you need to see the worst case. A report card that only highlights strengths is a press release, not a diagnosis. But the worst case has to come from reality, not from us engineering it. Every prompt we generate now is symmetric (same structure regardless of who "should" win), grounded in a real Google query with monthly volume, and framed from the perspective of a real buyer persona we discovered through research. We expect this to be the methodology line that comes up in every sales conversation: every prompt is symmetric and grounded in a real Google query with monthly search volume. Read it yourself. That's what good benchmarking looks like. 
Thanks to the Drata team for the push. ## What's next The current grounding source is Google search volume via DataForSEO. The next layer we're working on is Reddit search behavior. What buyers ask each other in unmoderated forums often diverges from what they type into Google. When we add Reddit grounding, we'll publish a follow-up here. If you want a benchmark for your category, symmetric and grounded in real queries, book a slot. --- # Prompt libraries are coverage optimization problems > A bigger prompt library doesn't mean a better benchmark. We had hundreds of prompts and still missed the buyer situations that mattered most. - Author: Max Wiesner - Published: 2026-05-14 - Canonical: https://knitknot.ai/blog/prompt-libraries-are-coverage-optimization-problems/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## The library looked full A prompt count is not coverage. It's inventory. The first version of our prompt library had the shape every AI evaluation product eventually drifts toward: generate a lot of questions, group them into categories, run them across models, call the result coverage. It looked serious. Brand perception prompts. Landscape prompts. Head-to-head comparisons. Feature-specific prompts. Persona prompts. If you opened the table, the count was high. Then we started reading the actual results. Some competitors showed up in every other prompt. Some features barely appeared at all. Some buyer personas existed in the company research but never made it into the active benchmark. A product with only a few obvious competitors would run out of head-to-head prompts, so the library quietly backfilled with broad landscape questions nobody actually searches for. The benchmark looked complete because the number was high. But the measurement surface had holes. ## Coverage is multi-dimensional If a buyer is comparing you to Datadog for incident response, that's one measurement cell. 
If a VP of Engineering is comparing you to Datadog for audit logging, that's another. If a security lead is comparing you to Datadog because they need SOC 2 evidence by next quarter, that's a third. Those aren't paraphrases. They're different buyer situations. The model may recommend you in one and the competitor in another. It may know you support a feature but fail to mention it for the persona who cares most. Once you see that, "generate 100 good prompts" stops being the right objective. The objective is to cover the buyer-decision space. That space has at least five dimensions: **Competitor.** A benchmark that only tests one rival isn't useful if your sales team hears five names in discovery. **Feature.** "Acme vs Widgetly" is too broad. Buyers ask about deployment, governance, pricing, integrations, migration. If the library misses the features where AI is confused, the benchmark misses the reason you lose. **Persona.** The same product gets evaluated by a founder, a CISO, a data engineer, and a RevOps lead. They ask different questions because they're buying different risk reductions. **Demand.** Some prompts map to real Google searches with measurable volume. Some are synthesized to test an uncovered edge. Those shouldn't be treated as equivalent. **Subject scope.** If a company sells multiple products, a company-wide prompt library collapses too much. The AI may know the parent brand but confuse the products. A library can have hundreds of prompts and still be bad. It can be dense in the wrong places. ## What we did about it We flattened the library. No more prompt sets, categories, or template families. A prompt is a row with structured tags: competitor, feature, persona, product, search volume, source. Once it's a row with dimensions, you can ask better questions: which competitors are under-covered? Which features have no active prompt? Which personas exist in research but never appear in the benchmark? 
A Library Health view answers these continuously, and rebuilds show a preview of their changes before anything is committed.

We also made generation keyword-first. Prompts are built around real Google search data, with measured search volume attached to each prompt, rather than generated freely and volume-checked later. The demand dimension stops being an afterthought and becomes the seed. In practice that lands at roughly 50 prompts per subject, around 300 per workspace, and each product line gets its own subject-scoped library so the parent brand and its products are benchmarked separately.

Then we set an explicit composition target. For B2B AI presence, the most valuable prompts are head-to-head comparisons. That's where recommendations happen. So the active library is roughly 90% head-to-head and 10% landscape.

That ratio matters because without it, the library follows the path of least resistance. If broad category prompts are easier to generate, they crowd out comparisons. If one competitor has more search volume, it monopolizes the active set.

We hit this directly. A thin product with only two real competitors would run out of distinct comparisons and backfill with generic landscape prompts. The benchmark looked healthy by count, but it wasn't testing buyer moments anymore.

The fix was to generate from the full cross-product of dimensions:

- *Acme vs Widgetly*
- *Acme vs Widgetly for deployment controls*
- *Acme vs Widgetly for a security lead*
- *Acme vs Widgetly for deployment controls as a security lead*

That's not prompt spam. Even three competitors, five features, and six personas create 3 × 5 × 6 = 90 distinct buyer situations without repeating the same generic comparison.

Finally, we rank before activating. Real search volume beats no volume. Starred manual prompts stay active. Competitors fill evenly instead of letting the highest-volume rival take every slot. Semantic deduplication removes prompts that are different strings but the same buyer question.
And every prompt gets an activation reason: *starred*, *competitor*, *landscape*, or *over budget*. If a customer sees a weird benchmark question, they can tell where it came from. ## Why this matters AI benchmarks are fragile because the prompt is half the measurement. If the prompts are too adversarial, customers reject the benchmark as rigged. If the prompts are too generic, the benchmark never finds the real gaps. If they cluster around one competitor, the score overfits to one sales motion. The fix isn't more prompts. It's better allocation. We stopped asking "how many prompts do we have?" and started asking "what buyer situations are still uncovered?" That's the question that makes a benchmark useful. --- # Approximating the Claude Engine > ChatGPT, Perplexity, and Gemini all have incognito search. Claude doesn't. To benchmark how Claude represents companies, we had to find a way that respects Anthropic's terms instead of working around them. Here's what we built. - Author: Kevin Kho - Published: 2026-05-07 - Canonical: https://knitknot.ai/blog/approximating-the-claude-engine/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## The asymmetry nobody talks about A KnitKnot benchmark runs the same prompt across ChatGPT, Claude, Perplexity, and Gemini. Then we score [how each engine represented the company](/blog/introducing-knitknot). Three of those four have incognito search. You can hit ChatGPT, Perplexity, and Gemini without logging in. There's an entire industry around it. The responses come back with citations. You can run thousands of prompts a day without anyone noticing. Claude is different. claude.ai is gated. There's no public surface a scraper can sit on. And even if you logged in and automated it, that doesn't work for a product that runs evaluations on a schedule for paying customers. You'd be juggling accounts. Accounts accumulate memory. The results drift. 
The whole thing is fragile and, frankly, not compliant with Anthropic's terms of service. We needed a different path. ## The wrong answers first The lazy move was to skip Claude entirely. Run the other three, call it good enough. But Claude is the engine most of the engineers we talk to use the most. Leaving it out of a benchmark about AI buying behavior would have been a bigger lie than any approximation. The other lazy move was to call the Claude API with no web access and pretend that was the same thing. It isn't. Base-model Claude has no idea what happened recently. When a buyer asks Claude about a product today, they're asking Claude with search turned on. That's a different system, and it's the one we needed to measure. ## What we knew about how Claude searches Anthropic has publicly documented that claude.ai uses Brave Search to power its web lookups. That mattered more than it sounds. If the search engine is the same, then the gap between claude.ai and a bare API call isn't the index. It's the queries. claude.ai is wrapped in a system prompt that tells Claude how to search: short queries, broad before narrow, prefer original sources over aggregators. The bare API doesn't know any of that, so it asks the wrong questions and finds the wrong pages. We didn't need to be inside claude.ai. We needed to replicate the same loop: Claude decides what to search for, Brave returns the pages, Claude reads them and writes a cited answer. All three steps are things we can do from the outside, with the real model, against the real web. All within the API's terms of use. ## Building the fan-out The key variable turned out to be how many searches we ran per prompt. A single Brave query would find maybe a third of the sources claude.ai cited. Six to eight diverse short queries, 10 results each, got us to around 80% source overlap. That was the threshold where the pipeline started producing answers that looked like what a buyer would actually see. 
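
The source-overlap figure is a recall measurement: of the sources claude.ai cited, what fraction did the pipeline also find? Here is a minimal sketch, assuming URLs are compared after light normalization. The names `normalize`, `merge_results`, and `source_overlap` are hypothetical, the example URLs are made up, and no search API is called.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(result_lists):
    """Deduplicate results from several queries, keeping first-seen order."""
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


def source_overlap(found, ground_truth):
    """Fraction of ground-truth sources that also appear in `found`."""
    truth = {normalize(u) for u in ground_truth}
    if not truth:
        return 0.0
    hits = truth & {normalize(u) for u in found}
    return len(hits) / len(truth)


# Two fan-out queries whose results partly overlap (placeholder URLs).
per_query = [
    ["https://www.example.com/a", "https://example.org/b/"],
    ["https://example.com/a/", "https://example.net/c"],
]
# Sources cited in the browser ground truth for the same prompt (placeholders).
cited = [
    "https://example.com/a",
    "https://example.org/b",
    "https://example.net/c",
    "https://example.io/d",
    "https://example.dev/e",
]

found = merge_results(per_query)
print(len(found), source_overlap(found, cited))
```

Adding queries can only grow `found`, so overlap rises or stays flat as the fan-out widens. That is why the number of searches per prompt is the variable worth tuning.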
To get the fan-out right, we ran an auto-research loop on the whole chain, inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) approach. We pointed Claude at the problem and let it iteratively query, search, and synthesize across the full pipeline, then measured recall against browser ground truth at each step.

Here's what the fan-out actually looks like. Say the benchmark prompt is *"What's the best AI visibility tool for B2B SaaS?"* A single search for that exact phrase returns almost nothing useful. Instead, the pipeline generates six to eight diverse short queries based on the prompt:

- *AI visibility tools B2B*
- *best GEO platforms 2026*
- *KnitKnot vs alternatives*
- *AI answer optimization software*
- *how brands track ChatGPT mentions*
- *generative engine optimization tools*

Each query hits Brave independently. Claude reads all 60–80 results, deduplicates, and synthesizes a cited answer. The diversity is what makes it work: the exact phrase misses whole categories that a human buyer would find organically by browsing around.

We validated against 20 diverse prompts and 30 real production prompts from our own database. Side-by-side, browser versus API. The fan-out approach hit 80% source overlap at about $0.07 per prompt and 10 seconds. The old browser pipeline was 37 seconds per prompt and fragile. The API with built-in search was $0.13 and found different sources entirely. The fan-out was the only approach that actually approximated what a buyer sees.

We stopped where the source overlap plateaued. When we pushed further, the marginal gain didn't justify the extra queries and cost.

## What this means for customers

A benchmark is only useful if every engine in it can be re-run on demand, at scale, on a schedule, without melting. Our Claude engine clears that bar. It runs the same way for every customer, on every run, with the same retry behavior and the same citation handling as the other three.
Because we built it in the open, we can show customers things claude.ai can't: the exact queries Claude searched, the exact pages those queries returned, the exact paragraphs Claude pulled from. The approximation is more transparent than the thing it approximates. Other engines only offer Claude on premium tiers. We don't know how they do it, but KnitKnot includes it because it's too important to gate. ## What we're not claiming We are not saying our pipeline is claude.ai. claude.ai has tuning we don't see, ranking we don't control, and product decisions that change underneath us. Our engine is a defensible reconstruction of the same loop, validated against real Claude output. It will remain an approximation. That's how we respect Anthropic's terms. We're not scraping their product. We're not juggling accounts. We're running the same model against the same search backend with our own query logic, and we report it as such. When claude.ai changes its behavior in a way that diverges from our reconstruction, we'll see it in the source-overlap check and tune again. That's the contract. Real model, real web, real citations, validated against the thing it's approximating. Good enough to put a number next to. --- # Confident lies are worse than hedged ones > Accuracy and conviction are independent axes. Most AI benchmarks only measure the first one. We model the interaction between what AI knows and how sure it sounds. - Author: Max Wiesner - Published: 2026-04-30 - Canonical: https://knitknot.ai/blog/confident-lies-are-worse-than-hedged-ones/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## The same lie, two ways Here are two things ChatGPT might say about your company in a head-to-head comparison. Both are false. 
*"Acme does not support SOC 2 Type II compliance."* *"Acme's SOC 2 Type II compliance status isn't well-documented, so it may be worth verifying directly."* If you're just measuring accuracy, these are the same: one false claim, one penalty. But if you're a buyer reading this while evaluating compliance tools, they land completely differently. The first one closes a door. The second one sends the buyer to your website, where they find out you've had SOC 2 Type II for three years. The difference isn't accuracy. It's conviction. And those are independent axes. ## The accuracy-conviction matrix Most AI benchmarks operate on one dimension: is the claim true or false? That matters. But it's half the picture. The other half is how confidently the AI states the claim, because that determines whether the buyer acts on it, questions it, or ignores it entirely. Those two axes create four quadrants, and each one tells a different story:

| | High conviction | Low conviction |
|---|---|---|
| **True claim** | Best case. Buyer trusts it, acts on it. Your positioning lands. | Wasted truth. Correct info that the buyer second-guesses. You had the answer and the AI undersold it. |
| **False claim** | Worst case. Buyer acts on misinformation. The deal may be over before you know it happened. | Damage contained. The AI is wrong but sounds unsure. Buyer might verify. The door stays open. |

A benchmark that only measures accuracy treats the left column and right column as identical. But the buyer outcome is radically different. A confident falsehood is the worst quadrant because the buyer has no reason to question it. A hedged truth is a missed opportunity because the buyer does question it, even though the answer was right.

The interesting insight is that the top-right and bottom-right quadrants can be equally damaging. A true claim stated with no conviction and a false claim stated with low conviction both result in the buyer going elsewhere to verify. The difference is what they find when they get there.

## How we measure conviction today

Every AI response gets classified into one of three conviction tiers: **certain**, **tentative**, or **uncertain**. The classification happens inside the same semantic judging pass that scores the rest of the response. The judge reads the hedging language (*"I think," "it seems," "possibly," "to my knowledge"*) and the assertion language (*"definitely," "certainly," "is the clear choice"*) in context and assigns the tier.

Our first version counted regex matches against the raw text. It was cheap, but it misread quoted hedges, negated assertions, and hedges aimed at the competitor rather than at you. Reading conviction is a comprehension task, so it moved into the judge. The tiering stays deliberately conservative: we'd rather miss a confident lie than falsely amplify a penalty.

The conviction tier then modifies our claim accuracy component asymmetrically:

**Certain + false claims:** False claim count boosted by 1.3x before computing accuracy. This compounds. Three confident lies in the same response score significantly worse than three hedged ones.

**Uncertain + true claims:** Accuracy discounted by 0.9x. The information is correct, but the buyer doesn't know that. A hedged truth carries less weight in a purchasing decision than a confident one.

**Tentative:** No adjustment. The base score stands.
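
A minimal sketch of how that asymmetric adjustment could be applied. The article specifies only the three tiers, the 1.3x boost, and the 0.9x discount. The base accuracy formula (one minus the share of false claims), the names `Conviction` and `adjusted_accuracy`, and applying the discount to the whole response's accuracy are assumptions made for illustration.

```python
from enum import Enum


class Conviction(Enum):
    CERTAIN = "certain"
    TENTATIVE = "tentative"
    UNCERTAIN = "uncertain"


FALSE_CLAIM_BOOST = 1.3      # from the article: certain + false claims
HEDGED_TRUTH_DISCOUNT = 0.9  # from the article: uncertain + true claims


def adjusted_accuracy(total_claims: int, false_claims: int, tier: Conviction) -> float:
    """Claim accuracy in [0, 1] after the response-level conviction adjustment.

    Assumed base formula: accuracy = 1 - false_claims / total_claims.
    """
    if total_claims <= 0:
        raise ValueError("total_claims must be positive")
    if not 0 <= false_claims <= total_claims:
        raise ValueError("false_claims must be between 0 and total_claims")

    effective_false = float(false_claims)
    if tier is Conviction.CERTAIN:
        # Confident falsehoods count for more before accuracy is computed.
        effective_false *= FALSE_CLAIM_BOOST

    accuracy = max(0.0, 1.0 - effective_false / total_claims)

    if tier is Conviction.UNCERTAIN:
        # Hedged answers carry less weight even when the claims are correct.
        accuracy *= HEDGED_TRUTH_DISCOUNT
    return accuracy


print(round(adjusted_accuracy(10, 3, Conviction.TENTATIVE), 2))
print(round(adjusted_accuracy(10, 3, Conviction.CERTAIN), 2))
print(round(adjusted_accuracy(10, 0, Conviction.UNCERTAIN), 2))
```

Under these assumptions, three false claims out of ten score 0.70 when the response is tentative and about 0.61 when it is certain, while a fully correct but uncertain response scores 0.90.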
This is an approximation. Three buckets and two constants don't capture the full conviction spectrum. But even this coarse model surfaces real patterns: some engines state falsehoods with more certainty than others, consistently, across dozens of evaluations. ## Where this goes The conviction axis opens up analysis that pure accuracy benchmarks can't do. **Per-claim conviction scoring.** Right now we classify conviction at the response level. The next step is per-claim: the AI might hedge on one claim and assert another confidently in the same paragraph. A response with one confident lie and four hedged truths tells a different story than five tentative claims. **Engine conviction profiles.** We already see that engines have different conviction signatures. Some models hedge systematically. Some state everything with equal confidence regardless of whether they have evidence. Plotting conviction against accuracy per engine reveals which models are calibrated (high conviction correlates with high accuracy) and which are overconfident (high conviction, mixed accuracy). **Conviction drift.** As models update, their conviction patterns shift. A model that used to hedge on a topic might start asserting it confidently after a training data refresh. Tracking conviction over time reveals when an engine's relationship with your company's information changes, even if the underlying accuracy stays flat. The goal is a continuous conviction score, 0.0 to 1.0, derived from linguistic markers. The penalty becomes a curve, not a step function. A claim stated at 0.95 conviction that's wrong is dramatically worse than one at 0.6 conviction. And a true claim at 0.3 conviction is almost as bad as a false one, because the buyer walks away unconvinced either way. Accuracy tells you what the AI knows. Conviction tells you what the buyer believes. We think you need both. 
--- # What a candidate asks AI about your company > A senior leader at a mid-size company asked us to track how AI describes them to candidates weighing offers. It wasn't our use case. With barely any changes, it worked. - Author: Kevin Kho - Published: 2026-04-22 - Canonical: https://knitknot.ai/blog/brand-health-from-a-recruiting-question/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- You get a job offer. What do you do next? You check Glassdoor. You look up the executive team on LinkedIn. You find the Reddit threads discussing the product. You ask a friend who knows someone who worked there. And now, if you're like most candidates, you ask AI. "What should I know about working at {company}?" "Is this a good place for software engineers?" Same diligence people have always done. AI just made it instant. A senior leader at a mid-size company reached out. His team was struggling to recruit AI engineers specifically. He wanted something to bring back to his executive team that showed what candidates were reading about them. We ran a couple of prompts together. What came back was a mix of things the company already knew and things that stung a little. ChatGPT was telling candidates the company was below the band on comp for AI roles. Whether that was accurate or not, that's what candidates were reading before they decided whether to engage. You can have a perfectly reasonable comp structure, but if the model says you're below market, that's the first thing a candidate sees when they ask "should I take this offer?" The company had a history of layoffs, and the AI engines surfaced that prominently. Not buried in a footnote. Right there in the first paragraph of the response. Every candidate who asked about the company got a reminder that people had been let go. And then there were comments from executives, stuff that was reasonable in context, anti-AI-hype positions that probably made sense in a press interview. 
But when AI strips the context and feeds them back to a candidate weighing an offer, they land different. "The CTO doesn't believe in AI" is a very different sentence when you're an AI engineer deciding where to work. It wasn't all negative. The models also said this company was one of the leaders in their space. Competitive, well-positioned. And the AI work so far has been fairly minimal, which means it's greenfield. That's exciting if you're the kind of person who wants to build something from scratch. Or it's a red flag if you read "minimal AI investment" and see another indicator that the company isn't serious about it. Depends on the candidate. AI doesn't spin. It just presents both sides. None of this was wrong, exactly. But it was shaping whether people showed up. And this is the thing about brand presence in AI. It's not a marketing problem. It's not something you fix with better copy or a refreshed careers page. The model is pulling from press releases, earnings calls, Reddit threads, LinkedIn posts, news coverage. It's synthesizing a story about your company in real time, and that story is what candidates read before they ever talk to a recruiter. Most companies have no idea what that story says. We built KnitKnot for head-to-head sales bake-offs. Compare you vs your competitor, find where AI gets it wrong, fix it. Recruiting use cases weren't on the roadmap. But the infra could support it. Same prompts, same scoring, different question. With barely any changes, he had prompts running across ChatGPT, Claude, Perplexity, and Gemini. Track what shifts over time. This is one of those moments where a customer walks in the door for one thing, sees what the product actually does, and asks "can I use it for this?" That's the most honest signal you can get. We didn't pitch recruiting. He saw the infrastructure and asked the question himself. We're not generating playbooks for this yet. 
Our processing pipelines are mainly built for head-to-head comparisons, so we don't totally fulfill the use case. But we can get the prompts, track them, and keep a score. Now we know it works. This isn't a high priority compared to the head-to-head evaluations for B2B SaaS companies, but it is a clear reminder we're just scratching the surface. --- # We stopped asking AI who wins > Most LLM-as-judge systems ask one question: who's better? We decompose into structured signals and derive the outcome deterministically. Here's why. - Author: Max Wiesner - Published: 2026-04-15 - Canonical: https://knitknot.ai/blog/we-stopped-asking-ai-who-wins/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- ## The obvious approach We don't ask an LLM judge who won. We extract structured signals from every AI response and derive the outcome deterministically, because a score you can't explain isn't a measurement. We learned this the usual way. When we started building KnitKnot's scoring pipeline, we sent the AI response to GPT and asked: *"On a scale of 0-100, how well is this company represented?"* It worked, sort of. We got numbers back. We put them in charts. But when a customer asked why they scored 52, we couldn't answer. The model had mashed recommendation strength, factual accuracy, sentiment, feature coverage, and source quality into a single number with no trail. Two runs on the same response would come back 48 and 57. A response that recommended the competitor but said nice things about you scored the same as one that recommended you but got a key claim wrong. The number was a vibe check, not a measurement. Vibe checks are fine for prototypes, but when a company is making content decisions based on your score, the score has to decompose into something they can act on. So we stopped asking. 
## Seven signals, not one opinion

Instead of one holistic judgment, we extract structured fields from every AI response and compute the score deterministically. Seven components, each 0.0 to 1.0, each with a clear definition:

**Recommendation.** Did the AI explicitly recommend your company, the competitor, or neither? Discrete signal. 1.0 if you were picked, 0.5 if the AI declined to choose, 0.0 if the competitor was picked. No partial credit for "also mentioned."

**Feature comparisons.** For every feature the AI compared, did you win, lose, or tie? The component is `(wins + 0.5 × ties) / total`. If the AI didn't compare any features, that's a weak negative: 0.3, not a pass. Silence isn't neutral when a buyer is evaluating you.

**Claim accuracy.** How many of the AI's claims about your company were wrong? For brand-level evaluations, misrepresentations are severity-weighted: a critical factual error about core positioning (weight 1.0) counts five times more than a low-severity tone nitpick (weight 0.2). Five critical misrepresentations saturate the score to zero. Any more than that is catastrophic.

**Sentiment.** Positive, neutral, or negative. Simple, but it catches something the other components don't: the AI can recommend you, get every claim right, and still frame you dismissively. That 0.25 "dismissive" score is the difference between *"Acme is a solid choice"* and *"You could try Acme, I guess."*

**Source balance.** What fraction of the AI's cited sources belong to your domain versus competitor domains? If the AI cited five competitor blog posts and zero of your pages, your source balance is 0.0. The response was written from competitor marketing material.

**Coverage.** How much of the response actually discussed you? This is the multiplier that sits on top of everything else. Primary or substantial coverage: full score. Peripheral mention: score × 0.4. Incidental (you appeared in a list): score × 0.1. Absent: zero, regardless of how positive the rest of the response was. A perfect recommendation in a sentence that nobody reads carries no weight.

**Confidence penalty.** This one is subtle. When the AI states a false claim with certainty, we boost the penalty by 1.3×. When the AI hedges on a correct claim, we apply a mild 0.9× discount. The intuition: a buyer who reads *"Acme definitely does not support SOC 2"* walks away with a different impression than one who reads *"I'm not entirely sure about Acme's SOC 2 status."* Confident misinformation is more damaging than hedged truth is reassuring.

Each component gets a per-category weight, renormalized so they sum to 1.0. Same inputs, same score, every time. No temperature, no prompt sensitivity, no "run it again and hope for the best."

## What this actually looks like

Say a buyer asks ChatGPT: *"Compare Acme and Widgetly for enterprise compliance automation."* The response comes back positive-sounding. Mentions both companies. Feels balanced. A holistic judge might give Acme a 58: "moderate representation, room for improvement." Helpful? Not really.

Here's what our decomposition surfaces:

| Component | Value | What it means |
|---|---|---|
| Recommendation | 0.0 | AI explicitly recommended Widgetly |
| Feature comparisons | 0.75 | Acme won 3 of 4 feature matchups |
| Claim accuracy | 0.35 | 1 false claim, stated with certainty (1.3× penalty) |
| Sentiment | 0.5 | Neutral tone |
| Source balance | 0.2 | 4 of 5 cited sources are Widgetly's blog |
| Coverage | 1.0 | Primary discussion (no discount) |
| **Weighted score** | **34.2** | |

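
Several of these values follow from simple arithmetic, which the sketch below works through in code. It is an illustration under stated assumptions, not KnitKnot's scoring code. The function names, the equal weights in `EQUAL_WEIGHTS`, the 0-100 scaling, and treating coverage as a multiplier on the weighted sum are hypothetical, and the claim accuracy and sentiment values are copied from the table, not derived.

```python
# Coverage multipliers as listed in the article; "absent" zeroes the score.
COVERAGE_MULTIPLIER = {
    "primary": 1.0,
    "substantial": 1.0,
    "peripheral": 0.4,
    "incidental": 0.1,
    "absent": 0.0,
}


def recommendation(picked: str) -> float:
    """1.0 if you were picked, 0.5 if the AI declined to choose, 0.0 if the competitor was."""
    return {"you": 1.0, "neither": 0.5, "competitor": 0.0}[picked]


def feature_comparisons(wins: int, losses: int, ties: int) -> float:
    """(wins + 0.5 * ties) / total, or the 0.3 weak negative when nothing was compared."""
    total = wins + losses + ties
    if total == 0:
        return 0.3
    return (wins + 0.5 * ties) / total


def source_balance(own_sources: int, competitor_sources: int) -> float:
    """Fraction of cited sources that belong to your domain."""
    total = own_sources + competitor_sources
    if total == 0:
        raise ValueError("no cited sources; the article does not define this case")
    return own_sources / total


def weighted_score(components: dict[str, float], weights: dict[str, float], coverage: str) -> float:
    """Renormalize the weights, take the weighted sum, apply coverage, scale to 0-100."""
    total_weight = sum(weights[name] for name in components)
    base = sum(components[name] * weights[name] for name in components) / total_weight
    return 100 * base * COVERAGE_MULTIPLIER[coverage]


# The Acme example. Assumes the fifth cited source is Acme's own page.
acme = {
    "recommendation": recommendation("competitor"),
    "feature_comparisons": feature_comparisons(wins=3, losses=1, ties=0),
    "claim_accuracy": 0.35,  # copied from the table
    "sentiment": 0.5,        # copied from the table
    "source_balance": source_balance(own_sources=1, competitor_sources=4),
}

# Hypothetical equal weights; the real per-category weights are not published here.
EQUAL_WEIGHTS = {name: 1.0 for name in acme}

print(round(weighted_score(acme, EQUAL_WEIGHTS, "primary"), 1))
```

With equal weights this lands near 36, not 34.2. The table's 34.2 reflects KnitKnot's own per-category weights, which the article does not list.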
That's not a "58, try harder." That's a specific map of what went wrong. The features are fine. Acme is winning the technical comparison. The problems are: a confident factual error that needs correcting (likely outdated information the AI trained on), a recommendation that went to the competitor despite Acme winning on features (check what Widgetly is doing in its content strategy), and a source imbalance where the AI built its answer from competitor content. Three different problems. Three different fixes. None of them are "make your product better." ## Why this matters Every score in KnitKnot ships with its full component vector. Not *"you scored 34"* but *"you scored 34 because: recommendation lost, one false claim stated with high confidence, strong feature coverage, sources dominated by competitor content."* That's the difference between a number and a diagnostic. The number tells you something is wrong. The components tell you what to fix, in what order, and whether it's a content problem, a positioning problem, or an accuracy problem. Determinism also makes the numbers auditable. The structured signals are written down at scoring time, and every aggregate reads from one canonical metrics layer, so any headline number on a report drills down to the exact underlying AI responses. The drill-down list is the same row set that produced the number. There is one definition of a win rate, used everywhere, and the dashboard and the public report cannot disagree because they read the same rows. And when the inputs are structured, you can diff across runs. Each run's Measurement tab shows the deltas against the previous run. Score dropped 8 points this month? The component vector tells you it was a recommendation swing, not a sentiment change. Score went up but you didn't do anything? A misrepresentation got corrected in the training data. Every movement is traceable. We made this choice because we think a benchmark you can't explain isn't a benchmark. 
It's a guess with a confidence interval. --- # Introducing KnitKnot > I don't compare tools on Google anymore. I ask Claude. Most of the engineers I know are doing the same thing. KnitKnot is the company we're building to measure what happens in that gap, when buyers ask AI about you instead of asking the internet. - Author: Kevin Kho - Published: 2026-04-07 - Canonical: https://knitknot.ai/blog/introducing-knitknot/ - Publisher: KnitKnot, the AI Presence Management platform (https://knitknot.ai) --- A few months ago I realized I'd stopped comparing tools on Google. Every time I needed something new, auth, vector DB, transactional email, I just asked Claude what to use. Then I signed up for whatever it told me. No demo, no sales call, no shortlist of three vendors to evaluate. I started asking around. Most of the engineers I talked to were doing the same thing. So were a lot of the founders. The buying conversation that used to happen on a sales call now happens in a single prompt, before the vendor knows it's happening at all. That's what KnitKnot is for. ## How we got here We started KnitKnot building a digital sales room. Same lane as Aligned, Dock, and Seismic. Better deal collateral, better mutual action plans, better buyer experience. It was a real problem, but every conversation kept landing in the same place: nice to have, not urgent. The interesting thread was a smaller one. A handful of companies selling to engineers told us their buyers hated being on sales calls. The reps hated it too. The question that came out of those conversations was whether we could facilitate a rep-less buying experience. Could a champion come in, figure out what they needed, pitch it internally, and close, without anyone on the vendor side getting involved? We put it on the roadmap as a feature. It didn't feel like the company. Then I noticed I was already buying tools that way. Almost everyone I asked was. We weren't building for a hypothetical future. 
We were building for what we were already doing ourselves. So we made it the whole product. ## The question that locked us in Once you accept that buyers are asking AI before they're asking you, a different question shows up. I think most founders haven't sat with it yet. If an agent landed on your website today, what would you want it to see in order to buy? Not a human visitor. An agent with thirty seconds and a directive to compare you against two competitors. What's on your pricing page that helps it decide? What's in your docs? What does the comparison article ranking second for your category say about you? The fully agentic version of this isn't speculative. Every piece already works somewhere. The distance between "Claude tells me which auth provider to use" and "Claude signs me up for it" is shorter than it looks. I'm honestly not sure about the timeline. Twelve months feels aggressive, thirty-six months feels conservative. But the question is the same either way. For a growing share of B2B, the buyer isn't a human, and most companies are still writing for the version of the buyer they're used to. ## A channel you can't see I think this is what makes AI different from any acquisition channel B2B has had before. Every previous channel left a trail. SEO has Search Console. Paid has the ads dashboard. Outbound has the CRM. Events have badge scans. Even word of mouth shows up in "how did you hear about us." If you cared to look, you could see what was happening. AI doesn't work that way. The conversation happens off your property, leaves no logs, and reaches the buyer with a position already formed. By the time they show up on your site, the model has told them who you are, who you aren't, and who they should compare you to. The first impression has been made without you. ## What KnitKnot does That led us to the MVP of KnitKnot. Three parts: **1. 
Benchmarks.** We generate buyer-style prompts that pit you against the competitors your reps hear in discovery, and run them across ChatGPT, Claude, Perplexity, and Gemini. A structured judge scores each response on factual accuracy, feature attribution, competitive framing, and citation quality. You get a single AI Presence Score and a per-engine breakdown. ![A KnitKnot benchmark run with an AI Presence Score of 77/100 and a per-engine breakdown across ChatGPT, Claude, Perplexity, and Gemini.](/images/blog/introducing-hero.png) **2. Reports.** The score is the headline. The gap report is the product. For every losing evaluation, we surface the factual error, the missing feature, or the misattributed strength that drove the loss, along with which third-party source the model cited to justify it. Every line links back to the raw model response. You can read what was said and why. **3. Playbooks.** Every benchmark ends with a ranked list of content tactics: pages to write, features to surface, sources to influence. Each tactic is scored by how many of your losing evaluations it would flip if the content landed in the model's next training cycle. The next benchmark shows you whether it worked. ## Who this is for KnitKnot is for you if: - Your product gets compared to two or three other vendors in sales calls, and you can name them. - You've noticed buyers showing up to demos already convinced of a position you didn't put in front of them. - You're responsible for how your company is positioned in the market, and you've started wondering what AI is saying when you're not in the room. We're working closely with a small group of design partners right now to sharpen the product before general availability. If any of the above sounds like you, [get in touch](https://knitknot.ai/). We'll run a benchmark and walk you through the report together. ---