AI is lying about your company
8 minute read
Max Wiesner
Co-founder, KnitKnot
We expected tone problems. We found wrong facts.
Roughly one in four companies in our first 2,000 benchmark evaluations had at least one AI claim about them that was flatly, verifiably wrong. Not “framed unfavorably.” Wrong. The product costs $X, and the AI said it costs $Y. The product supports feature Z, and the AI said it doesn’t.
That’s not what we expected. When we started building the claim extraction pipeline, we assumed the main problem would be soft: tone issues, vague positioning, things you notice on the third read but wouldn’t flag on the first.
So we pulled every factual claim from those 2,000 evaluations across ChatGPT, Claude, Perplexity, and Gemini. A claim is anything the AI stated as fact about a specific company: pricing, features, founding date, customer base, integrations, market position. We checked each one against the company’s current website, documentation, and public records.
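To make the unit of analysis concrete, here is a minimal sketch of a claim record and a check against ground truth. The field names, the verdict labels, the `ExampleCo` data, and the case-insensitive exact match are all illustrative assumptions, not a description of the actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    company: str
    engine: str      # e.g. "ChatGPT", "Claude", "Perplexity", "Gemini"
    attribute: str   # e.g. "entry_price", "founding_year" (hypothetical names)
    value: str       # what the AI stated as fact


def check_claim(claim, ground_truth):
    """Return 'correct', 'wrong', or 'unverifiable' for one claim.

    ground_truth maps (company, attribute) -> the currently verified value.
    """
    expected = ground_truth.get((claim.company, claim.attribute))
    if expected is None:
        return "unverifiable"  # nothing on record to compare against
    same = claim.value.strip().lower() == expected.strip().lower()
    return "correct" if same else "wrong"


# Hypothetical data for illustration only.
truth = {("ExampleCo", "founding_year"): "2021"}
claims = [
    Claim("ExampleCo", "ChatGPT", "founding_year", "2018"),
    Claim("ExampleCo", "Claude", "founding_year", "2021"),
    Claim("ExampleCo", "Gemini", "headquarters", "Austin"),
]
verdicts = [check_claim(c, truth) for c in claims]
```

Exact matching is the simplest possible comparison. Real claims about pricing or features would likely need normalization before they can be compared this way.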
The errors weren’t distributed evenly. They clustered into five patterns.
Stale pricing was the most common
The single most frequent factual error was outdated pricing. A company restructures its pricing in January. The AI is still quoting the old numbers in June. Not approximately. Exactly. The old tier names, the old dollar amounts, sometimes even a free tier that was sunset a year ago.
We saw this with a compliance automation company that had dropped its entry price by 40% six months earlier. Every model was still quoting the old number. For a buyer with a budget constraint, that’s the difference between “this might work” and “too expensive, next.” The company lowered its price specifically to win more mid-market deals, and the AI was actively undermining that strategy in every evaluation.
Pricing errors are especially damaging because they’re the most actionable claim in an AI response. A buyer can tolerate uncertainty about features or positioning. Pricing is binary. It either fits the budget or it doesn’t.
Feature misattribution was the most infuriating
The second pattern was features attributed to the wrong company.
We benchmarked two competing developer tools that both describe their core architecture as “event-driven.” In three separate evaluations, Claude attributed Company A’s webhook-based architecture to Company B and vice versa. The AI had no way to distinguish them because the marketing language was nearly identical. The features were different. The descriptions were interchangeable.
This shows up most often with technical capabilities where the industry has converged on shared vocabulary. “Real-time sync,” “native integrations,” “API-first,” “enterprise-grade security.” When every company in a category uses the same adjectives, the AI blends their capabilities into a composite that belongs to no one.
The result was a feature comparison table in the AI response that was coherent, specific, and wrong. A buyer reading it would have no reason to question it. Every claim had the right shape. The attributions were just reversed.
Feature misattribution is the error that makes people angriest when we show them a benchmark for the first time. Not because the AI said something vague. Because it said something specific and confident and attributed their work to someone else.
Competitor framing was the hardest to detect
The third pattern wasn’t a factual error in the traditional sense. It was narrative control.
When we looked at evaluations where a company lost the recommendation, we started pulling the cited sources. In a surprising number of cases, the highest-influence source wasn’t a neutral third party. It was the competitor’s comparison page. The competitor had published a detailed “Us vs Them” page, structured for AI extraction, and the AI had built its competitive framing from that page.
The effect was subtle. The AI didn’t say anything false about the losing company. It framed the entire comparison using the competitor’s evaluation criteria. Their strengths were the dimensions being compared. The losing company’s strengths weren’t mentioned because they weren’t in the competitor’s comparison framework.
This is what motivated us to build the source gravity model. A flat citation count misses this entirely. The competitor might have fewer total citations, but if their one source shaped the recommendation, they controlled the answer.
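The difference between counting citations and weighting them can be sketched in a few lines. This is not the source gravity model itself: the `influence` field is a hypothetical per-citation weight, and how such a weight would be estimated is outside the scope of the sketch. The numbers are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical citations from one evaluation. "influence" is an assumed
# 0-to-1 estimate of how much each source shaped the recommendation.
citations = [
    {"owner": "losing_company", "influence": 0.10},
    {"owner": "losing_company", "influence": 0.10},
    {"owner": "losing_company", "influence": 0.05},
    {"owner": "competitor", "influence": 0.60},
    {"owner": "third_party", "influence": 0.15},
]


def flat_counts(cites):
    """Count citations per source owner, ignoring influence."""
    return Counter(c["owner"] for c in cites)


def weighted_influence(cites):
    """Sum influence per source owner."""
    totals = defaultdict(float)
    for c in cites:
        totals[c["owner"]] += c["influence"]
    return dict(totals)


counts = flat_counts(citations)
weights = weighted_influence(citations)
top_by_count = max(counts, key=counts.get)
top_by_influence = max(weights, key=weights.get)
```

In this made-up example the losing company has the most citations, yet the competitor's single page carries most of the weight, which is the situation a flat count cannot see.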
We think this pattern is the most important one for B2B companies to understand because it’s the most fixable. The losing company didn’t have a content problem in general. They were missing one specific page: the comparison page that answers the buyer’s exact question from their own frame. Without it, the competitor’s frame wins by default.
Fabrication was less common than we expected
AI hallucination gets a lot of press. We expected it to dominate. It didn’t.
Full fabrication, the AI inventing details that have no basis in any source, showed up in roughly 5% of the claims we flagged as errors. It was concentrated in smaller companies where the AI had sparse training data. When the model doesn’t have enough information, it interpolates from similar companies. One startup in our dataset was described by ChatGPT as “founded in 2018 and headquartered in Austin” when it was founded in 2021 in San Francisco. The AI had apparently merged it with a similarly named company in an adjacent space.
The takeaway for us was that hallucination is real but it’s not the primary accuracy problem. Staleness and misattribution are far more common and far more damaging at scale. Most AI errors aren’t invented from nothing. They’re real facts applied to the wrong company or the right facts from the wrong point in time.
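One way to picture the distinction between stale, misattributed, and fabricated claims is a small classifier over a wrong value. This is a simplified sketch under stated assumptions: the function name, the labels, and the idea of keeping a per-attribute value history and a set of competitor values are all hypothetical, and the prices and features below are invented.

```python
def classify_claim(value, current, history, competitor_values):
    """Label one AI-stated value for one attribute of one company.

    current           -- the value that is true today
    history           -- values that used to be true for this company
    competitor_values -- values that are true for other companies
    """
    if value == current:
        return "correct"
    if value in history:
        return "stale"          # a right fact from the wrong point in time
    if value in competitor_values:
        return "misattributed"  # a real fact applied to the wrong company
    return "fabricated"         # no basis in any known source


# Hypothetical examples: a price cut from $100 to $60, and a feature mix-up.
stale = classify_claim("$100/mo", "$60/mo", ["$100/mo"], set())
swapped = classify_claim("webhooks", "polling", [], {"webhooks"})
invented = classify_claim("$250/mo", "$60/mo", ["$100/mo"], {"$80/mo"})
```

The order of the checks matters: a value that is both historically true and true of a competitor is labeled stale here, which is a design choice rather than a fact about the data.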
Omission was invisible in our data until we looked for it
The last pattern was the one we almost missed.
We were focused on claims the AI made. We weren’t counting the cases where the AI made no claim because it didn’t mention the company at all. A benchmark company would ask us to run their category, and in a significant fraction of landscape prompts, their name simply didn’t appear. The AI surfaced four or five competitors and skipped them entirely.
This doesn’t show up as an error in claim-level analysis. There’s nothing wrong to flag. The company is just absent. And unlike Google, where being on page two still gets you some impressions, there is no page two in an AI response. Either you’re in the answer or you aren’t.
We started tracking omission rate as a separate metric after noticing this. For some companies, the omission problem was bigger than the accuracy problem. They weren’t losing because the AI was getting them wrong. They were losing because the AI didn’t know they existed.
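As a rough illustration, omission rate can be computed as the share of responses in which the company never appears. The function name, the sample responses, and the naive substring match are assumptions for the sketch. A real check would also need to handle aliases and product names.

```python
def omission_rate(responses, company):
    """Fraction of AI responses that never mention the company."""
    if not responses:
        raise ValueError("need at least one response")
    needle = company.lower()
    missing = sum(1 for text in responses if needle not in text.lower())
    return missing / len(responses)


# Hypothetical landscape-prompt responses for illustration.
responses = [
    "Top tools: Alpha, Beta, Gamma.",
    "Consider Alpha, ExampleCo, and Delta.",
    "Beta and Gamma lead the category.",
    "ExampleCo is a strong option alongside Alpha.",
    "Alpha, Beta, Delta, Epsilon.",
]
rate = omission_rate(responses, "ExampleCo")
```

Here the company is absent from three of five responses, so its omission rate is 0.6 even though nothing the AI said about it was wrong.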
What this changed about how we score
These patterns directly shaped how we built the scoring decomposition.
Stale pricing and feature misattribution are claim accuracy problems. They feed the accuracy component, severity-weighted by how confidently the AI stated the error. A wrong price stated as fact hurts more than a wrong price stated with a hedge. In the report, each one surfaces as a misrepresentation with a proof receipt: the exact knowledge-base source that contradicts what the AI said. The fix is a specific page, not a guess.
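A minimal sketch of severity weighting, assuming just two confidence labels and made-up weights. The labels, the weight values, and the formula are assumptions chosen to show the idea, not the actual scoring function.

```python
# Assumed weights: an error stated as fact costs twice a hedged one.
SEVERITY = {"stated_as_fact": 1.0, "hedged": 0.5}


def accuracy_component(claims):
    """claims: list of (is_correct, confidence_label) pairs.

    Returns 1.0 for no errors, lower as confidently stated errors pile up.
    """
    if not claims:
        raise ValueError("need at least one claim")
    penalty = sum(SEVERITY[label] for ok, label in claims if not ok)
    return 1.0 - penalty / len(claims)


three_right = [(True, "stated_as_fact")] * 3
wrong_as_fact = accuracy_component(three_right + [(False, "stated_as_fact")])
wrong_hedged = accuracy_component(three_right + [(False, "hedged")])
```

With four claims and one error, the confidently wrong price scores 0.75 and the hedged one scores 0.875, so the same mistake costs less when the AI signals uncertainty.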
Competitor framing is a source influence problem. It feeds the source balance component and the recommendation component. When we see a competitor-owned source driving the recommendation, we know the fix is content-level, not product-level.
Omission is a coverage problem. It’s why we weight coverage depth as a multiplier on everything else. A company that appears in 20% of relevant evaluations with perfect accuracy has a fundamentally different problem than a company that appears in 80% of evaluations with mediocre accuracy. The first company needs to exist in the AI’s answer set. The second company needs to fix what the AI says about them once it does.
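The two companies above can be worked through with a simple product. Treating the multiplier as plain multiplication, and taking "mediocre" to mean an accuracy of 0.6, are assumptions made only for the arithmetic.

```python
def effective_score(coverage, accuracy):
    """Coverage scales whatever the AI gets right when it does mention you."""
    return coverage * accuracy


# Company one: rarely mentioned, but perfect when it is.
rarely_seen = effective_score(coverage=0.20, accuracy=1.0)

# Company two: usually mentioned, with mediocre accuracy (0.6 assumed).
often_seen = effective_score(coverage=0.80, accuracy=0.6)
```

Under these assumptions the first company lands near 0.2 and the second near 0.48. Perfect accuracy cannot lift the first company above its coverage, which is why its fix is visibility rather than corrections.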
Every error type implies a different fix. That’s the whole point of decomposing the score instead of asking the AI to rate you on a scale of 0 to 100. A single number hides whether the problem is pricing, features, framing, or visibility. The component vector tells you which one to work on first.
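To illustrate why a vector is more useful than a single number, the sketch below picks the weakest component and maps it to the kind of fix described above. The component values are invented, and choosing the minimum is an assumed prioritization rule, not necessarily how the report ranks fixes.

```python
# Hypothetical component scores on a 0-to-1 scale.
components = {
    "accuracy": 0.62,
    "source_balance": 0.35,
    "recommendation": 0.50,
    "coverage": 0.80,
}

# Fix types paraphrased from the patterns described in this post.
FIXES = {
    "accuracy": "correct the page the AI contradicts (pricing, features)",
    "source_balance": "publish your own comparison page",
    "recommendation": "answer the buyer's question from your own frame",
    "coverage": "get into the AI's answer set",
}


def first_priority(scores):
    """Return the name of the lowest-scoring component."""
    return min(scores, key=scores.get)


weakest = first_priority(components)
suggested_fix = FIXES[weakest]
average = sum(components.values()) / len(components)
```

The average of these four numbers is about 0.57, which says nothing about where to start. The vector says the problem is source balance.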
How long do AI errors persist after you fix the source?
It varies by engine, and it’s now measurable. Perplexity updates fast because it pulls from live search. ChatGPT is slower. Claude is unpredictable. Every benchmark run writes a per-engine trend snapshot, and each run’s Measurement tab shows the deltas since the previous run: mentions gained and lost, entity shifts, citation changes. Fix the pricing page, re-benchmark, and the landing date of the fix is observable instead of guessed.
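The mentions gained and lost between two runs reduce to set differences. The sketch below assumes each run is stored as a mapping from engine to the set of prompt IDs where the company was mentioned. That storage shape, the prompt IDs, and the function name are hypothetical.

```python
def mention_deltas(previous, current):
    """Compare two sets of prompt IDs where the company was mentioned."""
    return {
        "gained": sorted(current - previous),
        "lost": sorted(previous - current),
    }


# Hypothetical snapshots from two consecutive benchmark runs.
previous_run = {"Perplexity": {"p1", "p2"}, "ChatGPT": {"p1"}}
current_run = {"Perplexity": {"p1", "p2", "p3"}, "ChatGPT": set()}

deltas = {
    engine: mention_deltas(previous_run[engine], current_run.get(engine, set()))
    for engine in previous_run
}
```

In this example Perplexity gains a mention on `p3` while ChatGPT loses one on `p1`, which is the kind of per-engine difference a single aggregate would hide.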
The open question we’re still working on is cross-model divergence on factual claims. The same buyer question put to four models often produces four different sets of facts about the same company. When ChatGPT says you support a feature and Claude says you don’t, one of them is wrong, and the buyer’s experience depends on which model they happened to open. Cross-model claim consistency is an underexplored dimension of AI presence.
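A first pass at measuring that divergence could group claims by attribute and flag any attribute where the engines disagree. This is a sketch of one possible approach, not an existing metric: the tuple layout, the attribute names, and the sample values are assumptions.

```python
from collections import defaultdict


def divergent_attributes(claims):
    """claims: iterable of (engine, attribute, value) tuples for one company.

    Returns only the attributes on which engines state different values.
    """
    by_attribute = defaultdict(dict)
    for engine, attribute, value in claims:
        by_attribute[attribute][engine] = value
    return {
        attribute: engines
        for attribute, engines in by_attribute.items()
        if len(set(engines.values())) > 1
    }


# Hypothetical claims about one company from three engines.
claims = [
    ("ChatGPT", "supports_sso", "yes"),
    ("Claude", "supports_sso", "no"),
    ("Gemini", "supports_sso", "yes"),
    ("ChatGPT", "founding_year", "2021"),
    ("Claude", "founding_year", "2021"),
]
divergent = divergent_attributes(claims)
```

Disagreement alone does not say which engine is right. It only narrows down which claims are worth checking against ground truth first.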