What ChatGPT says when a buyer asks to compare you
8 minute read
Max Wiesner
Co-founder, KnitKnot
One prompt, four answers
We asked ChatGPT, Claude, Perplexity, and Gemini the exact same comparison prompt about the same company. We got four different answers, two wrong prices, and one recommendation built entirely from the competitor’s own blog.
The test is the simplest one we run. Take the company’s top competitor and ask all four models: “Compare [company] vs [competitor] for [their primary use case].” It’s the prompt that matters most because it’s the prompt buyers actually type. Not “tell me about Acme.” Not “what is Acme.” The comparison prompt. The one that implies a decision.
We did this for a Series B infrastructure company (anonymized here, with their permission to share the shape of the results). Their top competitor is a well-funded incumbent with strong content marketing. The prompt was straightforward: “Compare [Company] vs [Competitor] for real-time data processing in production environments.”
Here’s what came back.
ChatGPT recommended the competitor
ChatGPT produced a structured response. Introduction, feature comparison, pricing comparison, recommendation. Professional tone. It read like a well-written analyst brief.
The recommendation went to the competitor. The reasoning: better enterprise support, more mature documentation, wider integration ecosystem. Two of those three claims were accurate. The third, the integration ecosystem, was about eight months out of date: the company had shipped 12 new integrations since the snapshot ChatGPT was working from.
But the more interesting finding was in the sources. ChatGPT cited five pages. Three of them were the competitor’s own content: a comparison page titled “[Competitor] vs [Company],” a blog post about their integration ecosystem, and their enterprise features page. The other two were a neutral G2 review and the company’s own docs landing page.
The competitor had published a comparison page. The company hadn’t. So the AI built its competitive framework from the competitor’s frame. The evaluation criteria, the feature dimensions, the overall narrative arc came from the competitor’s content strategy. The AI just synthesized it into an authoritative-sounding answer.
Claude disagreed
Claude, running through Brave Search, recommended the company. Same prompt, opposite conclusion.
The reasoning was different. Claude focused on performance benchmarks and developer experience. It cited the company’s technical documentation heavily, including a benchmarking page that demonstrated latency advantages. The competitor’s marketing content didn’t rank as well in Brave’s index, so Claude’s source mix skewed toward technical documentation rather than marketing pages.
Interestingly, Claude also got the pricing wrong, but in the opposite direction from ChatGPT. ChatGPT quoted the company’s old pricing (too high). Claude quoted the competitor’s old pricing (too low). Neither model had current pricing for either vendor.
Two models, two wrong prices, two different recommendations. The buyer’s experience depends entirely on which app they opened first.
Perplexity was balanced but cited Reddit
Perplexity presented a balanced comparison without a strong recommendation. It acknowledged strengths on both sides and suggested the choice depended on the use case.
The source list was revealing. Perplexity cited six sources: a Reddit thread from r/dataengineering where someone asked about the two products, a Hacker News comment from a user who had evaluated both, two vendor pages (one from each company), and two blog posts from third-party engineering blogs.
The Reddit thread was 11 months old. The commenter who recommended the competitor did so based on a limitation the company had since fixed. Perplexity treated it as current information because the thread was recent enough to be in its index.
This pattern, Reddit threads carrying disproportionate weight in Perplexity’s citations, is consistent across our benchmarks. Research shows that 46.7% of Perplexity’s top cited sources come from Reddit. For companies in technical categories where Reddit discussions are active, this means community perception from months ago is actively shaping how Perplexity represents them to current buyers.
Gemini barely knew them
Gemini’s response was the shortest. It described both companies in general terms, got the high-level positioning right, but didn’t have enough detail to make a meaningful comparison. It fell back to generic recommendations: “evaluate based on your specific requirements” and “consider requesting demos from both vendors.”
The company appeared in the response, which means they cleared the visibility threshold. But the lack of depth meant Gemini couldn’t compare them on any dimension that would actually help a buyer decide.
Gemini often skips companies with lower public profiles entirely. We track omission rate as a separate metric. Being absent from a Gemini response isn’t the same problem as being misrepresented in a ChatGPT response. Different models, different failure modes.
What we learned from one prompt
This single prompt, run across four models, surfaced five problems the company didn’t know they had.
Two models had wrong pricing. ChatGPT was quoting the company’s pre-restructuring price. Claude was quoting the competitor’s old price. A buyer comparing costs in either model would make a decision based on numbers that hadn’t been accurate for months.
The competitor’s comparison page was driving ChatGPT’s recommendation. The company had strong technical content but no published comparison page. The competitor did. The AI built its evaluation framework from that page. The recommendation followed logically from the competitor’s framing.
An 11-month-old Reddit thread was shaping Perplexity’s analysis. A user had mentioned a limitation that had since been fixed. Perplexity was still presenting it as current.
The integration gap had been closed but no model knew. The company had shipped 12 integrations. The AI was working from a snapshot that predated those launches. No model had picked up the new integrations.
Gemini didn’t have enough data to compare. The company was present but shallow. A buyer using Gemini would get a non-answer that would likely send them to a different model for a real comparison.
Five problems. Five different fixes. None of them involve changing the product. All of them involve publishing the right content, in the right format, to the right sources.
The fixes were specific
For the pricing problem: the company updated their pricing page with plain-text pricing (not just an interactive calculator), added schema markup, and published a short blog post announcing the current pricing structure. Within weeks, Perplexity (which uses live search) reflected the new numbers.
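As a rough illustration of the schema markup step, here is a minimal Python sketch that builds schema.org Product and Offer markup as JSON-LD. The helper name, product name, plan names, and prices are all placeholders invented for this example, not the company's real data or the markup they actually shipped.

```python
import json


def pricing_jsonld(product_name, plans, currency="USD"):
    """Build schema.org Product markup with one Offer per pricing plan.

    `plans` is a list of (plan_name, price) pairs. This helper is
    hypothetical; adapt the fields to your own pricing structure.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": product_name,
        "offers": [
            {
                "@type": "Offer",
                "name": plan_name,
                "price": f"{price:.2f}",
                "priceCurrency": currency,
            }
            for plan_name, price in plans
        ],
    }


# Placeholder values for illustration only.
markup = pricing_jsonld("ExampleProduct", [("Starter", 49), ("Team", 199)])

# JSON-LD is embedded in the page inside a script tag of this type.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
print(script_tag)
```

The markup complements the plain-text prices on the page rather than replacing them: the same numbers should appear in both places so that crawlers and readers see one consistent figure.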
For the comparison page gap: they published their own “[Company] vs [Competitor]” page, structured with a clear feature comparison table, use-case recommendations, and a direct verdict. When we re-ran the benchmark two months later, ChatGPT’s recommendation had flipped.
For the Reddit thread: there was no direct fix. You can’t edit someone else’s Reddit post. But the company’s engineering team started posting detailed technical responses in relevant subreddit threads, which created newer, more accurate community content for Perplexity to cite.
For the integration gap: the company created a dedicated integrations page with structured data listing all current integrations with launch dates. This gave models a single authoritative source for integration information.
For Gemini’s shallow coverage: this was the slowest to address. Gemini’s index needed more source material to build a meaningful comparison. Together, the comparison page, integrations page, and pricing page gave Gemini enough data to produce a substantive response in later benchmark runs.
This is the loop the whole product is built around: benchmark, fix, re-benchmark, read the deltas. Each run’s Measurement tab shows exactly what shifted since the previous run: which mentions appeared or disappeared, which entities moved, which citations changed. “ChatGPT’s recommendation flipped” isn’t an anecdote in that view. It’s a row in the diff.
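To make the idea of a diff between runs concrete, here is a small Python sketch. It is not KnitKnot's implementation; the data shape (a dict per model with a recommendation and a list of citations), the function name, and the sample values are all assumptions made for illustration.

```python
def diff_runs(previous, current):
    """Return per-model changes between two benchmark runs.

    Each run maps a model name to a dict with a "recommendation" string
    and a "citations" collection of source labels. Models with no
    changes are left out of the result.
    """
    deltas = {}
    for model in sorted(set(previous) | set(current)):
        before = previous.get(model, {})
        after = current.get(model, {})
        old_cites = set(before.get("citations", ()))
        new_cites = set(after.get("citations", ()))
        old_rec = before.get("recommendation")
        new_rec = after.get("recommendation")
        added = sorted(new_cites - old_cites)
        removed = sorted(old_cites - new_cites)
        if old_rec != new_rec or added or removed:
            deltas[model] = {
                "recommendation": (old_rec, new_rec),
                "citations_added": added,
                "citations_removed": removed,
            }
    return deltas


# Hypothetical runs, loosely shaped like the case described above.
run_1 = {
    "chatgpt": {"recommendation": "competitor",
                "citations": ["competitor/vs-page", "g2-review"]},
    "claude": {"recommendation": "company",
               "citations": ["company/benchmarks"]},
}
run_2 = {
    "chatgpt": {"recommendation": "company",
                "citations": ["company/vs-page", "g2-review"]},
    "claude": {"recommendation": "company",
               "citations": ["company/benchmarks"]},
}

deltas = diff_runs(run_1, run_2)
for model, change in deltas.items():
    print(model, change)
```

With this sample data only ChatGPT appears in the output, because Claude's recommendation and citations did not change between the two runs.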
Try it yourself
Open ChatGPT, Claude, Perplexity, and Gemini. Type the same prompt into each:
“Compare [your company] vs [your top competitor] for [your primary use case]”
Read all four responses side by side. Check the pricing. Check the feature attributions. Note which sources are cited. Note whether the recommendation is consistent or contradictory. (This is what a spot test does in KnitKnot: one prompt, run on demand across engines, scored the same way as a full benchmark.)
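If you want to record the side-by-side read in a structured way, a sketch like the following may help. It assumes you jot down each model's recommendation and any prices it quoted by hand; the function name, the note format, and every number below are placeholders, and this is not how KnitKnot scores a spot test.

```python
from collections import Counter


def check_consistency(notes, actual_prices):
    """Summarize disagreement across models from hand-recorded notes.

    `notes` maps a model name to {"recommendation": str or None,
    "prices": {vendor: quoted_price}}. `actual_prices` maps each vendor
    to its current price. A recommendation of None means the model
    declined to pick a side.
    """
    recs = Counter(
        n["recommendation"] for n in notes.values() if n["recommendation"]
    )
    wrong_prices = {}
    for model, n in notes.items():
        # Vendors whose quoted price differs from the current price.
        bad = sorted(
            vendor for vendor, price in n["prices"].items()
            if price != actual_prices[vendor]
        )
        if bad:
            wrong_prices[model] = bad
    return {
        "recommendations": dict(recs),
        "consistent": len(recs) <= 1,
        "wrong_prices": wrong_prices,
        "no_recommendation": sorted(
            m for m, n in notes.items() if not n["recommendation"]
        ),
    }


# Placeholder notes; all prices are invented for the example.
notes = {
    "chatgpt": {"recommendation": "competitor", "prices": {"company": 500}},
    "claude": {"recommendation": "company", "prices": {"competitor": 300}},
    "perplexity": {"recommendation": None, "prices": {}},
    "gemini": {"recommendation": None, "prices": {}},
}
actual_prices = {"company": 400, "competitor": 450}

report = check_consistency(notes, actual_prices)
print(report)
```

The report separates three questions that are easy to blur when reading four long answers: whether the models agree on a winner, which models quote stale prices and for which vendor, and which models give no usable recommendation at all.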
If the responses are accurate and consistent, you’re in better shape than most. If they’re not, you now know exactly which facts are wrong and which sources are driving the narrative. That’s the starting point for fixing it.
What surprised us wasn’t that the AI got things wrong. It’s that four models got different things wrong, in different directions, citing different sources. A buyer’s impression of the company depended entirely on which app they happened to ask. That level of inconsistency isn’t something you can find by monitoring a single model. And it’s not something you can fix with a single content update. Each model has different sources, different citation preferences, and different failure modes. The benchmark has to cover all of them.