How B2B buyers use AI to evaluate software vendors
Where do B2B buyers evaluate software now?
Inside an AI chat window, before the vendor knows the deal exists. 51% of B2B software buyers now start their research with an AI chatbot rather than Google, and 69% of them change which vendor they choose based on what the AI tells them.
When Kevin wrote about starting KnitKnot, he described something simple: he’d stopped comparing tools on Google. Every time he needed auth, a vector DB, or transactional email, he asked Claude. Then he signed up for whatever it told him. No demo. No sales call.
He asked around. Most engineers were doing the same thing. So were founders. The buying conversation that used to happen on a sales call was happening in a single prompt, before the vendor knew it existed.
That observation became the company. But at the time, we had anecdotal evidence and a gut feeling. Nine months later, the data caught up.
How many B2B buyers start research with AI?
51%, according to G2’s early-2026 B2B buyer behavior report. The report quantified what we’d been seeing anecdotally, and three numbers stood out.
51% of B2B software buyers now start their research with an AI chatbot rather than Google. Not “sometimes use AI.” Start with it. AI is the first touch in the evaluation, not a supplementary tool.
69% of those buyers changed which vendor they ultimately chose based on what the AI told them. The AI didn’t just confirm their existing preference. It shifted it. For more than two-thirds of buyers, the AI’s synthesis was influential enough to change the outcome.
55% say AI reduced the total number of vendors they evaluated. The shortlist got shorter. AI doesn’t return ten blue links. It returns an answer with four to seven named vendors. If you’re not in that set, you don’t get evaluated at all.
These numbers describe a structural change, not a trend. The evaluation has moved from a multi-touch, multi-week research process into a compressed interaction that happens in minutes, in a chat window you have no visibility into.
How does an AI vendor evaluation work?
The buyer asks a comparison question, the AI synthesizes an answer from a handful of sources, and the buyer acts on it without verifying the claims. Our benchmark dataset, which includes 11,600 head-to-head evaluations across 136 competitors on ChatGPT, Claude, Perplexity, and Gemini, shows the process is roughly the same across models.
The buyer asks a comparison question. Not “tell me about Acme.” Something adversarial and specific: “Compare Acme and Widgetly for enterprise compliance automation.” Or “What are the best tools for SOC 2 compliance for a Series B startup?” The query implies a decision. The buyer wants a recommendation, not a list.
The AI synthesizes from multiple sources. The response isn’t generated from a single database. The model pulls from its training data (a snapshot in time), live web search results (if the model supports them), and cached knowledge of publicly available content. It cites three to seven sources, usually a mix of vendor websites, third-party reviews, comparison articles, and community forums.
The AI produces a structured response. A typical comparison response has a brief introduction, a feature-by-feature breakdown, a pricing comparison if the data is available, a discussion of strengths and limitations for each vendor, and a recommendation. The tone is authoritative. The response reads like it was written by an analyst who has done deep research.
The buyer acts on it. 69% of AI-first buyers change which vendor they choose based on what the AI says. They might ask a follow-up question. They might visit the recommended vendor’s website. They might skip straight to a sign-up page. What they almost never do is verify the factual claims in the response against each vendor’s actual website. The AI said it, so it must be true.
What patterns show up in AI evaluation benchmarks?
Three patterns emerge when you analyze how AI handles B2B evaluation queries at scale: recommendations are inconsistent across models, the comparison framework comes from whoever published comparison content first, and factual errors compound into confident wrong recommendations.
The recommendation is inconsistent across models
The same comparison query put to four models often produces four different recommendations. We see this regularly in head-to-head benchmarks. ChatGPT recommends Vendor A. Claude recommends Vendor B. Perplexity presents a balanced comparison. Gemini doesn’t mention one of the vendors at all.
This happens because each model has different training data, different web search indices, and different citation preferences. ChatGPT uses Bing’s index and tends to favor well-established sources with high domain authority. Perplexity cites Reddit at disproportionate rates. Claude uses Brave Search and has different weighting for content recency.
For the buyer, this means the recommendation they get depends on which AI they happened to open. For the vendor, it means a single-model monitoring strategy misses most of the picture.
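One way to put a number on that inconsistency is to tally which vendor each model picks for the same query. The sketch below is illustrative only: the model picks are made-up values, not results from our benchmark, and `agreement_rate` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical picks for one comparison query. None means the model gave
# a balanced answer with no clear recommendation. Illustrative values only.
recommendations = {
    "ChatGPT": "Vendor A",
    "Claude": "Vendor B",
    "Perplexity": None,
    "Gemini": "Vendor A",
}

def agreement_rate(recs):
    """Return the most-recommended vendor and the share of models backing it."""
    picks = Counter(vendor for vendor in recs.values() if vendor is not None)
    if not picks:
        return None, 0.0
    top_vendor, top_count = picks.most_common(1)[0]
    # Divide by all models queried, so balanced answers count as non-agreement.
    return top_vendor, top_count / len(recs)

top_vendor, rate = agreement_rate(recommendations)
print(f"{top_vendor} is recommended by {rate:.0%} of models")
```

With these made-up inputs, a vendor watching only Claude would see Vendor B recommended and miss that two other models pick Vendor A.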
The comparison framework comes from whoever published first
When the AI compares two vendors, it needs an evaluation framework: which dimensions to compare on, what features to highlight, how to structure the analysis. That framework almost always comes from existing comparison content.
If Vendor A published a detailed “Us vs Vendor B” page, and Vendor B didn’t publish anything, the AI’s comparison framework mirrors Vendor A’s page. Vendor A’s strengths become the evaluation criteria. Vendor B’s strengths might not get mentioned because they weren’t in the source material.
We measure this through source gravity. In a significant fraction of competitive evaluations, the highest-influence source is one vendor’s comparison page. The AI isn’t generating a neutral analysis. It’s synthesizing one vendor’s competitive positioning into an authoritative-sounding answer.
This is the single most actionable finding for B2B companies. You don’t need better AI optimization. You need a comparison page that answers the buyer’s exact question from your frame, structured so AI can extract it.
Factual errors compound across the evaluation
AI doesn’t get one thing wrong. It gets several things subtly wrong, and the errors reinforce each other to produce a recommendation that feels well-reasoned but is built on incorrect premises.
A typical cascade: the AI quotes your old pricing (stale data), says the competitor supports a feature you actually support (misattribution), and frames the comparison around dimensions where the competitor has published more content (narrative control). Each error individually might not change the recommendation. Together, they produce a confident recommendation for the competitor that a buyer would have no reason to question.
This is why we decompose the score instead of asking the AI for a holistic rating. A single “how well is this company represented?” question hides whether the problem is pricing, features, source influence, or visibility. The decomposition tells you which error type is driving the outcome.
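A decomposed score can be sketched as a small record with one field per error type. This is a minimal illustration, not our scoring method: the 0 to 1 scale, the equal weighting, and the names `EvaluationScore`, `overall`, and `weakest` are all assumptions.

```python
from dataclasses import dataclass, asdict

@dataclass
class EvaluationScore:
    # Each component is assumed to sit on a 0-1 scale, where 1 means the AI
    # represented the company accurately and favorably on that dimension.
    pricing: float
    features: float
    source_influence: float
    visibility: float

    def overall(self) -> float:
        """Unweighted mean of the components (equal weights are an assumption)."""
        parts = asdict(self)
        return sum(parts.values()) / len(parts)

    def weakest(self) -> str:
        """Name of the lowest-scoring component, i.e. where to look first."""
        parts = asdict(self)
        return min(parts, key=parts.get)

# Hypothetical scores for one vendor on one query.
score = EvaluationScore(pricing=0.4, features=0.9, source_influence=0.8, visibility=0.9)
print(f"overall={score.overall():.2f}, weakest={score.weakest()}")
```

The overall number on its own looks acceptable here; only the per-component view shows that stale pricing is the thing dragging it down.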
What does AI buying behavior mean for your funnel?
AI collapses the consideration and evaluation stages into a single interaction that happens off your property. The traditional B2B marketing funnel assumes the buyer moves through stages you can see: awareness, consideration, evaluation, purchase. At each stage, you have content, touchpoints, and data.
In an AI-led evaluation, none of those stages happen where you can see them. The buyer doesn’t visit your blog to learn about the problem. They don’t download your comparison guide. They don’t attend your webinar. They ask the AI, and the AI synthesizes an answer from whatever sources it has access to.
This has two implications.
First, the content that matters most is content AI can extract and cite. Not the content that’s best for human readers. Not the content that converts the best on your website. The content that directly answers a comparison query in a structured, extractable format. That might be a comparison page, a feature matrix, an FAQ, or a pricing table with plain text (not just images).
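As a rough illustration of "plain text, not just images", the sketch below renders pricing data as an ordinary HTML table, so every plan name and price exists as text on the page. The plan names, prices, and the `pricing_table_html` helper are hypothetical, and whether a given model actually extracts the table is not something this code can guarantee.

```python
from html import escape

def pricing_table_html(plans):
    """Render plans as a plain HTML table so prices are text, not pixels."""
    header = "<tr><th>Plan</th><th>Price</th><th>Includes</th></tr>"
    rows = [
        "<tr>"
        f"<td>{escape(plan['name'])}</td>"
        f"<td>{escape(plan['price'])}</td>"
        f"<td>{escape(plan['includes'])}</td>"
        "</tr>"
        for plan in plans
    ]
    return "<table>\n" + "\n".join([header, *rows]) + "\n</table>"

# Hypothetical plans, for illustration only.
plans = [
    {"name": "Starter", "price": "$49/month", "includes": "Up to 5 users"},
    {"name": "Growth", "price": "$199/month", "includes": "Up to 50 users, SSO"},
]

table = pricing_table_html(plans)
print(table)
```

The same data in a screenshot or a styled image would look identical to a human reader but would give a text-based crawler nothing to quote.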
Second, you need visibility into what AI says before the buyer does. By the time a prospect shows up on your website, they’ve already read the AI’s evaluation. If the AI got something wrong, the prospect arrives with wrong expectations. If the AI recommended a competitor, the prospect might never arrive at all. The feedback loop for AI-driven evaluations doesn’t exist in your analytics unless you build it.
Is it too late to optimize for AI evaluations?
No. The AI evaluation landscape is still forming, and most B2B companies haven’t started measuring how AI represents them. The companies that benchmark their AI presence now and systematically fix the errors have a compounding advantage: each correction improves how the AI represents them in the next round of buyer queries, and each round starts from a more accurate baseline than the last.
Five brands capture 80% of AI recommendations in any given category. The brands that establish accurate, well-sourced AI presences in the next 12 months will be those five. The ones that wait will be competing for the remaining 20%.
The data is clear on timing. AI-referred traffic already converts at 4.4x the rate of standard organic search traffic, and those visitors spend 68% more time on site. The channel is smaller than Google today. It won’t be for long. And unlike SEO, where you can invest later and still catch up, AI’s winner-take-all dynamics mean that early accuracy advantages lock in.