KnitKnot
You are reading the agent-optimized layer of this page: the literal markdown we serve to AI crawlers and assistants, shipped in the page source of every visit. Making sure AI reads the right facts about a company is literally what KnitKnot does.

# Approximating the Claude Engine

ChatGPT, Perplexity, and Gemini all have incognito search. Claude doesn't. To benchmark how Claude represents companies, we had to find a way that respects Anthropic's terms instead of working around them. Here's what we built.


## The asymmetry nobody talks about

A KnitKnot benchmark runs the same prompt across ChatGPT, Claude, Perplexity, and Gemini. Then we score [how each engine represented the company](/blog/introducing-knitknot).

Three of those four have incognito search. You can hit ChatGPT, Perplexity, and Gemini without logging in. There's an entire industry around it. The responses come back with citations. You can run thousands of prompts a day without anyone noticing.

Claude is different. claude.ai is gated. There's no public surface a scraper can sit on. And even if you logged in and automated it, that doesn't work for a product that runs evaluations on a schedule for paying customers. You'd be juggling accounts. Accounts accumulate memory. The results drift. The whole thing is fragile and, frankly, not compliant with Anthropic's terms of service.

We needed a different path.

## The wrong answers first

The lazy move was to skip Claude entirely. Run the other three, call it good enough. But Claude is the engine most of the engineers we talk to use the most. Leaving it out of a benchmark about AI buying behavior would have been a bigger lie than any approximation.

The other lazy move was to call the Claude API with no web access and pretend that was the same thing. It isn't. Base-model Claude has no idea what happened recently. When a buyer asks Claude about a product today, they're asking Claude with search turned on. That's a different system, and it's the one we needed to measure.

## What we knew about how Claude searches

Anthropic has publicly documented that claude.ai uses Brave Search to power its web lookups. That mattered more than it sounds.

If the search engine is the same, then the gap between claude.ai and a bare API call isn't the index. It's the queries. claude.ai is wrapped in a system prompt that tells Claude how to search: short queries, broad before narrow, prefer original sources over aggregators. The bare API doesn't know any of that, so it asks the wrong questions and finds the wrong pages.

We didn't need to be inside claude.ai. We needed to replicate the same loop: Claude decides what to search for, Brave returns the pages, Claude reads them and writes a cited answer. All three steps are things we can do from the outside, with the real model, against the real web. All within the API's terms of use.

## Building the fan-out

The key variable turned out to be how many searches we ran per prompt. A single Brave query would find maybe a third of the sources claude.ai cited. Six to eight diverse short queries, 10 results each, got us to around 80% source overlap. That was the threshold where the pipeline started producing answers that looked like what a buyer would actually see.

To get the fan-out right, we ran an auto-research loop on the whole chain — inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) approach. We pointed Claude at the problem and let it iteratively query, search, and synthesize across the full pipeline, then measured recall against browser ground truth at each step.

Here's what the fan-out actually looks like. Say the benchmark prompt is *"What's the best AI visibility tool for B2B SaaS?"* A single search for that exact phrase returns almost nothing useful. Instead, the pipeline generates six to eight diverse short queries based on the prompt:

  • - *AI visibility tools B2B*
  • - *best GEO platforms 2026*
  • - *KnitKnot vs alternatives*
  • - *AI answer optimization software*
  • - *how brands track ChatGPT mentions*
  • - *generative engine optimization tools*

Each query hits Brave independently. Claude reads all 60–80 results, deduplicates, and synthesizes a cited answer. The diversity is what makes it work — the exact phrase misses whole categories that a human buyer would find organically by browsing around.

We validated against 20 diverse prompts and 30 real production prompts from our own database. Side-by-side, browser versus API. The fan-out approach hit 80% source overlap at about $0.07 per prompt and 10 seconds. The old browser pipeline was 37 seconds per prompt and fragile. The API with built-in search was $0.13 and found different sources entirely. The fan-out was the only approach that actually approximated what a buyer sees.

Stopped where the source overlap plateaued. Pushed further and the marginal gain didn't justify the extra queries and cost.

## What this means for customers

A benchmark is only useful if every engine in it can be re-run on demand, at scale, on a schedule, without melting. Our Claude engine clears that bar. It runs the same way for every customer, on every run, with the same retry behavior and the same citation handling as the other three.

Because we built it in the open, we can show customers things claude.ai can't: the exact queries Claude searched, the exact pages those queries returned, the exact paragraphs Claude pulled from. The approximation is more transparent than the thing it approximates.

Other engines only offer Claude on premium tiers. We don't know how they do it, but KnitKnot includes it because it's too important to gate.

## What we're not claiming

We are not saying our pipeline is claude.ai. claude.ai has tuning we don't see, ranking we don't control, and product decisions that change underneath us. Our engine is a defensible reconstruction of the same loop, validated against real Claude output.

It will remain an approximation. That's how we respect Anthropic's terms. We're not scraping their product. We're not juggling accounts. We're running the same model against the same search backend with our own query logic, and we report it as such.

When claude.ai changes its behavior in a way that diverges from our reconstruction, we'll see it in the source-overlap check and tune again. That's the contract. Real model, real web, real citations, validated against the thing it's approximating. Good enough to put a number next to.

Raw mirror of this content: https://knitknot.ai/blog/approximating-the-claude-engine.md. Site-wide summary: /llms.txt · full content: /llms-full.txt

Approximating the Claude Engine

· 5 minute read

Kevin Kho

Kevin Kho

Co-founder, KnitKnot

Source-overlap calibration

claude.ai Ship here Query fan-out →

The asymmetry nobody talks about

A KnitKnot benchmark runs the same prompt across ChatGPT, Claude, Perplexity, and Gemini. Then we score how each engine represented the company.

Three of those four have incognito search. You can hit ChatGPT, Perplexity, and Gemini without logging in. There’s an entire industry around it. The responses come back with citations. You can run thousands of prompts a day without anyone noticing.

Claude is different. claude.ai is gated. There’s no public surface a scraper can sit on. And even if you logged in and automated it, that doesn’t work for a product that runs evaluations on a schedule for paying customers. You’d be juggling accounts. Accounts accumulate memory. The results drift. The whole thing is fragile and, frankly, not compliant with Anthropic’s terms of service.

We needed a different path.

The wrong answers first

The lazy move was to skip Claude entirely. Run the other three, call it good enough. But Claude is the engine most of the engineers we talk to use the most. Leaving it out of a benchmark about AI buying behavior would have been a bigger lie than any approximation.

The other lazy move was to call the Claude API with no web access and pretend that was the same thing. It isn’t. Base-model Claude has no idea what happened recently. When a buyer asks Claude about a product today, they’re asking Claude with search turned on. That’s a different system, and it’s the one we needed to measure.

What we knew about how Claude searches

Anthropic has publicly documented that claude.ai uses Brave Search to power its web lookups. That mattered more than it sounds.

If the search engine is the same, then the gap between claude.ai and a bare API call isn’t the index. It’s the queries. claude.ai is wrapped in a system prompt that tells Claude how to search: short queries, broad before narrow, prefer original sources over aggregators. The bare API doesn’t know any of that, so it asks the wrong questions and finds the wrong pages.

We didn’t need to be inside claude.ai. We needed to replicate the same loop: Claude decides what to search for, Brave returns the pages, Claude reads them and writes a cited answer. All three steps are things we can do from the outside, with the real model, against the real web. All within the API’s terms of use.

Building the fan-out

The key variable turned out to be how many searches we ran per prompt. A single Brave query would find maybe a third of the sources claude.ai cited. Six to eight diverse short queries, 10 results each, got us to around 80% source overlap. That was the threshold where the pipeline started producing answers that looked like what a buyer would actually see.

To get the fan-out right, we ran an auto-research loop on the whole chain — inspired by Karpathy’s autoresearch approach. We pointed Claude at the problem and let it iteratively query, search, and synthesize across the full pipeline, then measured recall against browser ground truth at each step.

Here’s what the fan-out actually looks like. Say the benchmark prompt is “What’s the best AI visibility tool for B2B SaaS?” A single search for that exact phrase returns almost nothing useful. Instead, the pipeline generates six to eight diverse short queries based on the prompt:

  • AI visibility tools B2B
  • best GEO platforms 2026
  • KnitKnot vs alternatives
  • AI answer optimization software
  • how brands track ChatGPT mentions
  • generative engine optimization tools

Each query hits Brave independently. Claude reads all 60–80 results, deduplicates, and synthesizes a cited answer. The diversity is what makes it work — the exact phrase misses whole categories that a human buyer would find organically by browsing around.

We validated against 20 diverse prompts and 30 real production prompts from our own database. Side-by-side, browser versus API. The fan-out approach hit 80% source overlap at about $0.07 per prompt and 10 seconds. The old browser pipeline was 37 seconds per prompt and fragile. The API with built-in search was $0.13 and found different sources entirely. The fan-out was the only approach that actually approximated what a buyer sees.

Stopped where the source overlap plateaued. Pushed further and the marginal gain didn’t justify the extra queries and cost.

What this means for customers

A benchmark is only useful if every engine in it can be re-run on demand, at scale, on a schedule, without melting. Our Claude engine clears that bar. It runs the same way for every customer, on every run, with the same retry behavior and the same citation handling as the other three.

Because we built it in the open, we can show customers things claude.ai can’t: the exact queries Claude searched, the exact pages those queries returned, the exact paragraphs Claude pulled from. The approximation is more transparent than the thing it approximates.

Other engines only offer Claude on premium tiers. We don’t know how they do it, but KnitKnot includes it because it’s too important to gate.

What we’re not claiming

We are not saying our pipeline is claude.ai. claude.ai has tuning we don’t see, ranking we don’t control, and product decisions that change underneath us. Our engine is a defensible reconstruction of the same loop, validated against real Claude output.

It will remain an approximation. That’s how we respect Anthropic’s terms. We’re not scraping their product. We’re not juggling accounts. We’re running the same model against the same search backend with our own query logic, and we report it as such.

When claude.ai changes its behavior in a way that diverges from our reconstruction, we’ll see it in the source-overlap check and tune again. That’s the contract. Real model, real web, real citations, validated against the thing it’s approximating. Good enough to put a number next to.