KnitKnot
You are reading the agent-optimized layer of this page: the literal markdown we serve to AI crawlers and assistants, shipped in the page source of every visit. Making sure AI reads the right facts about a company is literally what KnitKnot does.

# We stopped asking AI who wins

Most LLM-as-judge systems ask one question: who's better? We decompose the judgment into structured signals and derive the outcome deterministically. Here's why.


## The obvious approach

We don't ask an LLM judge who won. We extract structured signals from every AI response and derive the outcome deterministically, because a score you can't explain isn't a measurement.

We learned this the usual way. When we started building KnitKnot's scoring pipeline, we sent the AI response to GPT and asked: *"On a scale of 0-100, how well is this company represented?"* It worked, sort of. We got numbers back. We put them in charts. But when a customer asked why they scored 52, we couldn't answer. The model had mashed recommendation strength, factual accuracy, sentiment, feature coverage, and source quality into a single number with no trail. Two runs on the same response would come back 48 and 57. A response that recommended the competitor but said nice things about you scored the same as one that recommended you but got a key claim wrong.

The number was a vibe check, not a measurement. Vibe checks are fine for prototypes, but when a company is making content decisions based on your score, the score has to decompose into something they can act on.

So we stopped asking.

## Seven signals, not one opinion

Instead of one holistic judgment, we extract structured fields from every AI response and compute the score deterministically. Seven signals, each with a clear definition. Most are components scored 0.0 to 1.0; coverage and the confidence penalty act as multipliers:

**Recommendation.** Did the AI explicitly recommend your company, the competitor, or neither? Three-valued signal: 1.0 if you were picked, 0.5 if the AI declined to choose, 0.0 if the competitor was picked. No partial credit for "also mentioned."

**Feature comparisons.** For every feature the AI compared, did you win, lose, or tie? The component is `(wins + 0.5 × ties) / total`. If the AI didn't compare any features, that's a weak negative: 0.3, not a pass. Silence isn't neutral when a buyer is evaluating you.
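The feature-comparison formula is small enough to write out. The sketch below is illustrative rather than KnitKnot's actual code; the function and parameter names are assumptions, while the formula and the 0.3 fallback come from the paragraph above.

```python
def feature_comparison_score(wins: int, ties: int, losses: int) -> float:
    """Score feature matchups as (wins + 0.5 * ties) / total."""
    total = wins + ties + losses
    if total == 0:
        # No features were compared: weak negative, not a pass.
        return 0.3
    return (wins + 0.5 * ties) / total


# Three wins and one loss out of four matchups.
print(feature_comparison_score(wins=3, ties=0, losses=1))  # 0.75
# The AI compared no features at all.
print(feature_comparison_score(wins=0, ties=0, losses=0))  # 0.3
```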

**Claim accuracy.** How many of the AI's claims about your company were wrong? For brand-level evaluations, misrepresentations are severity-weighted: a critical factual error about core positioning (weight 1.0) counts five times as much as a low-severity tone nitpick (weight 0.2). Five critical misrepresentations saturate the score at zero. By that point the response is already catastrophic, and further errors cannot push the score any lower.
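One way to read the severity weighting is as a penalty that accumulates until it saturates. The sketch below assumes a linear falloff, which the text does not confirm, so it is not meant to reproduce any figure in this post. Only the `critical` (1.0) and `low` (0.2) weights and the five-critical saturation point come from the text; the intermediate severity levels and all names are hypothetical.

```python
# Severity weights. Only "critical" and "low" are stated in the text;
# "high" and "medium" are hypothetical placeholders.
SEVERITY_WEIGHTS = {"critical": 1.0, "high": 0.7, "medium": 0.4, "low": 0.2}

# Total penalty weight at which the score bottoms out (five critical errors).
SATURATION_WEIGHT = 5.0


def claim_accuracy_score(severities: list[str]) -> float:
    """Assumed linear falloff: 1.0 with no errors, 0.0 at saturation."""
    total_weight = sum(SEVERITY_WEIGHTS[s] for s in severities)
    # Clamp at zero so extra errors past saturation cannot go negative.
    return max(0.0, 1.0 - total_weight / SATURATION_WEIGHT)


print(claim_accuracy_score([]))                # no misrepresentations
print(claim_accuracy_score(["critical"] * 5))  # saturated
print(claim_accuracy_score(["critical"] * 7))  # still saturated
```

The clamp is what makes the score saturate: a sixth or seventh critical error leaves it at zero.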

**Sentiment.** Positive, neutral, or negative, with room for gradations in between. Simple, but it catches something the other components don't: the AI can recommend you, get every claim right, and still frame you dismissively. A 0.25 "dismissive" score, below the 0.5 of a neutral tone, is the difference between *"Acme is a solid choice"* and *"You could try Acme, I guess."*

**Source balance.** What fraction of the AI's cited sources belong to your domain versus competitor domains? If the AI cited five competitor blog posts and zero of your pages, your source balance is 0.0. The response was written from competitor marketing material.
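Source balance can be sketched as a ratio over cited URLs. This version makes two assumptions the text leaves open: third-party sources are ignored, and a response with no citations from either side returns `None` rather than a number. The `.example` domains and all function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def _belongs_to(url: str, domains: set[str]) -> bool:
    """True if the URL's host is one of the domains or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


def source_balance(
    cited_urls: list[str], own: set[str], competitors: set[str]
) -> Optional[float]:
    """Share of own-domain citations among own plus competitor citations."""
    own_count = sum(_belongs_to(u, own) for u in cited_urls)
    competitor_count = sum(_belongs_to(u, competitors) for u in cited_urls)
    if own_count + competitor_count == 0:
        return None  # assumption: undefined when neither side is cited
    return own_count / (own_count + competitor_count)


cited = ["https://blog.widgetly.example/post-%d" % i for i in range(4)]
cited.append("https://acme.example/docs/compliance")
print(source_balance(cited, {"acme.example"}, {"widgetly.example"}))  # 0.2
```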

**Coverage.** How much of the response actually discussed you? This is the multiplier that sits on top of everything else. Primary or substantial coverage: full score. Peripheral mention: score × 0.4. Incidental (you appeared in a list): score × 0.1. Absent: zero, regardless of how positive the rest of the response was. A perfect recommendation in a sentence that nobody reads carries no weight.
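Because coverage is a multiplier rather than a weighted term, it can be modeled as a lookup applied after everything else. The multipliers below are the ones stated above; the string labels and function name are assumptions made for this sketch.

```python
# Coverage level -> multiplier applied to the already-weighted score.
COVERAGE_MULTIPLIER = {
    "primary": 1.0,
    "substantial": 1.0,
    "peripheral": 0.4,
    "incidental": 0.1,
    "absent": 0.0,
}


def apply_coverage(score: float, coverage: str) -> float:
    """Scale a score by how much of the response discussed the company."""
    return score * COVERAGE_MULTIPLIER[coverage]


for level in COVERAGE_MULTIPLIER:
    print(level, round(apply_coverage(80.0, level), 1))
```

An unknown coverage label raises `KeyError` here, a deliberate choice so that a bad extraction fails loudly instead of silently scoring as full coverage.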

**Confidence penalty.** This one is subtle. When the AI states a false claim with certainty, we boost the penalty by 1.3×. When the AI hedges on a correct claim, we apply a mild 0.9× discount. The intuition: a buyer who reads *"Acme definitely does not support SOC 2"* walks away with a different impression than one who reads *"I'm not entirely sure about Acme's SOC 2 status."* Confident misinformation is more damaging than hedged truth is reassuring.
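The asymmetry can be expressed as a small lookup. In this sketch the certainty labels (`"certain"`, `"hedged"`) and the function name are hypothetical. What the multiplier scales, whether the penalty weight of a false claim or the credit for a correct one, is one reading of the paragraph above, not a confirmed implementation detail.

```python
def confidence_multiplier(claim_is_correct: bool, certainty: str) -> float:
    """Multiplier reflecting how confidently the AI stated a claim."""
    if not claim_is_correct and certainty == "certain":
        return 1.3  # confident misinformation: boost the penalty
    if claim_is_correct and certainty == "hedged":
        return 0.9  # hedged truth: mild discount
    return 1.0      # all other combinations are left unchanged


# A critical false claim (weight 1.0) stated with certainty.
penalty = 1.0 * confidence_multiplier(claim_is_correct=False, certainty="certain")
print(penalty)  # 1.3
```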

Each component gets a per-category weight, renormalized so they sum to 1.0. Same inputs, same score, every time. No temperature, no prompt sensitivity, no "run it again and hope for the best."
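A minimal sketch of the aggregation step, assuming the weighted components are combined as a renormalized weighted average, scaled to 0-100, and then multiplied by coverage. The equal weights and the component values in the example are hypothetical: the real per-category weights are not given here, so this will not reproduce any score quoted in this post.

```python
def weighted_score(
    components: dict[str, float],
    weights: dict[str, float],
    coverage_multiplier: float = 1.0,
) -> float:
    """Deterministic 0-100 score from component values and weights."""
    # Use only the weights of components that are present, then
    # renormalize so the active weights sum to 1.0.
    active = {name: weights[name] for name in components}
    total = sum(active.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    base = sum(components[name] * active[name] / total for name in components)
    return 100.0 * base * coverage_multiplier


# Hypothetical equal weights and component values, for illustration only.
weights = dict.fromkeys(
    ["recommendation", "features", "accuracy", "sentiment", "sources"], 1.0
)
components = {
    "recommendation": 1.0,
    "features": 0.5,
    "accuracy": 1.0,
    "sentiment": 0.5,
    "sources": 0.0,
}
print(round(weighted_score(components, weights), 1))  # 60.0
```

The function is pure arithmetic over its inputs, so calling it twice with the same component vector gives the same score by construction.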

## What this actually looks like

Say a buyer asks ChatGPT: *"Compare Acme and Widgetly for enterprise compliance automation."*

The response comes back positive-sounding. Mentions both companies. Feels balanced. A holistic judge might give Acme a 58: "moderate representation, room for improvement." Helpful? Not really.

Here's what our decomposition surfaces:

<div style="margin: 2em 0; border: 1px solid hsl(var(--border)); border-radius: 8px; overflow: hidden;"> <table style="width: 100%; border-collapse: collapse; font-size: 14px;"> <thead> <tr style="background: hsl(var(--muted) / 0.5);"> <th style="text-align: left; padding: 10px 16px; font-weight: 600; font-size: 11px; text-transform: uppercase; letter-spacing: 0.06em; color: hsl(var(--muted-foreground)); border-bottom: 1px solid hsl(var(--border));">Component</th> <th style="text-align: center; padding: 10px 16px; font-weight: 600; font-size: 11px; text-transform: uppercase; letter-spacing: 0.06em; color: hsl(var(--muted-foreground)); border-bottom: 1px solid hsl(var(--border)); white-space: nowrap;">Value</th> <th style="text-align: left; padding: 10px 16px; font-weight: 600; font-size: 11px; text-transform: uppercase; letter-spacing: 0.06em; color: hsl(var(--muted-foreground)); border-bottom: 1px solid hsl(var(--border));">What it means</th> </tr> </thead> <tbody> <tr> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); font-weight: 500;">Recommendation</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); text-align: center; font-family: var(--font-mono, monospace); color: #FC5043;">0.0</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); color: hsl(var(--muted-foreground));">AI explicitly recommended Widgetly</td> </tr> <tr> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); font-weight: 500;">Feature comparisons</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); text-align: center; font-family: var(--font-mono, monospace); color: #3DB0C4;">0.75</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); color: hsl(var(--muted-foreground));">Acme won 3 of 4 feature matchups</td> </tr> <tr> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); 
font-weight: 500;">Claim accuracy</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); text-align: center; font-family: var(--font-mono, monospace); color: #F67202;">0.35</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); color: hsl(var(--muted-foreground));">1 false claim, stated with certainty (1.3× penalty)</td> </tr> <tr> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); font-weight: 500;">Sentiment</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); text-align: center; font-family: var(--font-mono, monospace); color: hsl(var(--muted-foreground));">0.5</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); color: hsl(var(--muted-foreground));">Neutral tone</td> </tr> <tr> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); font-weight: 500;">Source balance</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); text-align: center; font-family: var(--font-mono, monospace); color: #FC5043;">0.2</td> <td style="padding: 10px 16px; border-bottom: 1px solid hsl(var(--border) / 0.5); color: hsl(var(--muted-foreground));">4 of 5 cited sources are Widgetly's blog</td> </tr> <tr> <td style="padding: 10px 16px; font-weight: 500;">Coverage</td> <td style="padding: 10px 16px; text-align: center; font-family: var(--font-mono, monospace); color: #3DB0C4;">1.0</td> <td style="padding: 10px 16px; color: hsl(var(--muted-foreground));">Primary discussion (no discount)</td> </tr> </tbody> <tfoot> <tr style="background: hsl(var(--muted) / 0.3);"> <td style="padding: 12px 16px; border-top: 1px solid hsl(var(--border)); font-size: 12px; text-transform: uppercase; letter-spacing: 0.06em; color: hsl(var(--muted-foreground)); font-weight: 500;">Weighted score</td> <td style="padding: 12px 16px; border-top: 1px solid hsl(var(--border)); text-align: center; 
font-size: 20px; font-weight: 600; font-family: var(--font-mono, monospace); color: hsl(var(--foreground));">34.2</td> <td style="padding: 12px 16px; border-top: 1px solid hsl(var(--border));"></td> </tr> </tfoot> </table> </div>

That's not a "58, try harder." That's a specific map of what went wrong. The features are fine. Acme is winning the technical comparison. The problems are: a confident factual error that needs correcting (likely outdated information the AI was trained on), a recommendation that went to the competitor despite Acme winning on features (check what Widgetly is doing in its content strategy), and a source imbalance where the AI built its answer from competitor content.

Three different problems. Three different fixes. None of them are "make your product better."

## Why this matters

Every score in KnitKnot ships with its full component vector. Not *"you scored 34"* but *"you scored 34 because: recommendation lost, one false claim stated with high confidence, strong feature coverage, sources dominated by competitor content."*

That's the difference between a number and a diagnostic. The number tells you something is wrong. The components tell you what to fix, in what order, and whether it's a content problem, a positioning problem, or an accuracy problem.

Determinism also makes the numbers auditable. The structured signals are written down at scoring time, and every aggregate reads from one canonical metrics layer, so any headline number on a report drills down to the exact underlying AI responses. The drill-down list is the same row set that produced the number. There is one definition of a win rate, used everywhere, and the dashboard and the public report cannot disagree because they read the same rows.

And when the inputs are structured, you can diff across runs. Each run's Measurement tab shows the deltas against the previous run. Score dropped 8 points this month? The component vector tells you it was a recommendation swing, not a sentiment change. Score went up but you didn't do anything? The component vector might show that a misrepresentation got corrected in the model's training data. Every movement is traceable.
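Diffing two runs reduces to subtracting component vectors. The sketch below is an illustration under assumed names and made-up values, not the code behind the Measurement tab; it ranks components by how much they moved so the largest swing surfaces first.

```python
def component_deltas(
    previous: dict[str, float], current: dict[str, float]
) -> list[tuple[str, float]]:
    """Per-component change between two runs, largest movement first."""
    deltas = {
        name: current[name] - previous[name]
        for name in current
        if name in previous  # compare only components present in both runs
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical component vectors from two consecutive runs.
last_month = {"recommendation": 1.0, "sentiment": 0.5, "features": 0.75}
this_month = {"recommendation": 0.0, "sentiment": 0.5, "features": 0.75}

for name, delta in component_deltas(last_month, this_month):
    print(f"{name}: {delta:+.2f}")
```

In this made-up case the recommendation component accounts for the entire change, while sentiment and features did not move.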

We made this choice because we think a benchmark you can't explain isn't a benchmark. It's a guess with a confidence interval.

Raw mirror of this content: https://knitknot.ai/blog/we-stopped-asking-ai-who-wins.md. Site-wide summary: /llms.txt · full content: /llms-full.txt

By Max Wiesner, Co-founder, KnitKnot

[Figure: Signal decomposition. Labels: REC, FEAT, ACC, SENT, SRC, COV, Score.]
