# AHD · Eval report · 21 April 2026 · cross-provider
Seven models. Four providers. One reproducible regression.
Second run against the swiss-editorial token, across Anthropic,
OpenAI, Google, Cloudflare Workers AI and the Hugging Face router.
n=5 per cell, 70 calls, 62 scored, one cell inconclusive. Four
models move in the direction AHD promises. Two regress. One model
failed extraction in a revealing way. The Llama 3.3 regression
reproduces an earlier measurement on a different provider.
## Per-model slop reduction
| Model | Provider · path | Raw → scored | Compiled → scored | Raw mean | Compiled mean | Reduction |
|---|---|---|---|---|---|---|
| gpt-5-codex | OpenAI · Codex CLI | 5 → 5 | 5 → 5 | 1.40 | 0.40 | 71% |
| @cf/mistralai/mistral-small-3.1-24b-instruct | Cloudflare | 5 → 5 | 5 → 5 | 3.40 | 1.40 | 59% |
| deepseek-ai/DeepSeek-R1 | Hugging Face | 5 → 3 | 5 → 3 | 2.00 | 1.00 | 50% |
| claude-opus-4-7 | Anthropic · Claude Code CLI | 5 → 5 | 5 → 5 | 1.40 | 0.80 | 43% |
| gemini-2.5-pro | Google · Gemini CLI | 5 → 4 | 5 → 0 | 2.75 | · | inconclusive |
| Qwen/Qwen3-8B | Hugging Face | 5 → 5 | 5 → 5 | 1.80 | 2.60 | −44% |
| meta-llama/Llama-3.3-70B-Instruct | Hugging Face | 5 → 5 | 5 → 5 | 0.40 | 1.20 | −200% |
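The Reduction column is the relative drop in mean tells from raw to compiled. A minimal sketch of the computation (the function is illustrative, not the harness's actual code), including the guard that reports Gemini's cell as inconclusive rather than a spurious 100%:

```python
def reduction(raw_mean, compiled_mean, compiled_scored):
    """Percent reduction in mean tells-per-page, raw → compiled.

    With zero scored compiled samples there is no compiled mean,
    so return None ("inconclusive") instead of a fake 100%.
    """
    if compiled_scored == 0:
        return None
    return round(100 * (raw_mean - compiled_mean) / raw_mean)

reduction(1.40, 0.40, 5)   # 71, gpt-5-codex
reduction(0.40, 1.20, 5)   # -200, Llama 3.3 70B
reduction(2.75, None, 0)   # None, Gemini's inconclusive cell
```

A negative value means the compiled brief produced more tells than the raw one, which is exactly how the two regressions below show up.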
## The reproducible regression
Llama 3.3 70B went from 0.40 raw tells to 1.20 compiled, a −200% change. This is not noise. The earlier 21 April run on the same model, served by Cloudflare Workers AI instead of Hugging Face, landed at −150% in the same direction. Two independent inference providers, same model, same regression. The cross-validation turns this from "a surprising number" into a published finding.
The mechanism is visible in the per-tell data. Llama 3.3's raw output on this brief is typographically thin: few fonts, no grid, minimal CSS, so the linter has almost nothing to fire on. The compiled brief instructs the model to emit an asymmetric 12-column grid, paired typography, spot-colour discipline, and inline rule-citation comments. Llama 3.3 attempts the richer page, which exposes more decision surface for the linter to catch. A more ambitious attempt that isn't quite executed well scores worse than a thin attempt that didn't try.
Practical implication for anyone running Llama 3.3 70B on an editorial-landing brief: the AHD-compiled system prompt does not help you. Use the raw brief, or route the task to a model on which the framework has measured a positive reduction. This is what AHD is for: not to claim victory, but to tell you which tools to reach for.
## The four that moved
gpt-5-codex via Codex CLI produced the cleanest
compiled output: 1.40 mean raw tells, 0.40 compiled, a 71%
reduction. All ten samples extracted cleanly. The compiled CSS
carries inline `rule:` annotations in which the model cites the
token rule behind each decision, exactly the behaviour the
compiled system prompt asks for.
Mistral Small 3.1 via Cloudflare Workers AI cut
its tells by 59%, from 3.40 to 1.40. The compiled prompt
specifically moved Mistral away from
radius-hierarchy, require-named-grid
and require-type-pairing fires, the three rules
that hit 80–100% in raw but ≤20% in compiled. This is the shape
we would want every model to show.
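The per-rule shift described above can be sketched as a fire-rate comparison between the raw and compiled cells. The per-sample tell lists below are hypothetical illustrations, not the run's actual data; only the rule names come from the report.

```python
from collections import Counter

def fire_rates(samples):
    """Fraction of scored samples each lint rule fired on at least once."""
    n = len(samples)
    fires = Counter(rule for tells in samples for rule in set(tells))
    return {rule: count / n for rule, count in fires.items()}

# Hypothetical tell lists for one model cell, n=5 per condition.
raw = [
    ["radius-hierarchy", "require-named-grid"],
    ["radius-hierarchy", "require-type-pairing"],
    ["radius-hierarchy"],
    ["require-named-grid", "require-type-pairing"],
    ["radius-hierarchy", "require-named-grid"],
]
compiled = [[], ["radius-hierarchy"], [], [], []]

raw_rates = fire_rates(raw)            # radius-hierarchy fires on 4/5 raw samples
compiled_rates = fire_rates(compiled)  # ...but only 1/5 compiled samples
moved = {r: raw_rates[r] - compiled_rates.get(r, 0.0) for r in raw_rates}
```

A rule with a large positive delta in `moved` is one the compiled prompt successfully suppressed, which is the "shape" the Mistral cell shows.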
DeepSeek R1 via Hugging Face halved its tells, from 2.00 to 1.00. Only 3/5 raw and 3/5 compiled samples were scored. The other two per cell returned reasoning-only output before emitting an HTML block, which the runner correctly filters out rather than silently counting as clean. The honest number is computed over the scored subset.
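The scored-subset accounting can be sketched as a filter over completions: anything without a full HTML document is excluded before means are computed, rather than silently counted as a zero-tell page. The regex and function names here are illustrative assumptions, not the runner's actual code.

```python
import re

# A completion counts as scorable only if it contains a full HTML document.
HTML_BLOCK = re.compile(r"<!DOCTYPE html.*?</html>", re.DOTALL | re.IGNORECASE)

def extract_html(completion):
    """Return the first complete HTML document in a completion, or None."""
    m = HTML_BLOCK.search(completion)
    return m.group(0) if m else None

def scored_mean(completions, tells_per_page):
    """Mean tells over the scored subset only; None if nothing scored."""
    scored = [t for c, t in zip(completions, tells_per_page)
              if extract_html(c) is not None]
    return sum(scored) / len(scored) if scored else None
```

Reasoning-only output like `<think>…</think>` with no HTML block yields `None` from `extract_html` and drops out of the denominator, which is why DeepSeek's means are computed over 3 of 5 samples per cell.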
Claude Opus 4.7 via the Claude Code CLI went
from 1.40 to 0.80, a 43% reduction. This path is not
bit-identical to the Claude-through-/v1/messages
route. The CLI harness adds framing even with tools disabled
and a custom system prompt, so what we're measuring here is
Claude-through-Claude-Code, not Claude-through-API. The
direction and shape match the other positive cells.
## Why Gemini came back inconclusive
Gemini 2.5 Pro via the Gemini CLI produced usable HTML on 4 of
5 raw samples (mean 2.75 tells), then on the compiled prompt
returned five short acknowledgement messages like
"I have created the ahd-landing-page.html file as requested,
following the detailed brief and style guide." The CLI
interpreted the compiled brief as an agent task ("create this
HTML file") rather than a text-generation request. Because
the Gemini CLI was run with tools disabled
(--approval-mode plan), it couldn't actually write
the file, so it returned nothing useful.
This is not a framework bug. It's a real finding about how the
Gemini CLI routes a richly-specified system prompt. Gemini
through the REST API would likely behave differently; that's a
follow-up measurement for a future run. In the table above the
row shows inconclusive rather than fabricating a
"100% reduction" from zero scored compiled samples.
## The run, fully declared
- Brief: briefs/landing.yml
- Token: swiss-editorial
- Samples per cell: 5
- Max output tokens: 12000
- Rules run per sample: 31 source-level (28 HTML/CSS + 3 SVG)
- Total API/CLI calls: 70
- Scored samples: 62 of 70
## The narrower story
Claude, GPT-5-codex, Mistral and DeepSeek R1 all move in the direction AHD promises on this token. The two Hugging Face open-weight models that regressed did so in different ways: Qwen3 is an 8B model that the compiled brief pushes to attempt more complexity than it can deliver well; Llama 3.3 is a 70B model that produces genuinely ambitious compiled output and trips more rules in the process. Neither regression invalidates the framework; both are what the framework exists to surface. A run that claimed the compiled brief helps every model would be selling. The honest picture is more useful.
## Caveats that still apply
Every caveat from the single-run report holds here. n=5 per cell gives roughly ±35pp Wilson intervals on each per-model percentage. The exact numbers are directional, not precise. The methodology page explains why. Tells-per-page is a proxy for slop fingerprint, not a verdict on design. The canonical model identifiers live in the run manifest alongside the raw samples. The regression signal gets stronger with cross-provider reproduction; absolute percentages tighten only with n≥30, which is a budget decision.
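The ±35pp figure can be sanity-checked with the standard Wilson score interval. This is the generic textbook formula, not the methodology page's code; the worst case is p = 0.5.

```python
from math import sqrt

def wilson_halfwidth(p, n, z=1.96):
    """Half-width of the ~95% Wilson score interval for a proportion."""
    denom = 1 + z * z / n
    return (z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom

# Worst case at n = 5: roughly ±0.33, i.e. about ±33 percentage points,
# in line with the ~±35pp figure quoted above.
wilson_halfwidth(0.5, 5)
# At n = 30 the same worst case tightens to about ±0.17.
wilson_halfwidth(0.5, 30)
```

Tripling the half-width precision costs roughly an order of magnitude more samples, which is why tightening the absolute percentages is a budget decision.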
- Canonical report on disk: docs/evals/2026-04-21-swiss-cross.md
- First run against the same token: 21 April, five-model single-provider
- How to read the numbers: methodology