# AHD · Eval report · 21 April 2026 · cross-provider
Seven models. Four providers. One reproducible regression.
Second run against the swiss-editorial token, across Anthropic,
OpenAI, Google, Cloudflare Workers AI and the Hugging Face router.
n=5 per cell, 70 calls, 62 scored, one cell inconclusive. Four
models move in the direction AHD promises. Two regress. One model
failed extraction in a revealing way. The Llama 3.3 regression
reproduces an earlier measurement on a different provider.
## Per-model slop reduction
| Model | Provider · path | Raw → scored | Compiled → scored | Raw mean | Compiled mean | Reduction |
|---|---|---|---|---|---|---|
| gpt-5-codex | OpenAI · Codex CLI | 5 → 5 | 5 → 5 | 1.40 | 0.40 | 71% |
| @cf/mistralai/mistral-small-3.1-24b-instruct | Cloudflare | 5 → 5 | 5 → 5 | 3.40 | 1.40 | 59% |
| deepseek-ai/DeepSeek-R1 | Hugging Face | 5 → 3 | 5 → 3 | 2.00 | 1.00 | 50% |
| claude-opus-4-7 | Anthropic · Claude Code CLI | 5 → 5 | 5 → 5 | 1.40 | 0.80 | 43% |
| gemini-2.5-pro | Google · Gemini CLI | 5 → 4 | 5 → 0 | 2.75 | · | inconclusive |
| Qwen/Qwen3-8B | Hugging Face | 5 → 5 | 5 → 5 | 1.80 | 2.60 | −44% |
| meta-llama/Llama-3.3-70B-Instruct | Hugging Face | 5 → 5 | 5 → 5 | 0.40 | 1.20 | −200% |
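The Reduction column is the relative drop in mean tells from raw to compiled. A minimal sketch of the computation (the function is illustrative, not the harness's actual code), including the guard that reports Gemini's cell as inconclusive rather than a spurious 100%:

```python
def reduction(raw_mean, compiled_mean, compiled_scored):
    """Percent reduction in mean tells-per-page, raw → compiled.

    With zero scored compiled samples there is no compiled mean,
    so return None ("inconclusive") instead of a fake 100%.
    """
    if compiled_scored == 0:
        return None
    return round(100 * (raw_mean - compiled_mean) / raw_mean)

reduction(1.40, 0.40, 5)   # 71, gpt-5-codex
reduction(0.40, 1.20, 5)   # -200, Llama 3.3 70B
reduction(2.75, None, 0)   # None, Gemini's inconclusive cell
```

A negative value means the compiled brief produced more tells than the raw one, which is exactly how the two regressions below show up.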
## The reproducible regression
Llama 3.3 70B went from 0.40 raw tells to 1.20 compiled, a −200% change. This is not noise. The earlier 21 April run on the same model, served by Cloudflare Workers AI instead of Hugging Face, landed at −150% in the same direction. Two independent inference providers, same model, same regression. The cross-validation turns this from "a surprising number" into a published finding.
The mechanism is visible in the per-tell data. Llama 3.3's raw output on this brief is typographically thin: few fonts, no grid, minimal CSS, so the linter has almost nothing to fire on. The compiled brief instructs the model to emit an asymmetric 12-column grid, paired typography, spot-colour discipline, and inline rule-citation comments. Llama 3.3 attempts the richer page, which exposes more decision surface for the linter to catch. A more ambitious attempt that isn't quite executed well scores worse than a thin attempt that didn't try.
Practical implication for anyone running Llama 3.3 70B on an editorial-landing brief: the AHD-compiled system prompt does not help you. Use the raw brief, or route the task to a model on which the framework has measured a positive reduction. This is what AHD is for: not to claim victory, but to tell you which tools to reach for.
## The four that moved
gpt-5-codex via Codex CLI produced the cleanest
compiled output: 1.40 mean raw tells, 0.40 compiled, a 71%
reduction. All ten samples extracted cleanly. The compiled CSS
carries inline `rule:` annotations in which the model cites the
token rule behind each decision, exactly the behaviour the
compiled system prompt asks for.
Mistral Small 3.1 via Cloudflare Workers AI cut
its tells by 59%, from 3.40 to 1.40. The compiled prompt
specifically moved Mistral away from
radius-hierarchy, require-named-grid
and require-type-pairing fires, the three rules
that hit 80–100% in raw but ≤20% in compiled. This is the shape
we would want every model to show.
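The per-rule shift described above can be sketched as a fire-rate comparison between the raw and compiled cells. The per-sample tell lists below are hypothetical illustrations, not the run's actual data; only the rule names come from the report.

```python
from collections import Counter

def fire_rates(samples):
    """Fraction of scored samples each lint rule fired on at least once."""
    n = len(samples)
    fires = Counter(rule for tells in samples for rule in set(tells))
    return {rule: count / n for rule, count in fires.items()}

# Hypothetical tell lists for one model cell, n=5 per condition.
raw = [
    ["radius-hierarchy", "require-named-grid"],
    ["radius-hierarchy", "require-type-pairing"],
    ["radius-hierarchy"],
    ["require-named-grid", "require-type-pairing"],
    ["radius-hierarchy", "require-named-grid"],
]
compiled = [[], ["radius-hierarchy"], [], [], []]

raw_rates = fire_rates(raw)            # radius-hierarchy fires on 4/5 raw samples
compiled_rates = fire_rates(compiled)  # ...but only 1/5 compiled samples
moved = {r: raw_rates[r] - compiled_rates.get(r, 0.0) for r in raw_rates}
```

A rule with a large positive delta in `moved` is one the compiled prompt successfully suppressed, which is the "shape" the Mistral cell shows.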
DeepSeek R1 via Hugging Face halved its tells, from 2.00 to 1.00. Only 3/5 raw and 3/5 compiled samples were scored. The other two per cell returned reasoning-only output before emitting an HTML block, which the runner correctly filters out rather than silently counting as clean. The honest number is computed over the scored subset.
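The scored-subset accounting can be sketched as a filter over completions: anything without a full HTML document is excluded before means are computed, rather than silently counted as a zero-tell page. The regex and function names here are illustrative assumptions, not the runner's actual code.

```python
import re

# A completion counts as scorable only if it contains a full HTML document.
HTML_BLOCK = re.compile(r"<!DOCTYPE html.*?</html>", re.DOTALL | re.IGNORECASE)

def extract_html(completion):
    """Return the first complete HTML document in a completion, or None."""
    m = HTML_BLOCK.search(completion)
    return m.group(0) if m else None

def scored_mean(completions, tells_per_page):
    """Mean tells over the scored subset only; None if nothing scored."""
    scored = [t for c, t in zip(completions, tells_per_page)
              if extract_html(c) is not None]
    return sum(scored) / len(scored) if scored else None
```

Reasoning-only output like `<think>…</think>` with no HTML block yields `None` from `extract_html` and drops out of the denominator, which is why DeepSeek's means are computed over 3 of 5 samples per cell.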
Claude Opus 4.7 via the Claude Code CLI went
from 1.40 to 0.80, a 43% reduction. This path is not
bit-identical to the Claude-through-/v1/messages
route. The CLI harness adds framing even with tools disabled
and a custom system prompt, so what we're measuring here is
Claude-through-Claude-Code, not Claude-through-API. The
direction and shape match the other positive cells.
## Why Gemini came back inconclusive
Gemini 2.5 Pro via the Gemini CLI produced usable HTML on 4 of
5 raw samples (mean 2.75 tells), then on the compiled prompt
returned five short acknowledgement messages like
"I have created the ahd-landing-page.html file as requested,
following the detailed brief and style guide." The CLI
interpreted the compiled brief as an agent task ("create this
HTML file") rather than a text-generation request. Because
the Gemini CLI was run with tools disabled
(--approval-mode plan), it couldn't actually write
the file, so it returned nothing useful.
This is not a framework bug. It's a real finding about how the
Gemini CLI routes a richly-specified system prompt. Gemini
through the REST API would likely behave differently; that's a
follow-up measurement for a future run. In the table above the
row shows inconclusive rather than fabricating a
"100% reduction" from zero scored compiled samples.
## The run, fully declared
- Brief: briefs/landing.yml
- Token: swiss-editorial
- Samples per cell: 5
- Max output tokens: 12000
- Rules run per sample: 31 source-level (28 HTML/CSS + 3 SVG)
- Total API/CLI calls: 70
- Scored samples: 62 of 70
## The narrower story
Claude, GPT-5-codex, Mistral and DeepSeek R1 all move in the direction AHD promises on this token. The two Hugging Face open-weight models that regressed did so in different ways: Qwen3 is an 8B model that the compiled brief pushes to attempt more complexity than it can deliver well; Llama 3.3 is a 70B model that produces genuinely ambitious compiled output and trips more rules in the process. Neither regression invalidates the framework; both are what the framework exists to surface. A run that claimed the compiled brief helps every model would be selling. The honest picture is more useful.
## Caveats that still apply
Every caveat from the single-run report holds here. n=5 per cell gives roughly ±35pp Wilson intervals on each per-model percentage. The exact numbers are directional, not precise. The methodology page explains why. Tells-per-page is a proxy for slop fingerprint, not a verdict on design. The canonical model identifiers live in the run manifest alongside the raw samples. The regression signal gets stronger with cross-provider reproduction; absolute percentages tighten only with n≥30, which is a budget decision.
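The ±35pp figure can be sanity-checked with the standard Wilson score interval. This is the generic textbook formula, not the methodology page's code; the worst case is p = 0.5.

```python
from math import sqrt

def wilson_halfwidth(p, n, z=1.96):
    """Half-width of the ~95% Wilson score interval for a proportion."""
    denom = 1 + z * z / n
    return (z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom

# Worst case at n = 5: roughly ±0.33, i.e. about ±33 percentage points,
# in line with the ~±35pp figure quoted above.
wilson_halfwidth(0.5, 5)
# At n = 30 the same worst case tightens to about ±0.17.
wilson_halfwidth(0.5, 30)
```

Tripling the half-width precision costs roughly an order of magnitude more samples, which is why tightening the absolute percentages is a budget decision.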
- Canonical report on disk: docs/evals/2026-04-21-swiss-cross.md
- First run against the same token: 21 April, five-model single-provider
- How to read the numbers: methodology