AHD · Eval report · 22 April 2026 · cross-provider n=30
Ten models. Thirty samples each. Six hundred measured.
The first run against swiss-editorial at n=30 per
cell. Ten models, two conditions, six hundred samples. Eight
cells show positive reductions under the compiled prompt, one
is effectively flat, one regresses. Three frontier cells were
served via subscription CLIs (Claude Code, Codex, Gemini); seven
OSS cells via Cloudflare Workers AI. The Wilson interval on each
per-cell percentage tightens from roughly ±35 points at n=5 to
roughly ±18 points at n=30: the shape of each result is now
trustworthy to a resolution the n=5 run could not support.
Per-model slop reduction
| Model | Provider · path | Raw → scored | Compiled → scored | Raw mean | Compiled mean | Reduction |
|---|---|---|---|---|---|---|
| @cf/openai/gpt-oss-120b | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.50 | 0.77 | 78.1% |
| @cf/mistralai/mistral-small-3.1-24b-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.47 | 1.30 | 62.5% |
| @cf/moonshotai/kimi-k2.6 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.67 | 1.00 | 62.5% |
| gemini-3.1-pro-preview | Google · Gemini CLI | 30 → 30 | 30 → 30 | 2.97 | 1.13 | 61.8% |
| claude-opus-4-7 | Anthropic · Claude Code CLI | 30 → 30 | 30 → 30 | 1.80 | 0.73 | 59.3% |
| @cf/google/gemma-4-26b-a4b-it | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.67 | 1.37 | 48.7% |
| @cf/meta/llama-4-scout-17b-16e-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.17 | 1.40 | 35.4% |
| gpt-5.4 | OpenAI · Codex CLI | 30 → 30 | 30 → 30 | 1.23 | 1.00 | 18.9% |
| @cf/qwen/qwen3-30b-a3b-fp8 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 1.90 | 1.73 | 8.8% |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | Cloudflare Workers AI | 30 → 29 | 30 → 30 | 0.28 | 0.60 | −117.5% |
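The Reduction column is the relative drop in mean tells per page. A minimal sketch of the computation; note the published percentages are derived from unrounded per-cell means, so recomputing from the two rounded means shown in the table drifts by a few tenths of a point:

```python
def reduction_pct(raw_mean: float, compiled_mean: float) -> float:
    """Relative drop in mean tells per page, as a percentage.

    Negative values indicate a regression: more tells under the
    compiled prompt than under the raw brief.
    """
    return (raw_mean - compiled_mean) / raw_mean * 100.0

# Recomputed from the rounded table means (expect small drift
# against the published figures, which use unrounded means):
print(round(reduction_pct(3.50, 0.77), 1))  # gpt-oss-120b, → 78.0
print(round(reduction_pct(0.28, 0.60), 1))  # Llama 3.3, → -114.3
```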
What the numbers say
Eight of ten cells reduce tells under the compiled prompt. The median reduction across the nine non-regressing cells sits around 59 percent. The top five cells land inside a tight band between 59 and 78 percent, even though they span three different providers (Cloudflare, Anthropic via Claude Code, Google via Gemini CLI) and five different model families (gpt-oss, Mistral, Kimi, Gemini 3, Claude Opus). The result is not a single-model or single-provider artefact.
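The median figure re-derives directly from the published per-cell numbers in the table:

```python
import statistics

# Per-cell reductions for the nine non-regressing cells,
# taken from the table above (percent).
reductions = [78.1, 62.5, 62.5, 61.8, 59.3, 48.7, 35.4, 18.9, 8.8]

print(statistics.median(reductions))  # → 59.3
```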
Two cells move weakly. gpt-5.4 via Codex CLI drops
from 1.23 raw tells to 1.00 compiled, an 18.9 percent reduction.
The raw baseline is already low: Codex-served gpt-5.4 produces
a fairly restrained page by default, so there is less surface
for the compiled prompt to correct. @cf/qwen/qwen3-30b-a3b-fp8
moves from 1.90 to 1.73, an 8.8 percent reduction. This is
inside the Wilson interval and should be read as flat, not as
a small win.
The single regression
@cf/meta/llama-3.3-70b-instruct-fp8-fast
regressed from 0.28 raw tells to 0.60 compiled, a −117.5 percent
change. This reproduces the same-direction regression
seen at n=5 on both Cloudflare and Hugging Face serving paths
in the 21 April
cross-provider run. The mechanism is the same: Llama 3.3's
raw output on this brief is typographically thin (few fonts,
minimal CSS, no grid), so the linter has little to fire on. The
compiled brief instructs the model to emit an asymmetric grid,
paired typography, spot-colour discipline, and inline
rule: annotations. Llama 3.3 attempts the richer page and
exposes more decision surface, which the linter catches. A more
ambitious attempt that isn't quite executed scores worse than a
thin attempt that never tried.
Practical implication: if you are running Llama 3.3 70B on an editorial-landing brief, the AHD-compiled system prompt does not help you. Use the raw brief, or route the task to one of the eight cells above that measure positive. The framework's purpose is to tell you which tools to reach for, not to claim universal victory.
Two notes on serving
The Kimi K2.6 cell required a chat-template fix.
Kimi K2.6 on Cloudflare Workers AI defaults to thinking mode on;
the compiled system prompt (roughly 9 KB) exhausted the token
budget before any visible content was emitted, and the taxonomy
cannot score zero-byte pages. We patched the runner to pass
chat_template_kwargs: { thinking: false }
at request time, which disables Kimi's thinking trace before
inference. After the patch the cell ran clean at 30/30. This
is a serving-layer defect, not a design-slop tell, and lives
in the serving-tells
catalog rather than the main taxonomy.
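A sketch of what the patched request body looks like, assuming the Workers AI OpenAI-compatible chat endpoint accepts chat_template_kwargs as a top-level field alongside the standard chat payload (the helper name and prompt strings here are illustrative, not from the runner):

```python
import json

def build_kimi_request(system_prompt: str, user_prompt: str) -> str:
    """Build a chat-completion body with Kimi's thinking trace disabled.

    chat_template_kwargs.thinking = false is the serving-layer switch
    described above; everything else is a standard chat payload.
    """
    body = {
        "model": "@cf/moonshotai/kimi-k2.6",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        # Disable the thinking trace so the ~9 KB compiled system
        # prompt does not exhaust the budget before visible output.
        "chat_template_kwargs": {"thinking": False},
    }
    return json.dumps(body)
```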
Three frontier cells used their provider's CLI, not the raw HTTP API. Claude Opus via Claude Code CLI, gpt-5.4 via Codex CLI, Gemini 3.1 Pro via Gemini CLI. This is the path most humans actually use for these models today, which makes the CLI path more ecologically valid than the API path for frontier cells. The OSS cells use the Cloudflare Workers AI OpenAI-compatible endpoint because that is the one free-tier path the community can reproduce without paying. Both serving paths are documented in the run manifest.
How to read these numbers
A per-cell reduction percentage has a confidence interval, and at n=30 the Wilson interval on each cell's raw-versus-compiled proportion is roughly ±18 points, down from the roughly ±35 points that n=5 delivers. The top five cells (59 to 78 percent) each clear that interval comfortably, and each cell's signal survives the interval independently of the others. The weak cells (gpt-5.4 at 19 percent, Qwen 3 at 9 percent) sit inside the band and should be read as uncertain in direction. The Llama 3.3 regression at −117 percent is well outside the band and, critically, reproduces the direction of the 21 April n=5 result, so the cross-time stability of the regression adds independent weight.
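The quoted interval widths are consistent with the worst-case (p = 0.5) Wilson half-width at 95 percent confidence, which lands close to the rounded figures above:

```python
import math

def wilson_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a proportion p
    estimated from n samples, at confidence given by z."""
    denom = 1 + z * z / n
    return z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom

# Worst case p = 0.5: roughly ±33 points at n=5 and ±17 at n=30,
# in line with the "roughly ±35" and "roughly ±18" quoted above.
print(round(wilson_half_width(0.5, 5), 2))   # → 0.33
print(round(wilson_half_width(0.5, 30), 2))  # → 0.17
```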
None of this establishes a leaderboard. A different brief and a different token will re-order the table. The methodology page explains what this run measures and what it deliberately does not.
Scope
This run measures one brief against one token against one
surface: briefs/landing.yml compiled under
swiss-editorial, rendered as editorial-landing
HTML. The reductions above should not be read as broad
taxonomy performance: a different token (say,
post-digital-green) or a different brief (a
docs-landing, a dashboard) will produce a different ordering.
External validity needs independent runs along the
different-token-same-brief and different-brief-same-token axes;
both are queued as follow-ups. The 38 source-level rules scored
here cover roughly three quarters of the taxonomy; the fourteen
vision rules on rendered pixels are not scored in this pipeline
and are queued as a separate pass.
Caveats that still apply
Every caveat from earlier runs holds here. Tells-per-page is a proxy for slop fingerprint, not a verdict on design: a page can pass every rule and still be bad. The vision-only rules (fourteen at time of writing) are not scored in the source-lint pipeline. Canonical model identifiers and serving paths live in the run manifest alongside the raw samples. The compiled prompt is deterministic from the brief and token; both are versioned and reproducible. Re-running the same cells tomorrow will not produce identical HTML, because sampling is stochastic, but it should produce distributions that land within the Wilson interval around the numbers above.
Canonical report on disk:
docs/evals/2026-04-22-swiss-n30.md.
Prior runs against the same token:
21 April, seven-model cross-provider;
21 April, five-model narrow-roster.
How to read the numbers: methodology.
Contribute a run: submission protocol.