AHD · Eval report · 22 April 2026 · cross-provider n=30
Ten models. Thirty samples each. Six hundred measured.
The first run against swiss-editorial at n=30 per
cell. Ten models, two conditions, six hundred samples. Eight
cells show positive reductions under the compiled prompt, one
is effectively flat, one regresses. Three frontier cells were
served via subscription CLIs (Claude Code, Codex, Gemini); seven
OSS cells via Cloudflare Workers AI. The Wilson interval on each
per-cell percentage tightens from roughly ±35 points at n=5 to
roughly ±18 points at n=30: the shape of each result is now
trustworthy to a resolution the n=5 run could not support.
Per-model slop reduction
| Model | Provider · path | Raw → scored | Compiled → scored | Raw mean | Compiled mean | Reduction |
|---|---|---|---|---|---|---|
| @cf/openai/gpt-oss-120b | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.50 | 0.77 | 78.1% |
| @cf/mistralai/mistral-small-3.1-24b-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.47 | 1.30 | 62.5% |
| @cf/moonshotai/kimi-k2.6 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.67 | 1.00 | 62.5% |
| gemini-3.1-pro-preview | Google · Gemini CLI | 30 → 30 | 30 → 30 | 2.97 | 1.13 | 61.8% |
| claude-opus-4-7 | Anthropic · Claude Code CLI | 30 → 30 | 30 → 30 | 1.80 | 0.73 | 59.3% |
| @cf/google/gemma-4-26b-a4b-it | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.67 | 1.37 | 48.7% |
| @cf/meta/llama-4-scout-17b-16e-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.17 | 1.40 | 35.4% |
| gpt-5.4 | OpenAI · Codex CLI | 30 → 30 | 30 → 30 | 1.23 | 1.00 | 18.9% |
| @cf/qwen/qwen3-30b-a3b-fp8 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 1.90 | 1.73 | 8.8% |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | Cloudflare Workers AI | 30 → 29 | 30 → 30 | 0.28 | 0.60 | −117.5% |
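The Reduction column is the relative drop in mean tells per page. A minimal sketch of the computation; note the published percentages are derived from unrounded per-cell means, so recomputing from the two rounded means shown in the table drifts by a few tenths of a point:

```python
def reduction_pct(raw_mean: float, compiled_mean: float) -> float:
    """Relative drop in mean tells per page, as a percentage.

    Negative values indicate a regression: more tells under the
    compiled prompt than under the raw brief.
    """
    return (raw_mean - compiled_mean) / raw_mean * 100.0

# Recomputed from the rounded table means (expect small drift
# against the published figures, which use unrounded means):
print(round(reduction_pct(3.50, 0.77), 1))  # gpt-oss-120b, → 78.0
print(round(reduction_pct(0.28, 0.60), 1))  # Llama 3.3, → -114.3
```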
What the numbers say
Eight of ten cells reduce tells under the compiled prompt. The median reduction across the nine non-regressing cells sits around 59 percent. The top five cells land inside a tight band between 59 and 78 percent, even though they span three different providers (Cloudflare, Anthropic via Claude Code, Google via Gemini CLI) and five different model families (gpt-oss, Mistral, Kimi, Gemini 3, Claude Opus). The result is not a single-model or single-provider artefact.
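The median figure re-derives directly from the published per-cell numbers in the table:

```python
import statistics

# Per-cell reductions for the nine non-regressing cells,
# taken from the table above (percent).
reductions = [78.1, 62.5, 62.5, 61.8, 59.3, 48.7, 35.4, 18.9, 8.8]

print(statistics.median(reductions))  # → 59.3
```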
Two cells move weakly. gpt-5.4 via Codex CLI drops
from 1.23 raw tells to 1.00 compiled, an 18.9 percent reduction.
The raw baseline is already low: Codex-served gpt-5.4 produces
a fairly restrained page by default, so there is less surface
for the compiled prompt to correct. @cf/qwen/qwen3-30b-a3b-fp8
moves from 1.90 to 1.73, an 8.8 percent reduction. This is
inside the Wilson interval and should be read as flat, not as
a small win.
The single regression
@cf/meta/llama-3.3-70b-instruct-fp8-fast
regressed from 0.28 raw tells to 0.60 compiled, a −117.5 percent
change. This reproduces the same-direction regression
seen at n=5 on both Cloudflare and Hugging Face serving paths
in the 21 April
cross-provider run. The mechanism is the same: Llama 3.3's
raw output on this brief is typographically thin (few fonts,
minimal CSS, no grid), so the linter has little to fire on. The
compiled brief instructs the model to emit an asymmetric grid,
paired typography, spot-colour discipline, and inline
rule: annotations. Llama 3.3 attempts the richer page and
exposes more decision surface, which the linter catches. A more
ambitious attempt that isn't quite executed scores worse than a
thin attempt that never tried.
Practical implication: if you are running Llama 3.3 70B on an editorial-landing brief, the AHD-compiled system prompt does not help you. Use the raw brief, or route the task to one of the eight cells above that measure positive. The framework's purpose is to tell you which tools to reach for, not to claim universal victory.
Two notes on serving
The Kimi K2.6 cell required a chat-template fix.
Kimi K2.6 on Cloudflare Workers AI defaults to thinking mode on;
the compiled system prompt (roughly 9 KB) exhausted the token
budget before any visible content was emitted, and the taxonomy
cannot score zero-byte pages. We patched the runner to pass
chat_template_kwargs: { thinking: false }
at request time, which disables Kimi's thinking trace before
inference. After the patch the cell ran clean at 30/30. This
is a serving-layer defect, not a design-slop tell, and lives
in the serving-tells
catalog rather than the main taxonomy.
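A sketch of what the patched request body looks like, assuming the Workers AI OpenAI-compatible chat endpoint accepts chat_template_kwargs as a top-level field alongside the standard chat payload (the helper name and prompt strings here are illustrative, not from the runner):

```python
import json

def build_kimi_request(system_prompt: str, user_prompt: str) -> str:
    """Build a chat-completion body with Kimi's thinking trace disabled.

    chat_template_kwargs.thinking = false is the serving-layer switch
    described above; everything else is a standard chat payload.
    """
    body = {
        "model": "@cf/moonshotai/kimi-k2.6",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        # Disable the thinking trace so the ~9 KB compiled system
        # prompt does not exhaust the budget before visible output.
        "chat_template_kwargs": {"thinking": False},
    }
    return json.dumps(body)
```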
Three frontier cells used their provider's CLI, not the raw HTTP API. Claude Opus via Claude Code CLI, gpt-5.4 via Codex CLI, Gemini 3.1 Pro via Gemini CLI. This is the path most humans actually use for these models today, which makes the CLI path more ecologically valid than the API path for frontier cells. The OSS cells use the Cloudflare Workers AI OpenAI-compatible endpoint because that is the one free-tier path the community can reproduce without paying. Both serving paths are documented in the run manifest.
How to read these numbers
A per-cell reduction percentage has a confidence interval, and at n=30 the Wilson interval on each cell's raw-versus-compiled proportion is roughly ±18 points, down from the roughly ±35 points that n=5 delivers. The top five cells (59 to 78 percent) each clear that interval comfortably, and each cell's signal survives the interval independently of the others. The weak cells (gpt-5.4 at 19 percent, Qwen 3 at 9 percent) sit inside the band and should be read as uncertain in direction. The Llama 3.3 regression at −117 percent is well outside the band and, critically, reproduces the direction of the 21 April n=5 result, so the cross-time stability of the regression adds independent weight.
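The quoted interval widths are consistent with the worst-case (p = 0.5) Wilson half-width at 95 percent confidence, which lands close to the rounded figures above:

```python
import math

def wilson_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a proportion p
    estimated from n samples, at confidence given by z."""
    denom = 1 + z * z / n
    return z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom

# Worst case p = 0.5: roughly ±33 points at n=5 and ±17 at n=30,
# in line with the "roughly ±35" and "roughly ±18" quoted above.
print(round(wilson_half_width(0.5, 5), 2))   # → 0.33
print(round(wilson_half_width(0.5, 30), 2))  # → 0.17
```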
None of this establishes a leaderboard. A different brief and a different token will re-order the table. The methodology page explains what this run measures and what it deliberately does not.
Scope
This run measures one brief against one token against one
surface: briefs/landing.yml compiled under
swiss-editorial, rendered as editorial-landing
HTML. The reductions above should not be read as broad
taxonomy performance: a different token (say,
post-digital-green) or a different brief (a
docs-landing, a dashboard) will produce a different ordering.
External validity needs independent runs along the
different-token-same-brief and different-brief-same-token axes;
both are queued as follow-ups. The 38 source-level rules scored
here cover roughly three quarters of the taxonomy; the fourteen
vision rules on rendered pixels are not scored in this pipeline
and are queued as a separate pass.
Caveats that still apply
Every caveat from earlier runs holds here. Tells-per-page is a proxy for slop fingerprint, not a verdict on design: a page can pass every rule and still be bad. The vision-only rules (fourteen at time of writing) are not scored in the source-lint pipeline. Canonical model identifiers and serving paths live in the run manifest alongside the raw samples. The compiled prompt is deterministic from the brief and token; both are versioned and reproducible. Re-running the same cells tomorrow will not produce identical HTML, because sampling is stochastic, but it should produce distributions that land within the Wilson interval around the numbers above.
Canonical report on disk:
docs/evals/2026-04-22-swiss-n30.md.
Prior runs against the same token:
21 April, seven-model cross-provider;
21 April, five-model narrow-roster.
How to read the numbers: methodology.
Contribute a run: submission protocol.