AHD · Eval report · 24 April 2026 · cross-provider n=30

Same brief, different token. Eleven models, thirty samples each.

Name: AHD Eval · post-digital-green cross-provider · 24 April 2026 · n=30
Creator: Ad Astra Computing Inc
Published: 2026-04-24
License: https://opensource.org/license/mit

The different-token-same-brief triangulation the 22 April report called out in its caveats. Same briefs/landing.yml, same n=30 per cell, same two-condition pairing. One variable changed: the style token. Where the 22 April run used swiss-editorial (paper-and-ink Helvetica, default editorial conventions), post-digital-green is the opposite: single-monospace terminal aesthetic, OKLCH green palette, char-grid layout, rectangles only, conservative weight palette, no display type. External-validity testing against a token that rejects the default design conventions every lint rule was written around. Eight of eleven cells regress under the compiled prompt. The mechanism is not a compiler defect.

Addendum · 25 April 2026 · token-aware re-lint

Token-aware linting has shipped. Six cells now positive.

The Next engineering step section below queued token-aware linting as the response to this run's headline finding. That work has landed: tokens carry an explicit lint-overrides block, the linter consumes it, and rules a token has opted out of are silenced when scoring output produced under that token. Re-linting the same 660 samples under the new ruleset produces a different verdict. The original pre-fix numbers below are preserved.

Pre-fix versus post-fix, by cell

Model	Pre-fix Δ	Post-fix Δ	Direction
`@cf/google/gemma-4-26b-a4b-it`	−10.5%	+50.0%	flipped to positive
`@cf/openai/gpt-oss-120b`	+19.8%	+47.6%	doubled
`@cf/moonshotai/kimi-k2.6`	−14.4%	+30.5%	flipped to positive
`gemini-3.1-pro-preview`	−9.9%	+26.3%	flipped to positive
`@cf/mistralai/mistral-small-3.1-24b-instruct`	+34.7%	+21.7%	still positive, less margin
`@cf/meta/llama-4-scout-17b-16e-instruct`	+4.8%	+3.2%	flat both readings
`gpt-5.5`	−135.5%	−9.1%	regression nearly closed
`gpt-5.4`	−78.6%	−36.4%	still regressed, half as bad
`claude-opus-4-7`	−172.9%	−67.6%	still regressed, 60% closed
`@cf/qwen/qwen3-30b-a3b-fp8`	−17.6%	−87.5%	worse, see note
`@cf/meta/llama-3.3-70b-instruct-fp8-fast`	−100.0%	−200.0%	worse, see note

What flipped. Eight of eleven cells regressed pre-fix. Six are positive post-fix, including all three cells that flipped sign (gemma, kimi, gemini). The two previously-strongest reductions (mistral, gpt-oss) remain positive; gpt-oss now leads at 47.6 percent.

Five cells still regress. Llama 3.3 lands at the lowest absolute baseline of the run (0.10 raw, 0.30 compiled), so a 0.20 increase reads as a −200 percent change without saying much about model behaviour. The frontier cells (claude, gpt-5.4, gpt-5.5) still regress at narrower margins because rules outside the suppression list are still firing on their compiled output. Token-aware lint is not the universal fix.

What this is and is not. This is the framework correctly recognising that post-digital-green opts out of three editorial defaults the linter would otherwise fire on. It is not evidence that AHD compilation universally improves output. A different token with a different opt-out list will produce a different verdict on different cells, and tokens that opt out of nothing reproduce none of the shift seen here. External validity still wants a different-brief-same-token pass.

Re-linting ran against the same sample bytes on disk. Reproduce via ahd eval post-digital-green --samples evals --report <file> (token-aware lint is the default; pass --raw-rules to recover the pre-fix numbers).

Per-model result (pre-fix, original measurement)

Model	Provider · path	Raw → scored	Compiled → scored	Raw mean	Compiled mean	Reduction
`@cf/mistralai/mistral-small-3.1-24b-instruct`	Cloudflare Workers AI	30 → 30	30 → 30	3.37	2.20	34.7%
`@cf/openai/gpt-oss-120b`	Cloudflare Workers AI	30 → 30	30 → 30	3.03	2.43	19.8%
`@cf/meta/llama-4-scout-17b-16e-instruct`	Cloudflare Workers AI	30 → 30	30 → 30	2.10	2.00	4.8%
`gemini-3.1-pro-preview`	Google · Gemini CLI	30 → 30	30 → 26	2.80	3.08	−9.9%
`@cf/google/gemma-4-26b-a4b-it`	Cloudflare Workers AI	30 → 30	30 → 30	2.53	2.80	−10.5%
`@cf/moonshotai/kimi-k2.6`	Cloudflare Workers AI	30 → 30	30 → 30	3.00	3.43	−14.4%
`@cf/qwen/qwen3-30b-a3b-fp8`	Cloudflare Workers AI	30 → 30	30 → 30	1.70	2.00	−17.6%
`gpt-5.4`	OpenAI · Codex CLI	30 → 30	30 → 30	1.40	2.50	−78.6%
`@cf/meta/llama-3.3-70b-instruct-fp8-fast`	Cloudflare Workers AI	30 → 30	30 → 30	0.30	0.60	−100.0%
`gpt-5.5`	OpenAI · Codex CLI	30 → 30	30 → 30	1.03	2.43	−135.5%
`claude-opus-4-7`	Anthropic · Claude Code CLI	30 → 30	30 → 30	1.60	4.37	−172.9%

What the triangulation surfaced

The headline is not a number. It is a property of the framework itself: AHD's compiler and its linter are not coordinated on the token. The compiler reads post-digital-green and transmits its constraints faithfully (single monospace face, zero radius, conservative weight palette, minimal line-height variety). Models receive those constraints and follow them. The linter then scores the output against rules that embed editorial-default assumptions: require-type-pairing expects two faces, weight-variety expects more than two weights, radius-hierarchy expects sharp-vs-soft contrast. The rules fire on output that was correct for the token it was built against.

This finding the 22 April single-token run could not expose. swiss-editorial and the default lint rules happen to share conventions; no mismatch triggered. Post-digital-green, built deliberately against those conventions, makes the mismatch load-bearing. Triangulation did what triangulation is supposed to do: surfaced the real limit of the current ruleset.

A second, weaker pattern shows up in the table: the cleaner the model's raw baseline, the harder it regresses under compiled. Claude (raw 1.60), gpt-5.5 (raw 1.03), and gpt-5.4 (raw 1.40) sit at the bottom of the reduction column. Llama 3.3, with the lowest raw baseline at 0.30, regresses too but from a base where the absolute increase is small. The frontier cells' tighter raw output gives the editorially-opinionated rules more relative ground to lose when the compiler pushes them toward AHD's house style.

gpt-5.5

OpenAI released gpt-5.5 during the run window, so the model is included alongside the other frontier cells. Two observations stand out. gpt-5.5 produces the cleanest raw baseline of any frontier model in this run (1.03 tells per page). Its HTML averages roughly 8 KB per sample under both conditions, where gpt-5.4 averages 11 KB raw and 8.5 KB compiled. The new model produces tighter output by default and does not condense further under the compiled prompt. The same brief produces about a quarter less HTML on the new model.

What the compiler does correctly

Open any compiled post-digital-green sample from this run. The CSS declares the OKLCH green palette, the 80-column char-grid layout, Berkeley Mono with the standard fallback chain, border-radius: 0 everywhere, the token-specified body line-height, and the conservative weight palette. The compiler transmitted the token to the model. The model followed. The rendered page is token-faithful.

The mechanical rules confirm this from the lint side. tracking-per-size drops to 3% or below across the board under compiled (was as high as 53% raw on Qwen). line-height-per-size drops dramatically on the cells that previously failed it. require-named-grid drops on Mistral, Scout, Gemma, gpt-oss. The rules that measure typographic hygiene independent of editorial style improve under compilation. The rules that bake in editorial defaults are the ones that fire.

Which rules fire on the token-correct output

Two rules dominate the regression. require-type-pairing fires at 92% to 100% on compiled across nine of eleven cells (the exceptions are llama-3.3 at 30% and llama-4-scout at 100% raw and compiled). weight-variety fires at 70% to 100% on compiled across eight cells. Both rules are editorially opinionated against the post-digital-green specification, which explicitly requires a single monospace face and a conservative weight palette. Compiled output that obeys the token by design necessarily fails both rules.

A third pattern, less expected: respect-reduced-motion fires at 27% to 97% on compiled across six cells (versus 0% to 10% on raw). The compiled prompt appears to introduce motion declarations that do not check prefers-reduced-motion. This is a legitimate finding, not a token mismatch: it points at a gap in the compiled prompt itself rather than at the lint layer.

What this does not invalidate

The 22 April swiss-editorial result. Rules and token share conventions there. Nine of ten positive reductions stand.
The taxonomy. The thirty-nine tells are observations about AI-generated design failures, independent of which token tells the model to avoid them.
The compiler. It is doing its job. The finding is at the lint layer, not at the compiler.

Next engineering step (at original publication)

Token-aware linting. Each shipped token already declares what it requires (grid, palette, type). It needs a second field that names the rules it explicitly overrides. The linter reads the active token from the compiled output's header comment or from a <meta name="ahd-token"> anchor, and silences or downgrades rules the token has opted out of. Rules a token has not opted out of continue to fire normally. The reduced-motion gap in the compiled prompt is queued behind it.

Both items have shipped since publication. The token-aware re-lint addendum at the top of this page documents the verdict shift on the same samples; the reduced-motion guidance now lives in every compiled prompt.

Caveats

One brief, one token, one surface. The triangulation axis this run tests is the token axis. A different-brief-same-token run is still queued as a separate validity test.
Source-linter only. Thirty-eight source rules scored here. The fourteen vision rules on rendered pixels are not evaluated in this pipeline; they live in the critic and run on screenshots.
Tells-per-page is a proxy. A token constrained to rectangles and monospace has less surface for variety-oriented rules to reward; interpret magnitudes against the per-tell table in the canonical report, not in isolation.
Gemini compiled scored 26 of 30. Four samples returned non-scorable output (three near-empty HTML stubs, one runtime error). The same intermittent behaviour observed on the 22 April run.
Model versions change. Canonical model IDs and serving paths are in the per-sample envelopes and the run manifest.

Canonical report on disk: docs/evals/2026-04-24-post-digital-green-n30.md. Run manifest: evals/post-digital-green/manifest.json. Per-sample viewer (linter annotations on every sample): /evals/2026-04-24/samples/<cell>/<condition>/<id>. Sibling run on the other axis: 22 April, swiss-editorial n=30. How to read the numbers: methodology. Contribute a run: submission protocol.