AHD · Eval report · 24 April 2026 · cross-provider n=30
Same brief, different token. Eleven models, thirty samples each.
This is the different-token-same-brief triangulation that the
22 April report
called out in its caveats. Same briefs/landing.yml,
same n=30 per cell, same two-condition pairing. One variable
changed: the style token. Where the 22 April run used
swiss-editorial (paper-and-ink Helvetica, default
editorial conventions), post-digital-green is the
opposite: single-monospace terminal aesthetic, OKLCH green
palette, char-grid layout, rectangles only, conservative weight
palette, no display type. External-validity testing against a
token that rejects the default design conventions every lint
rule was written around. Eight of eleven cells regress
under the compiled prompt. The mechanism is not a
compiler defect.
Per-model results (pre-fix, original measurement)
| Model | Provider · path | Raw → scored | Compiled → scored | Raw mean | Compiled mean | Reduction |
|---|---|---|---|---|---|---|
| @cf/mistralai/mistral-small-3.1-24b-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.37 | 2.20 | 34.7% |
| @cf/openai/gpt-oss-120b | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.03 | 2.43 | 19.8% |
| @cf/meta/llama-4-scout-17b-16e-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.10 | 2.00 | 4.8% |
| gemini-3.1-pro-preview | Google · Gemini CLI | 30 → 30 | 30 → 26 | 2.80 | 3.08 | −9.9% |
| @cf/google/gemma-4-26b-a4b-it | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.53 | 2.80 | −10.5% |
| @cf/moonshotai/kimi-k2.6 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.00 | 3.43 | −14.4% |
| @cf/qwen/qwen3-30b-a3b-fp8 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 1.70 | 2.00 | −17.6% |
| gpt-5.4 | OpenAI · Codex CLI | 30 → 30 | 30 → 30 | 1.40 | 2.50 | −78.6% |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 0.30 | 0.60 | −100.0% |
| gpt-5.5 | OpenAI · Codex CLI | 30 → 30 | 30 → 30 | 1.03 | 2.43 | −135.5% |
| claude-opus-4-7 | Anthropic · Claude Code CLI | 30 → 30 | 30 → 30 | 1.60 | 4.37 | −172.9% |
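The reduction column is the relative drop in mean tells per page from raw to compiled; a negative value is a regression. A minimal sketch of the arithmetic (function name hypothetical):

```python
def reduction_pct(raw_mean: float, compiled_mean: float) -> float:
    """Relative change in tells per page, raw -> compiled.

    Positive: compiled output scores fewer tells (improvement).
    Negative: compiled output scores more tells (regression).
    """
    return round((raw_mean - compiled_mean) / raw_mean * 100, 1)

# Mistral Small: 3.37 -> 2.20 is a 34.7% reduction.
print(reduction_pct(3.37, 2.20))  # 34.7
# claude-opus-4-7 regresses; small differences from the table's -172.9%
# come from computing on rounded rather than per-sample means.
print(reduction_pct(1.60, 4.37))
```

Recomputing from the rounded means in the table reproduces every cell to within a tenth of a point.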
What the triangulation surfaced
The headline is not a number. It is a property of the framework
itself: AHD's compiler and its linter are not coordinated on the
token. The compiler reads post-digital-green and
transmits its constraints faithfully (single monospace face, zero
radius, conservative weight palette, minimal line-height
variety). Models receive those constraints and follow them. The
linter then scores the output against rules that embed
editorial-default assumptions:
require-type-pairing expects two faces,
weight-variety expects more than two weights,
radius-hierarchy expects sharp-vs-soft contrast.
The rules fire on output that was correct for the token it was
built against.
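The mismatch is easiest to see as code. Below is a hedged stand-in for require-type-pairing; the real rule's implementation is not shown in this report, but any rule that counts typefaces and demands two will fire on single-monospace output that is exactly what the token asked for:

```python
import re

def require_type_pairing(css: str) -> bool:
    """Fire (return True) when the page declares fewer than two typefaces.

    Simplified stand-in for the rule described above: it hard-codes the
    editorial-default assumption that a good page pairs two faces.
    """
    # Collect the first (primary) family in each font-family stack.
    stacks = re.findall(r"font-family\s*:\s*([^;}]+)", css)
    families = {s.split(",")[0].strip().strip("\"'") for s in stacks}
    return len(families) < 2

# post-digital-green output is token-correct: one monospace face.
compiled_css = """
body { font-family: "Berkeley Mono", ui-monospace, monospace; }
h1   { font-family: "Berkeley Mono", ui-monospace, monospace; }
"""
print(require_type_pairing(compiled_css))  # True: the rule fires anyway
```

Output that obeyed swiss-editorial (two paired faces) passes the same check, which is why the 22 April run never exposed the coupling.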
The 22 April single-token run could not have exposed this finding.
swiss-editorial and the default lint rules happen
to share conventions; no mismatch triggered. Post-digital-green,
built deliberately against those conventions, makes the mismatch
load-bearing. Triangulation did what triangulation is supposed
to do: surfaced the real limit of the current ruleset.
A second, weaker pattern shows up in the table: the cleaner a model's raw baseline, the harder it regresses under compiled. Claude (raw 1.60), gpt-5.5 (raw 1.03), and gpt-5.4 (raw 1.40) sit at the bottom of the reduction column. Llama 3.3, with the lowest raw baseline at 0.30, regresses too, but from a base where the absolute increase is small. Because the reduction is relative, a tight raw baseline is a small denominator: when the compiler pushes a frontier cell toward AHD's house style and the editorially opinionated rules fire, the same absolute increase in tells reads as a much larger regression.
gpt-5.5
OpenAI released gpt-5.5 during the run window, so the model is included alongside the other frontier cells. Two observations stand out. gpt-5.5 produces the cleanest raw baseline of any frontier model in this run (1.03 tells per page). Its HTML averages roughly 8 KB per sample under both conditions, where gpt-5.4 averages 11 KB raw and 8.5 KB compiled. The new model produces tighter output by default and does not condense further under the compiled prompt. The same brief produces about a quarter less HTML on the new model.
What the compiler does correctly
Open any compiled post-digital-green sample from this run. The
CSS declares the OKLCH green palette, the 80-column char-grid
layout, Berkeley Mono with the standard fallback chain,
border-radius: 0 everywhere, the token-specified
body line-height, and the conservative weight palette. The
compiler transmitted the token to the model. The model
followed. The rendered page is token-faithful.
The mechanical rules confirm this from the lint side.
tracking-per-size drops to 3% or below across the
board under compiled (was as high as 53% raw on Qwen).
line-height-per-size drops dramatically on the
cells that previously failed it. require-named-grid
drops on Mistral, Scout, Gemma, gpt-oss. The rules that measure
typographic hygiene independent of editorial style improve
under compilation. The rules that bake in editorial defaults are
the ones that fire.
Which rules fire on the token-correct output
Two rules dominate the regression. require-type-pairing
fires at 92% to 100% on compiled across nine of eleven cells (the
exceptions: llama-3.3, where it fires at only 30%, and llama-4-scout,
where it fires at 100% under raw and compiled alike).
weight-variety fires at 70% to 100% on
compiled across eight cells. Both rules are editorially
opinionated against the post-digital-green specification, which
explicitly requires a single monospace face and a conservative
weight palette. Compiled output that obeys the token by design
necessarily fails both rules.
A third pattern, less expected:
respect-reduced-motion fires at 27% to 97% on
compiled across six cells (versus 0% to 10% on raw). The
compiled prompt appears to introduce motion declarations that do
not check prefers-reduced-motion. This is a
legitimate finding, not a token mismatch: it points at a gap in
the compiled prompt itself rather than at the lint layer.
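The missing guard is the standard prefers-reduced-motion media query. A sketch of the kind of check respect-reduced-motion plausibly performs (the real rule's logic is not shown in this report; this version is deliberately coarse):

```python
import re

def respects_reduced_motion(css: str) -> bool:
    """Pass when motion is absent or guarded by a reduced-motion query.

    Simplified stand-in for respect-reduced-motion: it only checks that
    a prefers-reduced-motion query exists somewhere alongside any
    animation/transition declaration, not that the guard is correct.
    """
    has_motion = re.search(r"\b(animation|transition)\s*:", css) is not None
    has_guard = "prefers-reduced-motion" in css
    return (not has_motion) or has_guard

unguarded = "a { transition: color 200ms; }"
guarded = unguarded + """
@media (prefers-reduced-motion: reduce) {
  a { transition: none; }
}"""
print(respects_reduced_motion(unguarded))  # False: the rule would fire
print(respects_reduced_motion(guarded))    # True
```

The fix belongs in the compiled prompt, not the linter: any motion the prompt encourages should ship with the guard already attached.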
What this does not invalidate
- The 22 April swiss-editorial result. Rules and token share conventions there. Nine of ten positive reductions stand.
- The taxonomy. The thirty-nine tells are observations about AI-generated design failures, independent of which token tells the model to avoid them.
- The compiler. It is doing its job. The finding is at the lint layer, not at the compiler.
Next engineering step (at original publication)
Token-aware linting. Each shipped token already declares what it
requires (grid, palette, type). It needs a second field that
names the rules it explicitly overrides. The linter reads the
active token from the compiled output's header comment or from a
<meta name="ahd-token"> anchor, and silences
or downgrades rules the token has opted out of. Rules a token
has not opted out of continue to fire normally. The
reduced-motion gap in the compiled prompt is queued behind it.
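The scheme above is essentially data plus a filter. A minimal sketch, assuming a dict-based token manifest and a `content` attribute on the meta anchor (both structural assumptions; the rule and token names follow the report, and this version silences overridden rules rather than downgrading them):

```python
import re

# Hypothetical token manifest: the proposed second field names the
# rules this token explicitly opts out of.
POST_DIGITAL_GREEN = {
    "name": "post-digital-green",
    "overrides": ["require-type-pairing", "weight-variety",
                  "radius-hierarchy"],
}

TOKENS = {t["name"]: t for t in [POST_DIGITAL_GREEN]}

def active_rules(all_rules: list, html: str) -> list:
    """Silence rules the active token opted out of; others fire normally."""
    m = re.search(r'<meta\s+name="ahd-token"\s+content="([^"]+)"', html)
    token = TOKENS.get(m.group(1)) if m else None
    overridden = set(token["overrides"]) if token else set()
    return [r for r in all_rules if r not in overridden]

rules = ["require-type-pairing", "weight-variety", "tracking-per-size"]
page = '<meta name="ahd-token" content="post-digital-green">'
print(active_rules(rules, page))  # ['tracking-per-size']
```

A page with no recognized token anchor keeps the full ruleset, so the default path is unchanged.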
Both items have shipped since publication. The token-aware re-lint addendum at the top of this page documents the verdict shift on the same samples; the reduced-motion guidance now lives in every compiled prompt.
Caveats
- One brief, one token, one surface. The triangulation axis this run tests is the token axis. A different-brief-same-token run is still queued as a separate validity test.
- Source-linter only. Thirty-eight source rules scored here. The fourteen vision rules on rendered pixels are not evaluated in this pipeline; they live in the critic and run on screenshots.
- Tells-per-page is a proxy. A token constrained to rectangles and monospace has less surface for variety-oriented rules to reward; interpret magnitudes against the per-tell table in the canonical report, not in isolation.
- Gemini compiled scored 26 of 30. Four samples returned non-scorable output (three near-empty HTML stubs, one runtime error). This is the same intermittent behaviour observed on the 22 April run.
- Model versions change. Canonical model IDs and serving paths are in the per-sample envelopes and the run manifest.
Canonical report on disk:
docs/evals/2026-04-24-post-digital-green-n30.md.
Run manifest: evals/post-digital-green/manifest.json.
Per-sample viewer (linter annotations on every sample):
/evals/2026-04-24/samples/<cell>/<condition>/<id>.
Sibling run on the other axis:
22 April, swiss-editorial n=30.
How to read the numbers: methodology.
Contribute a run: submission protocol.