AHD Artificial Human Design

AHD · Eval report · 24 April 2026 · cross-provider n=30

Same brief, different token. Eleven models, thirty samples each.

This is the different-token, same-brief triangulation that the 22 April report called out in its caveats: same briefs/landing.yml, same n=30 per cell, same two-condition pairing. One variable changed: the style token. Where the 22 April run used swiss-editorial (paper-and-ink Helvetica, default editorial conventions), post-digital-green is its opposite: single-monospace terminal aesthetic, OKLCH green palette, char-grid layout, rectangles only, conservative weight palette, no display type. It is an external-validity test against a token that rejects the default design conventions every lint rule was written around. Eight of eleven cells regress under the compiled prompt. The mechanism is not a compiler defect.

Per-model result (pre-fix, original measurement)

Model | Provider · path | Raw → scored | Compiled → scored | Raw mean | Compiled mean | Reduction
@cf/mistralai/mistral-small-3.1-24b-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.37 | 2.20 | 34.7%
@cf/openai/gpt-oss-120b | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.03 | 2.43 | 19.8%
@cf/meta/llama-4-scout-17b-16e-instruct | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.10 | 2.00 | 4.8%
gemini-3.1-pro-preview | Google · Gemini CLI | 30 → 30 | 30 → 26 | 2.80 | 3.08 | −9.9%
@cf/google/gemma-4-26b-a4b-it | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 2.53 | 2.80 | −10.5%
@cf/moonshotai/kimi-k2.6 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 3.00 | 3.43 | −14.4%
@cf/qwen/qwen3-30b-a3b-fp8 | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 1.70 | 2.00 | −17.6%
gpt-5.4 | OpenAI · Codex CLI | 30 → 30 | 30 → 30 | 1.40 | 2.50 | −78.6%
@cf/meta/llama-3.3-70b-instruct-fp8-fast | Cloudflare Workers AI | 30 → 30 | 30 → 30 | 0.30 | 0.60 | −100.0%
gpt-5.5 | OpenAI · Codex CLI | 30 → 30 | 30 → 30 | 1.03 | 2.43 | −135.5%
claude-opus-4-7 | Anthropic · Claude Code CLI | 30 → 30 | 30 → 30 | 1.60 | 4.37 | −172.9%

What the triangulation surfaced

The headline is not a number. It is a property of the framework itself: AHD's compiler and its linter are not coordinated on the token. The compiler reads post-digital-green and transmits its constraints faithfully (single monospace face, zero radius, conservative weight palette, minimal line-height variety). Models receive those constraints and follow them. The linter then scores the output against rules that embed editorial-default assumptions: require-type-pairing expects two faces, weight-variety expects more than two weights, radius-hierarchy expects sharp-vs-soft contrast. The rules fire on output that was correct for the token it was built against.
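The mechanism can be sketched in a few lines. This is an illustrative reconstruction, not the AHD linter's actual implementation: the function names and the regex-based CSS scan are assumptions, chosen only to show how a rule like require-type-pairing embeds the two-face editorial default and therefore fires on token-faithful single-monospace output.

```python
import re

def font_families(css: str) -> set[str]:
    """Collect the distinct first-choice families from font-family declarations."""
    families = set()
    for decl in re.findall(r"font-family\s*:\s*([^;}]+)", css):
        families.add(decl.split(",")[0].strip().strip("'\""))
    return families

def require_type_pairing(css: str) -> bool:
    """Fires (returns True) when the page uses fewer than two faces --
    the editorial-default assumption baked into the rule."""
    return len(font_families(css)) < 2

# A token-faithful post-digital-green page: one monospace face everywhere.
token_correct = (
    "body { font-family: 'Berkeley Mono', monospace; } "
    "h1 { font-family: 'Berkeley Mono', monospace; }"
)
print(require_type_pairing(token_correct))  # → True: the rule fires on correct output
```

An editorial-default page with a serif/sans pairing passes the same check; the rule is sound for swiss-editorial and wrong for post-digital-green, which is exactly the coordination gap described above.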

This finding is one the 22 April single-token run could not expose: swiss-editorial and the default lint rules happen to share conventions, so no mismatch triggered. post-digital-green, built deliberately against those conventions, makes the mismatch load-bearing. Triangulation did what triangulation is supposed to do: it surfaced the real limit of the current ruleset.

A second, weaker pattern shows up in the table: the cleaner the model's raw baseline, the harder it regresses under compiled. Claude (raw 1.60), gpt-5.5 (raw 1.03), and gpt-5.4 (raw 1.40) sit at the bottom of the reduction column. Llama 3.3, with the lowest raw baseline at 0.30, also regresses, but from a base where the absolute increase is small. The frontier cells' tighter raw output gives the editorially opinionated rules more relative ground to lose when the compiler pushes them toward AHD's house style.
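The reduction column appears to be the relative change in mean tells per page, (raw − compiled) / raw. A quick check against the table (the formula is inferred from the published numbers; the one- or two-tenths discrepancies on some rows are consistent with the displayed means being rounded):

```python
def reduction_pct(raw_mean: float, compiled_mean: float) -> float:
    """Percent reduction in tells per page; negative means a regression."""
    return round((raw_mean - compiled_mean) / raw_mean * 100, 1)

print(reduction_pct(3.37, 2.20))  # mistral-small: 34.7
print(reduction_pct(0.30, 0.60))  # llama-3.3: -100.0
print(reduction_pct(1.70, 2.00))  # qwen3: -17.6
```

Under this reading, a cell whose compiled mean merely doubles a tiny raw mean (llama-3.3) posts the same −100% as a far larger absolute jump would on a bigger base, which is why the paragraph above separates relative and absolute regressions.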

gpt-5.5

OpenAI released gpt-5.5 during the run window, so the model is included alongside the other frontier cells. Two observations stand out. First, gpt-5.5 produces the cleanest raw baseline of any frontier model in this run (1.03 tells per page). Second, its HTML averages roughly 8 KB per sample under both conditions, where gpt-5.4 averages 11 KB raw and 8.5 KB compiled: the new model produces tighter output by default and does not condense further under the compiled prompt. The same brief yields about a quarter less HTML on the new model.

What the compiler does correctly

Open any compiled post-digital-green sample from this run. The CSS declares the OKLCH green palette, the 80-column char-grid layout, Berkeley Mono with the standard fallback chain, border-radius: 0 everywhere, the token-specified body line-height, and the conservative weight palette. The compiler transmitted the token to the model. The model followed. The rendered page is token-faithful.

The mechanical rules confirm this from the lint side. tracking-per-size drops to 3% or below across the board under compiled (was as high as 53% raw on Qwen). line-height-per-size drops dramatically on the cells that previously failed it. require-named-grid drops on Mistral, Scout, Gemma, gpt-oss. The rules that measure typographic hygiene independent of editorial style improve under compilation. The rules that bake in editorial defaults are the ones that fire.

Which rules fire on the token-correct output

Two rules dominate the regression. require-type-pairing fires at 92% to 100% on compiled across nine of eleven cells (the exceptions are llama-3.3 at 30% and llama-4-scout at 100% raw and compiled). weight-variety fires at 70% to 100% on compiled across eight cells. Both rules are editorially opinionated against the post-digital-green specification, which explicitly requires a single monospace face and a conservative weight palette. Compiled output that obeys the token by design necessarily fails both rules.

A third pattern, less expected: respect-reduced-motion fires at 27% to 97% on compiled across six cells (versus 0% to 10% on raw). The compiled prompt appears to introduce motion declarations that do not check prefers-reduced-motion. This is a legitimate finding, not a token mismatch: it points at a gap in the compiled prompt itself rather than at the lint layer.

What this does not invalidate

Next engineering step (at original publication)

Token-aware linting. Each shipped token already declares what it requires (grid, palette, type). It needs a second field that names the rules it explicitly overrides. The linter reads the active token from the compiled output's header comment or from a <meta name="ahd-token"> anchor, and silences or downgrades rules the token has opted out of. Rules a token has not opted out of continue to fire normally. The reduced-motion gap in the compiled prompt is queued behind it.
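The filtering step described above can be sketched as follows. The override field's contents and the rule names per token are assumptions for illustration (the source confirms only that post-digital-green conflicts with require-type-pairing, weight-variety, and radius-hierarchy); the shipped implementation may structure this differently.

```python
# Hypothetical per-token opt-out sets, as the proposed second token field might
# declare them. swiss-editorial shares the editorial defaults, so nothing is silenced.
TOKEN_OVERRIDES: dict[str, set[str]] = {
    "post-digital-green": {"require-type-pairing", "weight-variety", "radius-hierarchy"},
    "swiss-editorial": set(),
}

def active_findings(findings: dict[str, bool], token: str) -> dict[str, bool]:
    """Drop findings for rules the active token has explicitly opted out of;
    rules the token has not opted out of continue to fire normally."""
    silenced = TOKEN_OVERRIDES.get(token, set())
    return {rule: fired for rule, fired in findings.items() if rule not in silenced}

sample = {
    "require-type-pairing": True,     # token-correct single face tripped it
    "tracking-per-size": False,       # mechanical rule, passed
    "respect-reduced-motion": True,   # legitimate finding, must survive
}
print(active_findings(sample, "post-digital-green"))
# {'tracking-per-size': False, 'respect-reduced-motion': True}
```

Note that the reduced-motion finding survives the filter under either token: token awareness silences only editorially opinionated rules the token has named, not accessibility checks.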

Both items have shipped since publication. The token-aware re-lint addendum at the top of this page documents the verdict shift on the same samples; the reduced-motion guidance now lives in every compiled prompt.

Caveats


Canonical report on disk: docs/evals/2026-04-24-post-digital-green-n30.md. Run manifest: evals/post-digital-green/manifest.json. Per-sample viewer (linter annotations on every sample): /evals/2026-04-24/samples/<cell>/<condition>/<id>. Sibling run on the other axis: 22 April, swiss-editorial n=30. How to read the numbers: methodology. Contribute a run: submission protocol.