AHD · Evals
Every run.
One row per AHD eval. Each row links to its own dated report with the full manifest, per-cell attempted/extracted/scored counts, and per-tell frequency tables. Every claim on this site traces to one of these reports.
-
Ten models · One brief · Thirty samples each
Ten models, n=30 per cell in each condition, 600 samples in total. Eight of ten cells showed positive reductions under the compiled prompt, one flat, one regression. Best: gpt-oss-120b at 78.1% fewer tells. Three frontier cells via subscription CLIs (Claude Code, Codex, Gemini CLI), seven OSS via Cloudflare Workers AI. The Wilson interval tightens from roughly ±35% at n=5 to roughly ±18% at n=30.
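For scale on those intervals, a minimal sketch of the Wilson half-width arithmetic, assuming a 95% interval (z = 1.96) and an observed tell rate near 0.5; the exact ± figures in any cell depend on that cell's observed rate.

```python
import math

def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return spread / denom

# Near a mid-range rate the interval narrows from about ±0.33 at n=5
# to about ±0.17 at n=30, consistent with the figures quoted above.
print(round(wilson_half_width(0.5, 5), 2))   # ~0.33
print(round(wilson_half_width(0.5, 30), 2))  # ~0.17
```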
-
Seven models across four providers
Four positive reductions, one inconclusive, two regressions. Llama 3.3's regression reproduces across Cloudflare and Hugging Face, turning a single-cell finding into a cross-provider result.
-
Five models · n=5 · Zero errors
Claude Opus (Anthropic API) plus four OSS models on Cloudflare Workers AI. Claude dropped to zero tells under the compiled prompt; Llama 3.3 70B regressed. Full per-model and per-tell breakdown with every attempted-vs-scored count published.
What's in each report
Every per-run page carries the same surface: a summary table of raw versus compiled tells per cell, the attempted/extracted/scored column triple for every cell, a per-tell frequency table showing which rules fired in which conditions, the exact brief and compiled-prompt bytes used, and the run manifest with canonical model identifiers and serving paths. Reading a row on this index gives you the headline; clicking into the report gives you the receipts.
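As a rough sketch of that surface, one cell of a report might carry fields like the following; the names here are illustrative, not the report's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CellResult:
    model: str             # canonical model identifier from the manifest
    serving_path: str      # provider/host the model was served through
    attempted: int         # samples requested
    extracted: int         # samples that yielded a parseable artifact
    scored: int            # samples that made it into scoring
    raw_tells: int         # tells counted under the raw brief
    compiled_tells: int    # tells counted under the compiled prompt
    per_tell: dict[str, int] = field(default_factory=dict)  # rule -> fire count
```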
What a run does not carry
A run is not a leaderboard. A run is evidence for a specific brief under a specific token against a specific set of models. Different briefs and different tokens produce different rankings. If you want to generalise from any one row to "model X is better than model Y at design," that generalisation is on you, not on us. The frame is methodology. Every cell names its serving path because a model served by two hosts is two targets.
Contribute a run
If you have budget, keys, and a model we haven't measured, you can add to this record. The short version of what a submittable run looks like (full manifest, provider request-IDs, n ≥ 3, no post-processing, negative results reported) lives on the submission page. The full protocol lives in the framework repo's CONTRIBUTING.md.
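A hypothetical pre-flight check against that short version; the field names are illustrative, and CONTRIBUTING.md is the authoritative checklist.

```python
def submission_problems(run: dict) -> list[str]:
    """Return reasons a run is not yet submittable; an empty list means ready."""
    problems = []
    if not run.get("manifest"):
        problems.append("missing full manifest")
    if not run.get("provider_request_ids"):
        problems.append("missing provider request-IDs")
    if run.get("n_per_cell", 0) < 3:
        problems.append("n per cell below 3")
    if run.get("post_processed", False):
        problems.append("outputs were post-processed")
    # Negative results are reported, not filtered; that is a norm to follow,
    # not something a script can verify.
    return problems
```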
Adjacent: methodology, positioning, the taxonomy, contribute a run.