AHD · Evals
Every run.
One row per AHD eval. Each row links to its own dated report with the full manifest, per-cell attempted/extracted/scored counts, and per-tell frequency tables. Every claim on this site traces to one of these reports.
-
Ten models · One brief · Thirty samples each
Ten models, n=30 per cell in each condition, 600 samples in total. Eight of ten cells showed positive reductions under the compiled prompt, one flat, one regression. Best: gpt-oss-120b at 78.1% fewer tells. Three frontier cells via subscription CLIs (Claude Code, Codex, Gemini CLI), seven OSS via Cloudflare Workers AI. The Wilson interval tightens from roughly ±35% at n=5 to roughly ±18% at n=30.
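For scale on those intervals, a minimal sketch of the Wilson half-width arithmetic, assuming a 95% interval (z = 1.96) and an observed tell rate near 0.5; the exact ± figures in any cell depend on that cell's observed rate.

```python
import math

def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return spread / denom

# Near a mid-range rate the interval narrows from about ±0.33 at n=5
# to about ±0.17 at n=30, consistent with the figures quoted above.
print(round(wilson_half_width(0.5, 5), 2))   # ~0.33
print(round(wilson_half_width(0.5, 30), 2))  # ~0.17
```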
-
Seven models across four providers
Four positive reductions, one inconclusive, two regressions. Llama 3.3's regression reproduces across Cloudflare and Hugging Face, turning a single-cell finding into a cross-provider result.
-
Five models · n=5 · Zero errors
Claude Opus (Anthropic API) plus four OSS models on Cloudflare Workers AI. Claude dropped to zero tells under the compiled prompt; Llama 3.3 70B regressed. Full per-model and per-tell breakdown with every attempted-vs-scored count published.
What's in each report
Every per-run page carries the same surface: a summary table of raw versus compiled tells per cell, the attempted/extracted/scored column triple for every cell, a per-tell frequency table showing which rules fired in which conditions, the exact brief and compiled-prompt bytes used, and the run manifest with canonical model identifiers and serving paths. Reading a row on this index gives you the headline; clicking into the report gives you the receipts.
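As a rough sketch of that surface, one cell of a report might carry fields like the following; the names here are illustrative, not the report's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CellResult:
    model: str             # canonical model identifier from the manifest
    serving_path: str      # provider/host the model was served through
    attempted: int         # samples requested
    extracted: int         # samples that yielded a parseable artifact
    scored: int            # samples that made it into scoring
    raw_tells: int         # tells counted under the raw brief
    compiled_tells: int    # tells counted under the compiled prompt
    per_tell: dict[str, int] = field(default_factory=dict)  # rule -> fire count
```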
What a run does not carry
A run is not a leaderboard. A run is evidence for a specific brief under a specific token against a specific set of models. Different briefs and different tokens produce different rankings. If you want to generalise from any one row to "model X is better than model Y at design," that generalisation is on you, not on us. The frame is methodology. Every cell names its serving path because a model served by two hosts is two targets.
Contribute a run
If you have budget, keys, and a model we haven't measured, you can add to this record. The short version of what a submittable run looks like (full manifest, provider request-IDs, n ≥ 3, no post-processing, negative results reported) lives on the submission page. The full protocol lives in the framework repo's CONTRIBUTING.md.
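A hypothetical pre-flight check against that short version; the field names are illustrative, and CONTRIBUTING.md is the authoritative checklist.

```python
def submission_problems(run: dict) -> list[str]:
    """Return reasons a run is not yet submittable; an empty list means ready."""
    problems = []
    if not run.get("manifest"):
        problems.append("missing full manifest")
    if not run.get("provider_request_ids"):
        problems.append("missing provider request-IDs")
    if run.get("n_per_cell", 0) < 3:
        problems.append("n per cell below 3")
    if run.get("post_processed", False):
        problems.append("outputs were post-processed")
    # Negative results are reported, not filtered; that is a norm to follow,
    # not something a script can verify.
    return problems
```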
Adjacent: methodology, positioning, the taxonomy, contribute a run.