AHD Artificial Human Design

AHD · Eval report · 21 April 2026

swiss-editorial · n=5 per cell.

Same brief, same seed, raw versus AHD-compiled, five models in parallel: fifty samples total, zero errors. Aggregate mean slop tells per page fell from 2.08 to 1.04. Three models improved, one held flat, one regressed.

Run manifest

Models (canonical identifiers, preserved from evals/<token>/manifest.json): claude-opus-4-7, @cf/mistralai/mistral-small-3.1-24b-instruct, @cf/meta/llama-4-scout-17b-16e-instruct, @cf/qwen/qwen2.5-coder-32b-instruct, @cf/meta/llama-3.3-70b-instruct-fp8-fast.

Per-model slop reduction

| Model | Raw → scored | Compiled → scored | Raw mean | Compiled mean | Δ | Reduction |
| --- | --- | --- | --- | --- | --- | --- |
| claude-opus-4-7 | 5 → 5 | 5 → 5 | 1.20 | 0.00 | 1.20 | 100% |
| @cf/mistralai/mistral-small-3.1-24b-instruct | 5 → 4 | 5 → 4 | 3.25 | 1.25 | 2.00 | 62% |
| @cf/meta/llama-4-scout-17b-16e-instruct | 5 → 5 | 5 → 5 | 2.40 | 1.20 | 1.20 | 50% |
| @cf/qwen/qwen2.5-coder-32b-instruct | 5 → 5 | 5 → 5 | 2.80 | 2.80 | 0.00 | 0% |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | 5 → 5 | 5 → 5 | 0.40 | 1.00 | −0.60 | −150% |
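The Δ and Reduction columns are derived directly from the two means. A minimal sketch of that arithmetic (means copied from the table above; this is illustrative, not the eval pipeline's code):

```python
# (raw mean, compiled mean) slop tells per page, from the table above
means = {
    "claude-opus-4-7": (1.20, 0.00),
    "@cf/mistralai/mistral-small-3.1-24b-instruct": (3.25, 1.25),
    "@cf/meta/llama-4-scout-17b-16e-instruct": (2.40, 1.20),
    "@cf/qwen/qwen2.5-coder-32b-instruct": (2.80, 2.80),
    "@cf/meta/llama-3.3-70b-instruct-fp8-fast": (0.40, 1.00),
}

for model, (raw, compiled) in means.items():
    delta = raw - compiled                       # positive = compiled is cleaner
    reduction = 100 * delta / raw if raw else 0.0
    print(f"{model}: Δ={delta:+.2f}, reduction={reduction:.0f}%")
```

Note that a negative reduction (Llama 3.3 70B) means the compiled condition tripped more rules than the raw one.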

Reading this table

The five percentages indicate direction, not precision. With n=5 per cell, the Wilson confidence interval on each per-model number is roughly plus or minus thirty-five percentage points. A larger run is on the roadmap. See how we measure for the full methodology discussion.
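That plus-or-minus figure can be checked directly. A standalone sketch of the Wilson score interval's half-width at n=5, worst case p = 0.5 (not the pipeline's code):

```python
import math

def wilson_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the 95% Wilson score interval for a proportion."""
    denom = 1 + z**2 / n
    return (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))

# Five samples per cell, worst-case proportion:
print(f"±{wilson_halfwidth(0.5, 5):.0%}")  # ≈ ±33 percentage points
```

At n=5 the interval is about 33 points to either side, which is why the per-model percentages should be read as direction only.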

Mistral's extraction failures — one of five samples in each condition — are visible in the raw → scored and compiled → scored columns. The mean tells are computed over the scored subset; the gap between attempted and scored is a signal of model compliance, not a silent drop.

What the framework exposes

Three models responded to the compiled brief — Claude dropped to zero tells, Mistral cut slop by sixty-two percent, Scout by fifty. These are the cases where AHD's system prompt moved the output.

Qwen 2.5 Coder did not move. The compiled system prompt produced output that trips the same rules at the same rate as the raw prompt. Qwen is a code-specialised model that appears to hold its defaults across prompt changes on this brief.

Llama 3.3 70B regressed under the compiled prompt. Its raw output is a typographically thin page that tripped almost no rules; the compiled output is richer and ambitious enough to trip more. This is not the framework failing; it is the framework correctly reporting that the compiled layer does not help every model.

Caveats

Scoring runs the deterministic AHD linter over every sample that passes a basic HTML sanity check. Samples that fail (empty responses, reasoning-only outputs) are dropped from scoring and recorded as extraction failures, visible as the attempted → scored gap in the table. Reasoning-model <think> blocks are stripped before extraction.

Raw condition: the brief is expanded as plain prose — intent, audience, surfaces, must-include and must-avoid lists — with no AHD system prompt, no style token, no forbidden list. Compiled condition: the same brief plus the AHD-compiled system prompt. The only thing that differs between conditions is the AHD intervention.

Vision-only tells are not scored in this pipeline. A partial vision-critic pass covering twenty-one of forty-eight screenshots ran the same day; see the methodology for a summary. Rate-limit retry has since been added to the critic and a full vision pass is on the roadmap.

Tells-per-page is a proxy metric. A thin page has little surface for rules to fire against. Read the delta alongside the actual rendered HTML, not in isolation. The raw samples are linkable from the run manifest in the framework repository.


The canonical report lives in docs/evals/2026-04-21-swiss.md of the framework repository. This page is the reader view. Adjacent: how we measure, the taxonomy.