AHD · Methodology
How we measure.
The headline numbers on the home page deserve a plain-English explanation. This is that.
The pairing
Every measured run takes one brief, resolves it against one style token, and runs it through a set of models in two conditions. The raw condition gives the model the brief as prose: the intent, the audience, the surfaces, the things the brief explicitly wants and the things it explicitly bans. Nothing else. No style token, no AHD system prompt, no forbidden list.
The compiled condition gives the model the same brief, in the same words, plus the AHD system prompt for that token. The system prompt names the style direction (grid, typography, palette, motion policy), names the forbidden patterns from the taxonomy, and asks the model to cite the rule it is following in an inline comment per decision.
The only thing that differs between raw and compiled is the AHD system-prompt layer. Everything else is held constant. This is the controlled comparison the framework exists to support.
A cell
A cell is a specific combination of one model and one condition. The 21 April 2026 run used five models and two conditions, so ten cells. Each cell received five samples, written as n=5 per cell. Five models times two conditions times five samples is fifty generated HTML pages. Each was linted. The tell counts were averaged per cell. The delta is the per-model difference between the raw-condition mean and the compiled-condition mean.
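The per-cell arithmetic is simple enough to show directly. The tell counts below are invented for illustration; only the procedure (mean per cell, then raw mean minus compiled mean) comes from the text above:

```python
from statistics import mean

# Hypothetical per-sample tell counts for one model, n=5 per cell.
raw_tells      = [6, 4, 7, 5, 5]
compiled_tells = [2, 1, 3, 1, 2]

raw_mean      = mean(raw_tells)           # 5.4 tells per page
compiled_mean = mean(compiled_tells)      # 1.8 tells per page
delta         = raw_mean - compiled_mean  # 3.6 fewer tells under compiled
```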
Why n=5 is under-powered, honestly
With five samples per cell, the statistical precision is poor. If Mistral's compiled condition averaged 1.25 tells per page, the true underlying average could plausibly sit anywhere from about 0.4 to 2.1. For the per-model percentages on the home page, the equivalent Wilson interval at n=5 is roughly plus or minus thirty-five percentage points. Those numbers point at directions, not at precision.
A credible benchmark run needs n greater than or equal to thirty samples per cell. Thirty shrinks the interval from roughly a third of the range to roughly a tenth, which is the point where a number is quotable in a serious setting rather than merely directional. The substrate to run n=30 is ready; what remains unspent is the API budget. A larger run is on the roadmap as a budget decision.
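The width of the Wilson interval mentioned above can be checked numerically. This is a generic textbook formula, not the framework's code; the worst case is p̂ = 0.5, where the interval is widest:

```python
from math import sqrt

def wilson_halfwidth(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a proportion."""
    z2 = z * z
    return (z / (1 + z2 / n)) * sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))

hw5  = wilson_halfwidth(0.5, 5)   # ≈ 0.33, i.e. about a third of the range
hw30 = wilson_halfwidth(0.5, 30)  # ≈ 0.17 at the worst case; tighter at moderate p
```

At n=5 even the best estimate leaves the true rate in a window a third of the range wide; n=30 cuts that roughly in half at worst case and to around a tenth at moderate observed rates.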
Attempted, extracted, scored
Every published eval report lists four counts per cell: attempted (runs initiated), errored (API or runtime failures), extraction failed (the response contained no usable HTML), and scored (samples that actually reached the linter). A large gap between attempted and scored is a signal that a model is struggling with the instruction, not that it passed the taxonomy.
An earlier version of the runner silently dropped failed samples and reported only the scored count. We changed it because survivorship bias of that shape made the headline flattering for the wrong reasons. The counts are now reported separately for exactly that reason.
Mean tells per page is a proxy, not a verdict
Fewer tells per page is not identical to better design. A page with almost no content has nothing for the linter to fire on and will score near zero regardless of intent. A page with genuine ambition exposes more decision surface and can legitimately trip more rules than a thin page. Tells-per-page is a useful proxy for aggregate slop fingerprint, not a single-sample judgement call. Always read a per-cell number alongside the rendered output, not in isolation. The published reports link to the samples.
The linter as scorer
The scoring engine is the same deterministic ruleset that ships with the ahd lint CLI. Twenty-eight rules decide from HTML and CSS. Three decide from SVG. Thirteen vision-only rules live behind the Anthropic-multimodal critic and only run when we've rendered the sample. A per-release inventory of exactly which rules ran is stamped into the report header so an older run remains auditable after the ruleset changes.
Negative results are first-class
The home page states the Llama 3.3 70B cell regressed one hundred and fifty percent and that Qwen 2.5 Coder 32B did not move. The SDXL image cell ignored the compiled negative entirely. We publish these because a framework that only surfaced wins would be ornamental, and because the comparative shape of who-benefits is more actionable than any single aggregate number. If a model does not improve under the compiled brief for a given token, the correct move is to route around that model for that token, not to blame the framework.
What compiled does and does not add
The compiled system prompt is a structured document. The style direction comes from the token's prose fragment. The forbidden list is drawn from the token's forbidden: array merged with any brief-level mustAvoid. The required quirks come from the token's required-quirks array. The full spec is also serialised to JSON and included so a model that responds better to structured input has it available. The final working rules ask the model to cite the rule it follows per decision, to return a single HTML output (in mode: final, used by the eval), and to favour subtraction when in doubt.
What the compiled prompt does not add: API keys, per-model tuning, any hidden example shots, or any automation that silently rejects and regenerates a bad response. The runner calls the model once per sample. If the response is bad, it shows up bad in the report.
Reproducing a run
Every measurement published on this site points at a dated report in docs/evals/ in the repository. Each report carries its run manifest, which records the exact brief path, the exact model specifications (including canonical model identifiers like @cf/mistralai/mistral-small-3.1-24b-instruct, not just 'Mistral'), the n per cell, and the ISO timestamp. Given the manifest and a current version of the framework, the run is reproducible in one command:

ahd eval-live <token> --brief <brief.yml> \
  --models <specs> --n <N> \
  --report docs/evals/<date>-<token>.md

Model versions do change, so an exact re-run against the same canonical identifier is only guaranteed to be close, not identical. The manifest records the identifier so that any drift between runs can be attributed honestly; that transparency is what the manifest is for.
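For a sense of the manifest's shape, here is a hypothetical sketch; the field names and the brief path are illustrative, not the framework's actual schema, and only the recorded facts (brief path, canonical model identifiers, n per cell, ISO timestamp) come from the text:

```python
import json

manifest = {
    "brief": "briefs/example.yml",  # hypothetical path
    "models": ["@cf/mistralai/mistral-small-3.1-24b-instruct"],
    "n_per_cell": 5,
    "timestamp": "2026-04-21T00:00:00Z",
}
print(json.dumps(manifest, indent=2))
```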
Written on 21 April 2026. Updated as the methodology evolves. For the list of rules the linter actually enforces, see the taxonomy. For the code, see the framework repository.