
AHD · Artificial Human Design

Make it specific.

A guardrail and evaluation layer for AI-generated design. A named taxonomy of thirty-nine slop tells, a token-driven brief compiler, a deterministic linter, and a reproducible raw-vs-compiled eval loop, across web UI and image generation.

npm install --save-dev @adastracomputing/ahd

Or try it without installing:

npx @adastracomputing/ahd lint page.html

Full setup, npm, source.


Measured · 24 April 2026 · cross-token triangulation

Same brief, different style token (post-digital-green), eleven models, n=30 per cell, six hundred sixty samples. The first triangulation surfaced a real limit in the rule design: editorially-opinionated rules fired on output that was correct for a token they were not written for. AHD shipped token-aware linting in response. Re-linting the same samples under the corrected ruleset moves six of eleven cells positive, with gpt-oss leading at 47.6 percent reduction. Click any row for the post-fix per-cell reading.
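
How the token-aware pass works, as a minimal sketch: an editorially-opinionated rule stays silent when the active style token declares the pattern it targets as a required quirk. The shapes below are illustrative only, not the package's actual rule or token schema.

interface Rule {
  id: string;
  editoriallyOpinionated: boolean;   // fires on taste, not on defects
  targets: string;                   // the pattern the rule penalises
}

interface StyleToken {
  name: string;                      // e.g. "post-digital-green"
  requiredQuirks: string[];          // patterns the token declares intentional
}

// Token-aware filtering: opinionated rules are dropped when the active
// token explicitly asks for the pattern they would otherwise flag.
function activeRules(rules: Rule[], token: StyleToken): Rule[] {
  return rules.filter(
    (rule) =>
      !rule.editoriallyOpinionated ||
      !token.requiredQuirks.includes(rule.targets)
  );
}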

gpt-oss 120B
48%↓

Best reduction in the run after token-aware lint. 1.40 raw mean tells dropped to 0.73 compiled, 30 of 30 scored. The cell led the pre-fix reading too (+19.8% pre-fix, +47.6% post-fix): the post-digital-green-correct output that the old ruleset under-credited now scores fully. Served by Cloudflare Workers AI.

Gemma 4 26B
50%↓

Flipped sign under token-aware lint: pre-fix −10.5%, post-fix +50.0%. 0.87 raw mean tells dropped to 0.43 compiled, 30 of 30 scored. Three of the four sign-flippers (gemma, kimi, gemini) cluster at the OSS frontier; silencing the editorially-opinionated rules unblocked their compiled output.

Kimi K2.6
30%↓

Flipped sign: pre-fix −14.4%, post-fix +30.5%. 1.97 raw mean tells to 1.37 compiled, 30 of 30 scored after the chat-template fix (Kimi ships with thinking-mode on by default, which exhausted the token budget). The serving fix lives in the serving-tells catalog.

Gemini 3.1 Pro Preview
26%↓

Flipped sign: pre-fix −9.9%, post-fix +26.3%. 1.20 raw mean tells to 0.88 compiled, 26 of 30 scored (the same intermittent four-of-thirty no-output behaviour observed on the 22 April run). Served via Gemini CLI.

Mistral Small 3.1
22%↓

Stays positive in both readings: pre-fix +34.7%, post-fix +21.7%. 1.53 raw mean tells to 1.20 compiled, 30 of 30 scored. The token-aware re-lint trims the headline margin but leaves the verdict unchanged: compilation helps this cell.

Claude Opus 4.7
regressed 68%↑

Pre-fix −172.9%, post-fix −67.6%: sixty percent of the regression closed by token-aware lint, but the cell still scores worse compiled than raw. 1.13 raw mean tells rose to 1.90 compiled. The remaining gap is rules outside the suppression list firing on Claude's compiled output; either the model or the compiled prompt has more work to do here. Served via Claude Code CLI.

gpt-5.5
regressed 9%↑

Released by OpenAI during the run window and tested inline. Pre-fix −135.5%, post-fix −9.1%: a near-closed regression. 0.37 raw mean tells to 0.40 compiled, 30 of 30 scored, the cleanest raw frontier baseline measured to date. The new model produces tighter HTML by default (~8 KB per sample versus gpt-5.4's 11 KB raw). Served via Codex CLI on ChatGPT auth.

gpt-5.4
regressed 36%↑

Pre-fix −78.6%, post-fix −36.4%: half the regression closed. 0.37 raw mean tells to 0.50 compiled, 30 of 30 scored. Same pattern as Claude: rules outside the suppression list still firing. Served via Codex CLI.

Eight cells shown. Three cells with very low absolute baselines (Llama 3.3, Llama 4 Scout, Qwen3 30B) show numerically large percentage moves on tiny absolute changes; full eleven-cell breakdown plus the pre-fix-versus-post-fix table lives at eval · 24 April 2026.
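
Each headline percentage is the relative change in mean tells from raw to compiled for one cell. A minimal sketch of that arithmetic; the second call uses made-up low-baseline numbers to show why tiny absolute baselines inflate the percentage move.

// Percent reduction for one model x token cell. Positive = compilation helped.
function percentReduction(rawMeanTells: number, compiledMeanTells: number): number {
  return ((rawMeanTells - compiledMeanTells) / rawMeanTells) * 100;
}

percentReduction(1.40, 0.73);   // about 48, on the rounded means from the gpt-oss cell above
percentReduction(0.10, 0.20);   // -100: illustrative low-baseline cell where a small rise reads as a huge move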


Measured · 22 April 2026 · single-token n=30

Same brief, raw versus AHD-compiled, ten models, n=30 per cell, six hundred samples. Eight of ten cells reduce tells. Median reduction 59 percent across the positive cells. Click any row for the per-model reading.

gpt-oss 120B
78%↓

Best reduction in the run. 3.50 raw mean tells dropped to 0.77 compiled, 30 of 30 samples scored in both conditions. The compiled prompt moves this OSS model decisively off its median without inducing new tells. Served by Cloudflare Workers AI.

Mistral Small 3.1
62%↓

3.47 raw mean tells dropped to 1.30 compiled, 30 of 30 scored in both conditions. Matches the n=5 signal at much tighter confidence. The bento-and-gradient raw output collapses toward the Swiss editorial token under the compiled system prompt.

Kimi K2.6
62%↓

2.67 raw mean tells dropped to 1.00 compiled, 30 of 30 scored. The cell required a chat-template fix first: Kimi K2.6 on Cloudflare ships with thinking-mode on by default, and a 9 KB system prompt exhausted the token budget before any visible output. After patching the runner to pass thinking: false, the cell ran clean. This is a serving-layer defect documented in the serving-tells catalog, separate from the design-slop taxonomy.
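
A sketch of what that runner patch can look like against the Workers AI REST endpoint. The model slug and the exact placement of the thinking flag in the request body are assumptions here; only the thinking: false override itself comes from the run.

const KIMI_MODEL = "@cf/placeholder/kimi-k2.6";   // placeholder slug, not the real catalog name

async function generateSample(
  systemPrompt: string,                  // the ~9 KB compiled prompt
  brief: string,
  env: { ACCOUNT_ID: string; API_TOKEN: string }
): Promise<unknown> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/ai/run/${KIMI_MODEL}`;
  const res = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: brief },
      ],
      thinking: false,                   // the fix: stop the default thinking pass from eating the budget
    }),
  });
  return res.json();
}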

Gemini 3.1 Pro Preview
62%↓

2.97 raw mean tells dropped to 1.13 compiled, 30 of 30 scored. Served via Gemini CLI, which is the path most humans actually use for this model today, so the CLI measurement is more ecologically valid than the raw HTTP API would be.

Claude Opus 4.7
59%↓

1.80 raw mean tells dropped to 0.73 compiled, 30 of 30 scored. Served via Claude Code CLI. The n=30 number is tighter but lower than the n=5 reading, which reported 100 percent reduction at ±35-point uncertainty; 59 percent is the better-grounded figure. Same behaviour, measured to a resolution we can now trust.

Llama 3.3 70B
regressed 117%↑

0.28 raw mean tells rose to 0.60 compiled. This reproduces the same-direction regression measured at n=5 on both Cloudflare and Hugging Face serving paths in the 21 April cross-provider run. Llama 3.3's raw output is typographically thin; the compiled brief elicits a richer page with more decision surface, which trips more rules. Framework response: on an editorial brief, do not route to this model.
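
In a caller, that routing response reduces to a guard like the one below. The names and the avoid-list are illustrative; AHD does not ship this function.

// Hypothetical routing guard: keep models with a measured same-direction
// regression off editorial briefs.
const AVOID_ON_EDITORIAL = new Set(["llama-3.3-70b"]);

function routeModels(briefStyle: string, candidates: string[]): string[] {
  return briefStyle === "editorial"
    ? candidates.filter((model) => !AVOID_ON_EDITORIAL.has(model))
    : candidates;
}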

Full report with attempted-vs-scored counts, per-tell frequency table, serving paths and the run manifest: eval · 22 April 2026. Different-token follow-up: eval · 24 April 2026. Every run: /evals. How to read these numbers: the run's own reading guide, or the general methodology.


Four pieces

  1. Named taxonomy

    Thirty-nine concrete slop tells across web, graphic and typographic surfaces. Enforced by 35 HTML/CSS rules, 3 SVG rules, and 14 vision-critic rules on rendered pixels. Read the taxonomy.

  2. Style tokens

    Ten curated design directions spanning Swiss-Editorial, Manual SF, Neubrutalist-Gumroad, Post-Digital, Monochrome-Editorial, Memphis-Clash, Heisei-Retro, Bauhaus-Revival, Editorial-Illustration and Ad-Creative-Collision. Each declares its own forbidden list, required quirks and reference lineage.

  3. Brief compiler

    ahd compile takes a structured intent and emits a token-anchored system prompt for any LLM. Draft mode for exploration, final mode for single-shot output. See how.

  4. Empirical eval

    Raw-vs-compiled controlled comparison across Claude Opus 4.7, GPT-5, Gemini 3 Pro, Llama 3.3 70B, Llama 4 Scout, Mistral Small 3.1, Qwen 2.5 Coder, DeepSeek R1, and image generators FLUX.1 schnell, SDXL Lightning and DreamShaper. Attempted, extracted, scored counts published. Negative results first-class.
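
A hypothetical sketch of the per-cell bookkeeping those published counts imply; the field names are illustrative, not the report schema.

// Hypothetical shape for one published cell.
interface CellResult {
  model: string;
  token: string;                      // style token the compiled prompt targets
  condition: "raw" | "compiled";
  attempted: number;                  // prompts sent
  extracted: number;                  // responses that yielded renderable output
  scored: number;                     // samples the linter or vision critic scored
  meanTells: number;                  // mean slop tells per scored sample
}

// Negative results stay in the table: the attrition from attempted to
// extracted to scored is itself part of what each report publishes.
function attrition(cell: CellResult): string {
  return `${cell.model} ${cell.condition}: ${cell.attempted} attempted, ` +
         `${cell.extracted} extracted, ${cell.scored} scored`;
}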