AHD · Artificial Human Design
Make it specific.
A guardrail and evaluation layer for AI-generated design. A named taxonomy of thirty-nine slop tells, a token-driven brief compiler, a deterministic linter, and a reproducible raw-vs-compiled eval loop. Across web UI and image generation.
Measured · 22 April 2026
Same brief, raw versus AHD-compiled, ten models, n=30 per cell, six hundred samples. Eight of ten cells reduce tells; median reduction across the positive cells is 59 percent. Per-model readings follow.
- gpt-oss 120B · 78%↓
- Mistral Small 3.1 · 62%↓
- Kimi K2.6 · 62%↓
- Gemini 3.1 Pro Preview · 62%↓
- Claude Opus 4.7 · 59%↓
- Llama 3.3 70B · regressed 117%↑
Best reduction in the run. 3.50 raw mean tells dropped to 0.77 compiled, 30 of 30 samples scored in both conditions. The compiled prompt moves this OSS model decisively off its median without inducing new tells. Served by Cloudflare Workers AI.
3.47 raw mean tells dropped to 1.30 compiled, 30 of 30 scored in both conditions. Matches the n=5 signal at tight confidence. The bento-and-gradient raw output collapses toward the Swiss editorial token under the compiled system prompt.
2.67 raw mean tells dropped to 1.00 compiled, 30 of 30 scored. The cell required a chat-template fix first: Kimi K2.6 on Cloudflare defaults thinking-mode on, and a 9 KB system prompt exhausted the token budget before any visible output. After patching the runner to pass thinking: false, the cell ran clean. This is a serving-layer defect documented in the serving-tells catalog, separate from the design-slop taxonomy.
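A minimal sketch of the runner patch described above, assuming the Cloudflare Workers AI REST shape (POST to /accounts/{account_id}/ai/run/{model}); the model slug and prompts here are placeholders, and only the thinking: false flag is taken from the run itself:

```python
import json

def build_run_request(account_id: str, model: str,
                      system_prompt: str, user_prompt: str) -> tuple[str, str]:
    """Build the URL and JSON body for a Workers AI run request.

    "thinking": False is the serving-layer fix: it stops default
    thinking-mode from consuming the token budget before the ~9 KB
    compiled system prompt produces any visible output.
    """
    url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/{model}"
    body = json.dumps({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "thinking": False,  # disable thinking-mode for this cell
    })
    return url, body
```

The same payload shape works for any model slug on that serving path; only thinking-capable models need the flag.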
2.97 raw mean tells dropped to 1.13 compiled, 30 of 30 scored. Served via Gemini CLI, which is the path most humans actually use for this model today, so the CLI measurement is more ecologically valid than the raw HTTP API would be.
1.80 raw mean tells dropped to 0.73 compiled, 30 of 30 scored. Served via Claude Code CLI. The n=30 figure is tighter but lower: the n=5 reading reported a 100 percent reduction at ±35-point uncertainty, and 59 percent is the real figure. Same behaviour, measured at a resolution we can now trust.
0.28 raw mean tells rose to 0.60 compiled. This reproduces the same-direction regression measured at n=5 on both Cloudflare and Hugging Face serving paths in the 21 April cross-provider run. Llama 3.3's raw output is typographically thin; the compiled brief elicits a richer page with more decision surface, which trips more rules. Framework response: on an editorial brief, do not route to this model.
Full report with attempted-vs-scored counts, per-tell frequency table, serving paths and the run manifest: eval · 22 April 2026. Prior runs: every run. How to read these numbers: the run's own reading guide, or the general methodology.
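The headline percentages are plain relative reductions in mean tells per sample. A short sketch reproducing two of the figures above from their published raw and compiled means:

```python
def pct_reduction(raw_mean: float, compiled_mean: float) -> int:
    """Relative reduction in mean tells, as a rounded percentage.

    Negative values indicate a regression: the compiled condition
    produced more tells than raw.
    """
    return round((raw_mean - compiled_mean) / raw_mean * 100)

# Figures from the readings above:
print(pct_reduction(3.50, 0.77))  # gpt-oss 120B -> 78
print(pct_reduction(1.80, 0.73))  # Claude Opus 4.7 -> 59
```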
Four pieces
- Named taxonomy · Thirty-nine concrete slop tells across web, graphic and typographic surfaces. Enforced by 35 HTML/CSS rules, 3 SVG rules and 14 vision-critic rules on rendered pixels. Read the taxonomy.
- Style tokens · Ten curated design directions spanning Swiss-Editorial, Manual SF, Neubrutalist-Gumroad, Post-Digital, Monochrome-Editorial, Memphis-Clash, Heisei-Retro, Bauhaus-Revival, Editorial-Illustration and Ad-Creative-Collision. Each declares its own forbidden list, required quirks and reference lineage.
- Brief compiler · ahd compile takes a structured intent and emits a token-anchored system prompt for any LLM. Draft mode for exploration, final mode for single-shot output. See how.
- Empirical eval · Raw-vs-compiled controlled comparison across Claude Opus 4.7, GPT-5, Gemini 3 Pro, Llama 3.3 70B, Llama 4 Scout, Mistral Small 3.1, Qwen 2.5 Coder, DeepSeek R1, and the image generators FLUX.1 schnell, SDXL Lightning and DreamShaper. Attempted, extracted and scored counts published. Negative results are first-class.
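The HTML/CSS rules behind the taxonomy above are deterministic checks on source text, not model judgments. A minimal sketch of what one such rule could look like — the rule name and pattern here are illustrative, not one of AHD's actual 35 rules — flagging the gradient tell mentioned in the Mistral reading:

```python
import re

# Hypothetical rule in the spirit of the taxonomy: count
# linear-gradient backgrounds, a common generated-design tell.
GRADIENT_RULE = re.compile(r"linear-gradient\([^)]*\)", re.IGNORECASE)

def count_gradient_tells(css: str) -> int:
    """Deterministic: the same input always yields the same tell count."""
    return len(GRADIENT_RULE.findall(css))

slop = ".hero { background: linear-gradient(135deg, #7b2ff7, #f107a3); }"
clean = ".hero { background: #111; color: #fafafa; }"
print(count_gradient_tells(slop))   # 1
print(count_gradient_tells(clean))  # 0
```

Determinism is what makes the raw-vs-compiled comparison reproducible: re-linting the same sample can never move a cell's score.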