AHD · Artificial Human Design

Make it specific.

A guardrail and evaluation layer for AI-generated design: a named taxonomy of thirty-nine slop tells, a token-driven brief compiler, a deterministic linter, and a reproducible raw-vs-compiled eval loop, across web UI and image generation.


Measured · 22 April 2026

Same brief, raw versus AHD-compiled, ten models, n=30 per cell, six hundred samples. Eight of ten cells reduce tells. Median reduction 59 percent across the positive cells. Click any row for the per-model reading.

gpt-oss 120B
78%↓

Best reduction in the run. 3.50 raw mean tells dropped to 0.77 compiled, 30 of 30 samples scored in both conditions. The compiled prompt moves this OSS model decisively off its median without inducing new tells. Served by Cloudflare Workers AI.

Mistral Small 3.1
62%↓

3.47 raw mean tells dropped to 1.30 compiled, 30 of 30 scored in both conditions. Matches the direction of the earlier n=5 signal, now at tighter confidence. The bento-and-gradient raw output collapses toward the Swiss editorial token under the compiled system prompt.

Kimi K2.6
62%↓

2.67 raw mean tells dropped to 1.00 compiled, 30 of 30 scored. The cell required a chat-template fix first: Kimi K2.6 on Cloudflare defaults to thinking mode on, and a 9 KB system prompt exhausted the token budget before any visible output. After patching the runner to pass thinking: false, the cell ran clean. This is a serving-layer defect documented in the serving-tells catalog, separate from the design-slop taxonomy.
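The fix amounts to one extra field on the runner's request body. A minimal sketch, assuming a Workers-AI-style chat payload; aside from the thinking flag the run notes describe, the field names here are illustrative, not the project's actual runner code:

```python
def build_run_payload(system_prompt: str, brief: str, thinking: bool = False) -> dict:
    # Default thinking off so a large (~9 KB) compiled system prompt is not
    # consumed by hidden reasoning tokens before any visible output.
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": brief},
        ],
        "thinking": thinking,
    }

payload = build_run_payload("You are a compiled AHD brief.", "Design a landing page.")
```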

Gemini 3.1 Pro Preview
62%↓

2.97 raw mean tells dropped to 1.13 compiled, 30 of 30 scored. Served via Gemini CLI, which is the path most humans actually use for this model today, so the CLI measurement is more ecologically valid than the raw HTTP API would be.

Claude Opus 4.7
59%↓

1.80 raw mean tells dropped to 0.73 compiled, 30 of 30 scored. Served via Claude Code CLI. The n=30 number is tighter but lower: the n=5 reading reported a 100 percent reduction at ±35-point uncertainty; 59 percent is the real figure. Same behaviour, measured to a resolution we can now trust.

Llama 3.3 70B
regressed 117%↑

0.28 raw mean tells rose to 0.60 compiled. This reproduces the same-direction regression measured at n=5 on both Cloudflare and Hugging Face serving paths in the 21 April cross-provider run. Llama 3.3's raw output is typographically thin; the compiled brief elicits a richer page with more decision surface, which trips more rules. Framework response: on an editorial brief, do not route to this model.
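The percentage in each cell is plain percent change in mean tells between conditions. A minimal check against the means quoted above (published figures are computed from unrounded means, so some cells differ by a point from what the two-decimal means give):

```python
def pct_reduction(raw_mean: float, compiled_mean: float) -> float:
    """Percent drop in mean tells from raw to compiled; negative means regression."""
    return (raw_mean - compiled_mean) / raw_mean * 100

print(round(pct_reduction(3.50, 0.77)))  # gpt-oss 120B -> 78
print(round(pct_reduction(1.80, 0.73)))  # Claude Opus 4.7 -> 59
```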

Full report with attempted-vs-scored counts, per-tell frequency table, serving paths and the run manifest: eval · 22 April 2026. Prior runs: every run. How to read these numbers: the run's own reading guide, or the general methodology.


Four pieces

  1. Named taxonomy

    Thirty-nine concrete slop tells across web, graphic and typographic surfaces. Enforced by 35 HTML/CSS rules, 3 SVG rules, and 14 vision-critic rules on rendered pixels. Read the taxonomy.

  2. Style tokens

    Ten curated design directions spanning Swiss-Editorial, Manual SF, Neubrutalist-Gumroad, Post-Digital, Monochrome-Editorial, Memphis-Clash, Heisei-Retro, Bauhaus-Revival, Editorial-Illustration and Ad-Creative-Collision. Each declares its own forbidden list, required quirks and reference lineage.

  3. Brief compiler

    ahd compile takes a structured intent and emits a token-anchored system prompt for any LLM. Draft mode for exploration, final mode for single-shot output. See how.

  4. Empirical eval

    Raw-vs-compiled controlled comparison across Claude Opus 4.7, GPT-5, Gemini 3 Pro, Llama 3.3 70B, Llama 4 Scout, Mistral Small 3.1, Qwen 2.5 Coder, DeepSeek R1, and image generators FLUX.1 schnell, SDXL Lightning and DreamShaper. Attempted, extracted and scored counts published. Negative results are first-class.
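The four pieces compose into one loop: a style token declares constraints, the compiler turns intent plus token into a system prompt, and the linter scores the generated output. A minimal sketch of that loop; every field, rule name and pattern below is illustrative, not AHD's actual token schema or ruleset:

```python
import re

# Hypothetical token shape: forbidden list, required quirks, reference lineage.
SWISS_EDITORIAL = {
    "name": "Swiss-Editorial",
    "forbidden": ["purple gradient hero", "glassmorphism card"],
    "quirks": ["single grotesk typeface", "hard 12-column grid"],
    "lineage": ["Müller-Brockmann grid systems"],
}

def compile_brief(intent: str, token: dict, mode: str = "final") -> str:
    """Emit a token-anchored system prompt (sketch of what `ahd compile` does)."""
    return "\n".join([
        f"Design brief ({mode} mode): {intent}",
        f"Style token: {token['name']}",
        "Never use: " + "; ".join(token["forbidden"]),
        "Always include: " + "; ".join(token["quirks"]),
        "Reference lineage: " + "; ".join(token["lineage"]),
    ])

def lint(html: str) -> list[str]:
    """Deterministic tell check on generated source (two invented example rules)."""
    rules = {
        "glassmorphism-blur": re.compile(r"backdrop-filter\s*:\s*blur", re.I),
        "purple-gradient-hero": re.compile(r"linear-gradient\([^)]*purple", re.I),
    }
    return [name for name, rx in rules.items() if rx.search(html)]

prompt = compile_brief("landing page for a type foundry", SWISS_EDITORIAL)
print(lint('<div style="backdrop-filter: blur(10px)">'))  # ['glassmorphism-blur']
```

Draft mode would loosen the "Never use" list for exploration; final mode keeps it strict for single-shot output, matching the two modes named above.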