AHD · Artificial Human Design · v0.5.0-beta
Make it specific.
A guardrail and evaluation layer for AI-generated design. A named taxonomy of thirty-eight slop tells, a token-driven brief compiler, a deterministic linter, and a reproducible raw-vs-compiled eval loop. Across web UI and image generation.
Measured · 21 April 2026
Same brief, same seed, raw vs AHD-compiled. Five models, n=5 per cell. Aggregate 2.08 → 1.04 tells per page. Per-model readings follow the table.
- Claude Opus 4.7 · 100%↓
- Mistral Small 3.1 · 62%↓
- Llama 4 Scout 17B · 50%↓
- Qwen 2.5 Coder 32B · 0%
- Llama 3.3 70B · 150%↑ (regressed)
Claude Opus 4.7 dropped from a mean of 1.20 slop tells per raw page to zero under the compiled prompt. Five of five compiled samples scored clean across the twenty-eight-rule source linter. This is the cleanest result in the run; on an editorial-landing brief, Claude appears to respect every element of the compiled system prompt.
Mistral Small 3.1 dropped from 3.25 raw mean tells to 1.25 compiled. Four of five samples were scored in each cell: one raw and one compiled sample failed extraction (no usable HTML in the response). The gap between attempted and scored is published in the report rather than silently filtered.
Llama 4 Scout dropped from 2.40 raw mean tells to 1.20 compiled, with five of five samples scored in both conditions. A clean directional improvement on a small model: the compiled brief moved Scout toward the token's declared structure (grid, type pairing) without tripping new rules.
Qwen 2.5 Coder's raw and compiled means were identical at 2.80 tells per page: the compiled system prompt produced output that trips the same rules at the same rate as the raw prompt. Qwen is a code-specialised model and appears to hold its defaults across prompt changes on this brief. The framework's correct move here is to route around Qwen for this token, not to claim the intervention worked.
Llama 3.3 70B went from 0.40 raw mean tells to 1.00 compiled, a regression. Its raw output is a typographically thin page that trips few rules because it declares little; the compiled prompt elicits a richer page with more decision surface, which trips more rules. This is not a framework failure: it is the framework correctly reporting that the compiled layer does not help every model.
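For concreteness, the arithmetic behind these readings fits in a few lines of Python. This is a sketch, not the AHD eval code: the per-sample tell counts below are invented to reproduce the published Mistral means, and only the means, deltas, and scored counts come from the report. Means are taken over scored samples only; the attempted-vs-scored gap travels with the result.

```python
# Illustrative aggregation: mean tells per page and percent delta,
# computed over scored samples only. Not the AHD eval implementation.

def summarise(raw_tells, compiled_tells, attempted=5):
    """raw_tells / compiled_tells: per-sample tell counts that survived extraction."""
    raw_mean = sum(raw_tells) / len(raw_tells)
    compiled_mean = sum(compiled_tells) / len(compiled_tells)
    delta_pct = (compiled_mean - raw_mean) / raw_mean * 100 if raw_mean else float("nan")
    return {
        "raw_mean": raw_mean,                       # e.g. 3.25
        "compiled_mean": compiled_mean,             # e.g. 1.25
        "delta_pct": round(delta_pct),              # e.g. -62
        "scored": (len(raw_tells), len(compiled_tells)),
        "attempted": (attempted, attempted),        # published, never filtered
    }

# Mistral Small 3.1: one raw and one compiled sample failed extraction,
# so four of five samples are scored per condition. The counts here are
# hypothetical values consistent with the published 3.25 → 1.25 means.
print(summarise([4, 3, 3, 3], [1, 1, 2, 1]))
```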
Full report with attempted-vs-scored counts, per-tell frequency table, and the run manifest: eval · 21 April 2026. How to read the numbers: methodology.
Four pieces
- Named taxonomy · Thirty-eight concrete slop tells across web, graphic, and typographic surfaces, enforced by 28 HTML/CSS rules, 3 SVG rules, and 13 vision-critic rules on rendered pixels. Read the taxonomy. A minimal rule sketch follows this list.
- Style tokens · Ten curated design directions spanning Swiss-Editorial, Manual SF, Neubrutalist-Gumroad, Post-Digital, Monochrome-Editorial, Memphis-Clash, Heisei-Retro, Bauhaus-Revival, Editorial-Illustration and Ad-Creative-Collision. Each declares its own forbidden list, required quirks and reference lineage, sketched below the list.
- Brief compiler · ahd compile takes a structured intent and emits a token-anchored system prompt for any LLM. Draft mode for exploration, final mode for single-shot output. See how, and see the compiler sketch after this list.
- Empirical eval · Raw-vs-compiled controlled comparison across Claude Opus 4.7, GPT-5, Gemini 3 Pro, Llama 3.3 70B, Llama 4 Scout, Mistral Small 3.1, Qwen 2.5 Coder, DeepSeek R1, and the image generators FLUX.1 schnell, SDXL Lightning and DreamShaper. Attempted, extracted, and scored counts published. Negative results first-class.
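To make the deterministic source linter concrete: a rule in this style can be as small as a compiled pattern plus a message, with each match counted as one tell. The rule below is a hypothetical illustration (invented ID, pattern, and message), not one of the actual 28 HTML/CSS rules.

```python
# Hypothetical shape of one deterministic source-linter rule.
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    rule_id: str
    pattern: re.Pattern
    message: str

    def check(self, source: str) -> list[str]:
        # Every match of the pattern counts as one tell on this page.
        return [f"{self.rule_id}: {self.message}" for _ in self.pattern.finditer(source)]

# Invented example rule: flag the stock 135deg hero gradient.
GENERIC_GRADIENT = Rule(
    rule_id="web-xx",
    pattern=re.compile(r"linear-gradient\(\s*135deg", re.IGNORECASE),
    message="default 135deg gradient, a common generation tell",
)

page = "<div style='background: linear-gradient(135deg, #667eea, #764ba2)'></div>"
print(GENERIC_GRADIENT.check(page))  # one tell
```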
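Each style token is described as declaring a forbidden list, required quirks, and a reference lineage. A minimal sketch of that shape, with invented field values rather than the real Swiss-Editorial token:

```python
# Hypothetical token structure; the three declared fields come from the
# description above, the example values are made up.
from dataclasses import dataclass

@dataclass(frozen=True)
class StyleToken:
    name: str
    forbidden: tuple[str, ...]        # moves the compiled prompt must rule out
    required_quirks: tuple[str, ...]  # moves the compiled prompt must demand
    lineage: tuple[str, ...]          # reference works the token descends from

SWISS_EDITORIAL = StyleToken(
    name="Swiss-Editorial",
    forbidden=("decorative gradients", "rounded hero cards"),
    required_quirks=("strict modular grid", "single type pairing"),
    lineage=("Müller-Brockmann grid systems",),
)
```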
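And the compiler, reduced to its described contract: structured intent in, token-anchored system prompt out, with a draft mode and a final mode. This reuses the StyleToken sketch above and illustrates the contract only; it is not the actual ahd compile implementation, and the prompt wording is invented.

```python
# Illustrative compile step; assumes StyleToken / SWISS_EDITORIAL from
# the sketch above.

def compile_brief(intent: dict, token: "StyleToken", mode: str = "final") -> str:
    lines = [
        f"You are producing a {intent['surface']} for: {intent['goal']}.",
        f"Style token: {token.name}.",
        "Forbidden: " + "; ".join(token.forbidden) + ".",
        "Required quirks: " + "; ".join(token.required_quirks) + ".",
        "Lineage: " + "; ".join(token.lineage) + ".",
    ]
    if mode == "draft":
        lines.append("Draft mode: propose variations before committing.")
    else:
        lines.append("Final mode: one single-shot, production-ready output.")
    return "\n".join(lines)

print(compile_brief(
    {"surface": "landing page", "goal": "an editorial newsletter"},
    SWISS_EDITORIAL,
))
```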