Contribute as an autonomous agent.
The agent-targeted contribution contract. Composes with the human submission protocol; it does not replace it. The human contract names what a submission is; this page names what an agent must do before pressing submit.
This page mirrors CONTRIBUTING-AGENTS.md in the framework repo. If this page drifts from that file, the file wins.
Trust gate (read this first)
You are not the gate. The maintainer is.
Every submitted run is re-linted on receipt against the same samples the agent submitted. The numbers a contributor publishes are not authoritative; the maintainer's re-run is. An agent that submits fabricated numbers is detected at the re-lint, not at the social layer. The optimal agent strategy is to submit honest, complete manifests; cleverness has no payoff.
The maintainer also uses provider request-IDs (Anthropic
request-id, OpenAI x-request-id,
Cloudflare cf-ray, Google
x-goog-api-client-request-id) to verify that the
API calls actually happened. Do not strip these from the
envelope.
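A minimal sketch of a pre-submission check that every envelope still carries one of the request-ID headers named above. It assumes each `*.envelope.json` file keeps the response headers in a top-level `headers` object; that key and the function name `envelopes_missing_request_id` are hypothetical, since the envelope layout is not specified on this page.

```python
import json
from pathlib import Path

# Header names taken from the list above; matching is case-insensitive.
REQUEST_ID_HEADERS = {
    "request-id",
    "x-request-id",
    "cf-ray",
    "x-goog-api-client-request-id",
}


def envelopes_missing_request_id(run_dir):
    """Return paths of *.envelope.json files with no known request-ID header.

    Assumption: each envelope stores response headers in a top-level
    "headers" object. Adjust the lookup to the real envelope layout.
    """
    missing = []
    for path in sorted(Path(run_dir).rglob("*.envelope.json")):
        envelope = json.loads(path.read_text(encoding="utf-8"))
        headers = envelope.get("headers") or {}
        present = {name.lower() for name, value in headers.items() if value}
        if not present & REQUEST_ID_HEADERS:
            missing.append(str(path))
    return missing
```

A non-empty result is something to surface to the principal, not something to patch by hand.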
Required steps, in order
1. Read briefs/landing.yml or the brief you intend to submit against.
2. Read tokens/<token-id>.yml. Note any lint-overrides.disable list.
3. Run: ahd eval-live <token> --brief <path> --models <spec,...> --n 30 --out <dir>
4. Run: ahd eval <token> --samples <dir> --report <out>.md
5. Verify: ahd lint --token <token> <a sample>.html (sanity)
6. Validate: ahd validate-submission <run-dir>
   (current schema must PASS; target warnings non-blocking)
7. Stop. Surface to your principal:
   - Headline numbers (per-cell raw mean, compiled mean, reduction)
   - Any cells with extractionFailed > 0 OR errored > 0
   - Any cells with reduction < 0 (regressions are first-class data)
   - validate-submission output (target-schema warnings)
   - Full output directory path
8. Wait for explicit human authorisation before opening a PR.
A submission that bypasses step 8 is not a submission, regardless of how clean the numbers look.
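The command sequence in steps 3 to 6 can be sketched as a function that only builds argument lists, so an agent can log or review them before running anything. The function name and parameters are hypothetical; the flags are the ones shown in the steps above, and the sketch assumes the `--samples` directory is the same directory passed to `--out`. It deliberately stops before step 7: surfacing results and waiting for authorisation are not automated.

```python
def pipeline_commands(token, brief, models, out_dir, report, sample, run_dir):
    """Return argv lists for steps 3 to 6, in order. Nothing is executed."""
    return [
        # Step 3: live run, n = 30 per cell.
        ["ahd", "eval-live", token, "--brief", brief,
         "--models", ",".join(models), "--n", "30", "--out", out_dir],
        # Step 4: score the samples and write the report.
        ["ahd", "eval", token, "--samples", out_dir, "--report", report],
        # Step 5: sanity-lint one sample with the token passed explicitly.
        ["ahd", "lint", "--token", token, sample],
        # Step 6: validate the run directory against the manifest schemas.
        ["ahd", "validate-submission", run_dir],
    ]
```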
Required envelope
The unit of submission is the directory tree
ahd eval-live produces:
<token>/
  manifest.json                    # canonical model ids + serving paths + request-IDs
  compiled-prompt.txt              # exact bytes sent to provider
  <cell-sanitized-id>/
    raw/
      sample-001.html              # extracted HTML
      sample-001.envelope.json     # full provider response, headers, finish reason, token usage
      ...
    compiled/
      sample-001.html
      sample-001.envelope.json
      ...
Minimum n per cell is 30. Three is acceptable for a
probe PR (probe: in the title) but never for a
publishable run.
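A minimal sketch of checking the minimum n against the tree above before submitting. It assumes samples are named `sample-*.html` under each cell's `raw/` and `compiled/` directories, as in the listing; the function name and the `kind` parameter are hypothetical.

```python
from pathlib import Path

# Minimums stated above: 30 for a publishable run, 3 for a probe PR.
MIN_N = {"probe": 3, "publishable": 30}


def undersized_cells(token_dir, kind="publishable"):
    """Map "<cell>/<arm>" to its sample count for every arm below the minimum n."""
    minimum = MIN_N[kind]
    short = {}
    for cell in sorted(p for p in Path(token_dir).iterdir() if p.is_dir()):
        for arm in ("raw", "compiled"):
            arm_dir = cell / arm
            count = len(list(arm_dir.glob("sample-*.html"))) if arm_dir.is_dir() else 0
            if count < minimum:
                short[f"{cell.name}/{arm}"] = count
    return short
```

A short count may reflect errored or extraction-failed samples rather than a short run; either way it is a finding to report, not a reason to drop the cell.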
Required manifest fields
The shape ahd validate-submission enforces. Pass the
current schema to be reviewable today; the
target schema names aspirational fields that surface
as warnings rather than rejections.
{
  "token": "<id>",
  "briefPath": "briefs/<file>.yml",
  "n": 30,
  "runAt": "<ISO-8601 UTC>",
  "models": [
    {
      "spec": "<runner>:<model>",
      "canonicalId": "<exact-model-id>",
      "sanitizedId": "<filesystem-safe>",
      "provider": "<runner-name>",
      "servingPath": "<full URL or @cf/.../...>",
      "cliVersion": "<x.y.z>",
      "runnerVersion": "<ahd-pkg-version>",
      "requestIds": ["<list>"]
    }
  ]
}
If your runner does not capture one of the target fields, do not fabricate it. Use the value the runner actually surfaces (omitting the field is acceptable while it remains target-state) and surface the gap to your principal so they can decide whether to submit a current-shape run or wait for the runner to be patched. Missing-field-with-disclosure is honest evidence; guessed-field-without-disclosure is fabrication.
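A minimal sketch of listing manifest gaps for disclosure. The field names come from the shape above; the function name is hypothetical. The sketch does not decide which fields are current-required and which are target-only: the schema files and `ahd validate-submission` remain authoritative for that.

```python
TOP_LEVEL_FIELDS = ("token", "briefPath", "n", "runAt", "models")
MODEL_FIELDS = (
    "spec", "canonicalId", "sanitizedId", "provider",
    "servingPath", "cliVersion", "runnerVersion", "requestIds",
)


def absent_manifest_fields(manifest):
    """List dotted paths of fields shown in the manifest shape but absent here.

    The result is a disclosure list for the principal, not a pass/fail verdict.
    """
    gaps = [name for name in TOP_LEVEL_FIELDS if name not in manifest]
    for index, model in enumerate(manifest.get("models", [])):
        gaps += [f"models[{index}].{name}" for name in MODEL_FIELDS if name not in model]
    return gaps
```

The output is meant to be reported as-is; filling a listed gap with a guessed value is exactly the fabrication described above.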
Forbidden behaviours
Each of these is an auto-reject on the maintainer side. An agent that does any of them will burn the principal's reputation.
- Post-processing samples: pretty-printing, whitespace stripping, re-wrapping, HTML re-emission. The envelope must reflect the model's raw output.
- Dropping cells with negative reductions, errors, or extraction failures. Negative results are first-class data.
- Submitting n = 1 or n = 2 outside an explicit probe: PR.
- Editing the compiled prompt to make a particular cell perform better. The prompt is the AHD compiler's output; tampering invalidates the run.
- Re-running until a clean cell appears. Stochastic models will always have a "good run" by chance; cherry-picking is the canonical fabrication fingerprint.
- Opening a PR without explicit human authorisation. "Full autonomy" does not extend to public-facing artefacts.
Token-aware lint
Two layers, with different guarantees. Don't conflate them.
Compiler contract (deterministic, default-on).
When ahd compile or ahd eval-live
builds a compiled prompt, the prompt template instructs the
model to emit <meta name="ahd-token" content="<id>">
in the document head. The token's
lint-overrides.disable[] block is authoritative:
when ahd lint or ahd eval is invoked
with --token <id> (or with token auto-detected
from the meta anchor), those rules are silenced.
Runner / operator fallback (best-effort).
Auto-detection of the active token from the meta tag depends on
the model having actually emitted the anchor as instructed.
Frontier models typically comply; smaller OSS models sometimes
drop the meta in favour of decorative content.
Pass --token <id> explicitly
at lint and eval time when you control the invocation. Do not
rely on auto-detection for submitted runs; the explicit flag is
the contract, the meta sniff is a convenience.
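The two layers can be sketched as follows. This is an illustration using Python's standard `html.parser`, not a description of how `ahd` itself detects the token; the names `detect_token` and `resolve_token` are hypothetical.

```python
from html.parser import HTMLParser


class _TokenMetaFinder(HTMLParser):
    """Records the content of the first <meta name="ahd-token"> tag seen."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.token is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == "ahd-token":
                self.token = attr_map.get("content")


def detect_token(html_text):
    """Best-effort sniff: returns None when the model dropped the anchor."""
    finder = _TokenMetaFinder()
    finder.feed(html_text)
    return finder.token


def resolve_token(html_text, explicit_token=None):
    """The explicit token is the contract; the meta sniff is only a fallback."""
    return explicit_token if explicit_token is not None else detect_token(html_text)
```

When `detect_token` returns `None` for a sample, that sample would have been linted without the token's overrides under auto-detection, which is why submitted runs pass the token explicitly.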
Reduced-motion in compiled output
The compiled prompt requires that any animation or transition
longer than 200ms is wrapped in
@media (prefers-reduced-motion: no-preference) or
provides an equivalent reduce-motion fallback (WCAG 2.3.3). The
ahd/respect-reduced-motion rule fires on output
that ignores this. If a model frequently emits ungated motion
under your runner, that is reportable as a model-behaviour
finding; do not silence the rule to make the cell read cleaner.
Probe versus publishable
| Kind | Min n | PR title prefix | Maintainer treatment |
|---|---|---|---|
| probe | 3 | probe: | Light review; not added to /evals. |
| publishable | 30 | evals: | Full re-lint review; published if clean. |
A publishable PR with n = 3 will be relabelled
probe.
What the agent must surface to its principal before submitting
This is the part automated tooling reliably gets wrong. The agent must present the principal with, at minimum:
- Run path: <abs path to output dir>
- Cells: <count> · samples per cell: <n> · total: <count*n*2>
- Per-cell verdict (raw mean → compiled mean → reduction%)
- Any cell with errored > 0 OR extractionFailed > 0
- Any cell where reduction < 0 (a regression is a result, not a problem)
- Anything the agent silently changed (prompt overrides, lint-override
  edits, sample renames, post-processing). Disclose, don't hide.
If the principal does not respond, the agent does not submit. The default action when there is doubt is wait.
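A minimal sketch of assembling the numeric part of that summary. The input shape (a list of dicts with `id`, `rawMean`, `compiledMean`, `errored`, `extractionFailed`) and the function name are hypothetical. The reduction formula, a relative drop from raw mean to compiled mean, is an assumption; use whatever definition the `ahd eval` report actually applies.

```python
def principal_summary(run_dir, n, cells):
    """Build summary lines: run path, totals, then one verdict line per cell."""
    lines = [
        f"Run path: {run_dir}",
        f"Cells: {len(cells)} · samples per cell: {n} · total: {len(cells) * n * 2}",
    ]
    for cell in cells:
        raw, compiled = cell["rawMean"], cell["compiledMean"]
        # Assumed definition: percentage drop from raw to compiled.
        reduction = f"{(raw - compiled) / raw * 100:.1f}%" if raw else "n/a"
        flags = []
        if cell["errored"] > 0 or cell["extractionFailed"] > 0:
            flags.append("errors/extraction failures")
        if compiled > raw:
            flags.append("regression")
        suffix = f"  [{', '.join(flags)}]" if flags else ""
        lines.append(f"{cell['id']}: {raw:.2f} → {compiled:.2f} → {reduction}{suffix}")
    return lines
```

Flagged cells stay in the output rather than being filtered, so regressions and failures reach the principal alongside the clean cells. Silent changes still need to be disclosed separately in prose.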
When in doubt
Ask the principal. If the principal is unreachable, write a
status file at <run-dir>/.ahd-blocker.json and
stop. A run that sits on disk for a day costs nothing; a bad
submission costs the framework's credibility.
The status file is named .ahd-blocker.json so
multiple agents working on the same machine produce a consistent
artefact the principal can inspect by directory. Shape:
{
  "createdAt": "<ISO-8601 UTC>",
  "agent": "<agent identifier, e.g. 'claude-code/2.x'>",
  "runDir": "<absolute path>",
  "stage": "compile | eval-live | eval | lint | publish",
  "completed": ["compile", "eval-live"],
  "blockedOn": "<one-line reason>",
  "needsFromPrincipal": [
    "<concrete ask, e.g. 'rotate CF_API_TOKEN'>"
  ],
  "summary": {
    "cells": <count>,
    "n": <count>,
    "samplesOnDisk": <count>,
    "errored": <count>,
    "extractionFailed": <count>
  }
}
Required fields: createdAt, agent,
stage, blockedOn. Everything else is
best-effort. Multiple blocker files are allowed (numbered
.ahd-blocker-1.json,
.ahd-blocker-2.json); the principal scans the run
directory to surface them.
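A minimal sketch of writing a blocker file with the four required fields. The function name is hypothetical, and the numbering behaviour is an assumption: this sketch uses `.ahd-blocker.json` first and falls back to `.ahd-blocker-1.json`, `.ahd-blocker-2.json`, and so on, so that an existing blocker is never overwritten.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_blocker(run_dir, agent, stage, blocked_on, **optional):
    """Write a blocker file into run_dir and return its path.

    Required fields are always set; best-effort fields such as
    needsFromPrincipal or summary are passed through via **optional.
    """
    run_path = Path(run_dir)
    target = run_path / ".ahd-blocker.json"
    index = 1
    while target.exists():
        target = run_path / f".ahd-blocker-{index}.json"
        index += 1
    payload = {
        "createdAt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "agent": agent,
        "runDir": str(run_path.resolve()),
        "stage": stage,
        "blockedOn": blocked_on,
        **optional,
    }
    target.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return target
```

After writing the file, the agent stops; it does not retry the blocked stage or proceed to a PR.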
Canonical source for this contract is
CONTRIBUTING-AGENTS.md
in the framework repo. The human-targeted submission protocol
lives at
/evals/contribute; the JSON
Schema enforcement is at
schema/manifest.current.schema.json and
schema/manifest.target.schema.json in the framework
repo, exercised by the
ahd validate-submission CLI.