AHD · Replay
Verify the inputs.
Every published AHD eval ships a replay block: enough information for a third party to confirm the run consumed the inputs we say it consumed, and to re-issue the same command at the named version. Replay does not promise bit-for-bit reproduction of model output (frontier providers update models silently); it promises that the inputs you can challenge are the inputs the run ate. Verifiability first, replayability second.
Two surfaces, one source of truth
The block lands in two places alongside every report. The first
is a fenced ```yaml ahd-replay block at the top of
the markdown report (human-readable, derived). The second is a
<report>.replay.json sidecar next to the
markdown (canonical, schema-validated against
replay.schema.json).
The JSON sidecar is authoritative. The markdown is a derived view; redactions or formatting choices in the markdown never reflect back into the JSON.
Fields
schema_version: 1 # bump on any breaking change
kind: eval-live | critique | eval-image
ahd_version: 0.11.0 # framework version at run time
ahd_commit: <40-hex> # null when not in a git repo
git_dirty: true | false # null when ahd_commit is null
node_version: v22.22.2
platform: darwin-arm64
invoked_at: <ISO-8601 UTC>
argv: [ ... ] # full process.argv as a list
token:
  path: tokens/swiss-editorial.yml
  hash: sha256:<64-hex>
brief: # null on critique runs
  path: briefs/landing.yml
  hash: sha256:<64-hex>
sampling:
  n: 30
  temperature: null # null when not set; per-call default
  seed: null # global seed; null when seeds vary per sample
models:
  - id: cf:@cf/google/gemma-4-26b-a4b-it
    provider: cloudflare-workers-ai
    provider_request_ids: [ "req_..." ] # one id per successful provider call.
    # Empty for CLI-spawned runners
    # (claude-code, gemini-cli, codex)
    # since there is no HTTP envelope.
conditions:
  requested: [ raw, compiled ]
  effective: [ raw, compiled ]
Hash contract
Hashes use SHA-256 in the form sha256:<lower-hex 64>.
Two hash modes, one per input shape.
Structured inputs (token, parsed-YAML brief)
The hash is taken over the canonical-JSON serialisation of the resolved object:
- Parse YAML / JSON to a JS value.
- Recursively sort object keys lexicographically. Arrays preserve order (their order is semantic).
- JSON.stringify with no whitespace.
- SHA-256 the resulting bytes.
A YAML file whose keys are reordered hashes identically as long
as its parsed value is unchanged. Comments, whitespace and key
ordering do not affect the hash. The contract is the
parsed value, not the file. Reference implementation:
canonicalizeJson + hashJsonCanonical
in the framework's src/eval/replay.ts.
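The four steps above can be sketched as follows. This is a minimal illustration, not the framework's reference implementation: the names sortKeysDeep and hashCanonical are hypothetical, the input is assumed to be an already-parsed JS value, and "lexicographic" is assumed to mean the default order of Array.prototype.sort.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Rebuild the value with object keys inserted in sorted order.
// Arrays keep their order because it is semantic.
function sortKeysDeep(value: Json): Json {
  if (Array.isArray(value)) return value.map(sortKeysDeep);
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeysDeep(value[key]);
    }
    return sorted;
  }
  return value;
}

// Hash the whitespace-free JSON of the key-sorted value.
function hashCanonical(value: Json): string {
  const canonical = JSON.stringify(sortKeysDeep(value));
  const hex = createHash("sha256").update(canonical, "utf8").digest("hex");
  return `sha256:${hex}`;
}

const first = hashCanonical({ name: "swiss", scale: [1, 2, 3] });
const second = hashCanonical({ scale: [1, 2, 3], name: "swiss" });
console.log(first === second); // true: key order does not affect the hash
```

One caveat with this sketch: JavaScript objects always enumerate integer-like keys (such as "10") in ascending numeric order before other keys, so the serialised order for those keys is numeric rather than lexicographic. The output is still independent of the input's key order.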
Raw-bytes inputs (markdown briefs)
When the brief is plain markdown (no parser involved), the hash
is taken over the exact file bytes.
verify-replay tries the raw-bytes hash first; if
that fails it falls back to canonical-JSON in case the brief is
structured. The two-step check mirrors the two hash modes the
helper supports; it is not a fallback for malformed input.
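A rough sketch of that two-step check, under stated assumptions: checkInput, sortKeysDeep and the returned labels are hypothetical names, and the parser is passed in as a parameter (JSON.parse stands in here for whatever YAML parser the framework uses).

```typescript
import { createHash } from "node:crypto";

const sha256 = (data: string | Buffer): string =>
  "sha256:" + createHash("sha256").update(data).digest("hex");

// Sort object keys recursively; arrays keep their order.
function sortKeysDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sortKeysDeep);
  if (value !== null && typeof value === "object") {
    const src = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(src).sort()) out[key] = sortKeysDeep(src[key]);
    return out;
  }
  return value;
}

type CheckResult = "raw-bytes" | "canonical-json" | "drift";

// Try the exact file bytes first, then the canonical-JSON form of the parsed value.
function checkInput(
  fileBytes: Buffer,
  recorded: string,
  parse: (text: string) => unknown,
): CheckResult {
  if (sha256(fileBytes) === recorded) return "raw-bytes";
  try {
    const parsed = parse(fileBytes.toString("utf8"));
    if (sha256(JSON.stringify(sortKeysDeep(parsed))) === recorded) {
      return "canonical-json";
    }
  } catch {
    // Content that does not parse simply fails the structured check.
  }
  return "drift";
}

const onDisk = Buffer.from('{"b": 1, "a": 2}');
const recordedHash = sha256('{"a":2,"b":1}');
console.log(checkInput(onDisk, recordedHash, JSON.parse)); // "canonical-json"
```

In this sketch, input that fails to parse is reported as drift rather than rescued, which matches the point that the second path is not a fallback for malformed input.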
What changes between runs
| Field | Stable across runs of the same command? |
|---|---|
| token.hash | Yes, until the token file is edited. |
| brief.hash | Yes, until the brief is edited. |
| ahd_commit | Yes, until the framework moves. |
| invoked_at | No (per-run wall clock). |
| provider_request_ids | No (provider-assigned per call). |
| sampling.n / models / conditions | Yes, command-controlled. |
When ahd verify-replay reports drift, one of the
stable fields is no longer stable: the token or brief on disk
hashes to something other than the recorded value. The verifier
walks up from the report's path to find the framework's repo
root (package.json with name
@adastracomputing/ahd), so it works from any cwd or
checkout.
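The upward walk might look something like the sketch below. The function name findFrameworkRoot is hypothetical, and the handling of unreadable manifests is an assumption; only the package name comes from the text above. The usage portion builds a throwaway directory tree so the example runs anywhere.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join, resolve } from "node:path";

const FRAMEWORK_NAME = "@adastracomputing/ahd";

// Walk up from the report's directory until a package.json with the
// framework's name is found, or the filesystem root is reached.
function findFrameworkRoot(reportPath: string): string | null {
  let dir = dirname(resolve(reportPath));
  for (;;) {
    const manifest = join(dir, "package.json");
    if (existsSync(manifest)) {
      try {
        const pkg = JSON.parse(readFileSync(manifest, "utf8")) as { name?: unknown };
        if (pkg.name === FRAMEWORK_NAME) return dir;
      } catch {
        // Assumption: skip unreadable or malformed manifests and keep walking.
      }
    }
    const parent = dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}

// Usage with a temporary directory standing in for a checkout.
const fakeRoot = mkdtempSync(join(tmpdir(), "ahd-root-"));
writeFileSync(join(fakeRoot, "package.json"), JSON.stringify({ name: FRAMEWORK_NAME }));
const evalDir = join(fakeRoot, "docs", "evals");
mkdirSync(evalDir, { recursive: true });
const found = findFrameworkRoot(join(evalDir, "report.md"));
console.log(found === fakeRoot); // true
```

Because the search starts from the report's own path rather than the current working directory, the result does not depend on where the command is invoked.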
What replay does not guarantee
- Bit-for-bit reproduction: Frontier providers update models silently; running the same command at the same git commit may produce different samples a week later. The block is a verifiability contract first and a replayability contract second.
- Provider-side audit: The provider_request_ids array holds one id per successful provider call: anthropic request-id, openai x-request-id, cloudflare cf-ray, google x-goog-request-id. With those ids you can ask the provider to verify a specific request existed at the recorded time. AHD does not save the provider's response payload, so the request id alone is not enough to recover the response. CLI-spawned runners (claude-code, gemini-cli, codex) leave the array empty by design; there is no HTTP envelope to read.
- Determinism inside the runner: AHD's per-sample seed is i+1 today (incremental, not cryptographic). Different n yields different sets of seeds. Known limitation; future versions may capture per-sample seeds.
Markdown redactions
The markdown rendering surfaces only a count
(provider_request_ids: 3 captured) rather than the
values, until each provider's ids are confirmed safe to publish.
The full ids live in the JSON sidecar; if a published report's
.replay.json is committed to a public repo, the ids
are public.
The argv field is rendered as a quoted shell command
in the markdown (in the trailing replay this run block)
but stored as an array in the JSON to avoid quoting ambiguity.
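To illustrate why the array form is unambiguous while the rendered string needs care, here is a small sketch of turning an argv array into a quoted command line. The quoting rule (POSIX single quotes) and the names shellQuote and renderArgv are assumptions for illustration, not the framework's renderer; the sample argv is made up.

```typescript
// Quote one argument for a POSIX shell; plain tokens pass through unchanged.
function shellQuote(arg: string): string {
  if (/^[A-Za-z0-9_\/:=.,@%+-]+$/.test(arg)) return arg;
  // Close the quote, emit an escaped single quote, reopen the quote.
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Join the quoted arguments into one display string.
function renderArgv(argv: string[]): string {
  return argv.map(shellQuote).join(" ");
}

// Hypothetical argv; the array keeps argument boundaries explicit.
const sampleArgv = ["some-cli", "--note", "two words", "it's"];
console.log(renderArgv(sampleArgv));
// some-cli --note 'two words' 'it'\''s'
```

The array needs no such escaping: each element is exactly one argument, so a consumer reading the JSON never has to re-parse shell syntax.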
Backfilled sidecars
Reports published before the replay system existed carry a
sidecar with backfilled: true. The hashes are
reconstructed from the token + brief contents at the report's
git commit, so ahd verify-replay works against them
as long as nobody has rewritten history. What backfilled blocks
lack:
- argv is [] (the original command was not stored).
- provider_request_ids are empty (lost; never made it into the markdown).
- node_version and platform are "unknown".
- temperature and seed are null.
A backfilled block describes a run you cannot replay verbatim, but it is still verifiable: the hashes pin the inputs to a specific git state.
When schema_version bumps
Bump only on breaking schema changes (renamed
or removed fields, changed semantics of an existing field).
Adding optional fields does not require a bump; consumers should
ignore unknown fields. Removing optional fields is breaking from
the consumer's perspective, so bump. The verifier refuses to
parse a schema_version greater than the version it
was built for, on the principle that a newer schema may have
changed semantics it cannot apply correctly.
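A consumer-side version gate along these lines could be sketched as below. The function name, the error messages and the constant are hypothetical; the only behaviour taken from the text is refusing newer versions and tolerating unknown fields.

```typescript
// The schema version this hypothetical consumer was built for.
const SUPPORTED_SCHEMA_VERSION = 1;

// Accept any block at or below the supported version; unknown fields are ignored.
function assertSupportedSchema(block: Record<string, unknown>): number {
  const version = block.schema_version;
  if (typeof version !== "number" || !Number.isInteger(version) || version < 1) {
    throw new Error(`invalid schema_version: ${String(version)}`);
  }
  if (version > SUPPORTED_SCHEMA_VERSION) {
    throw new Error(
      `schema_version ${version} is newer than supported version ${SUPPORTED_SCHEMA_VERSION}`,
    );
  }
  return version;
}

// An added optional field does not trip the gate.
console.log(assertSupportedSchema({ schema_version: 1, some_future_field: true })); // 1

// A newer version is refused outright.
try {
  assertSupportedSchema({ schema_version: 2 });
} catch (err) {
  console.log((err as Error).message);
}
```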
Verifying a published report
ahd verify-replay docs/evals/2026-04-22-swiss-n30.md
Output is a per-field ok / FAIL list.
Exit code 1 on drift, 0 on clean. CI can use the exit code as a
gate for merging changes that touch tokens or briefs.
Adjacent: how we measure, contribute a run, framework REPLAY.md.