AHD · Replay

Verify the inputs.

Every published AHD eval ships a replay block: enough information for a third party to confirm the run consumed the inputs we say it consumed, and to re-issue the same command at the named version. Replay does not promise bit-for-bit reproduction of model output (frontier providers update silently); it promises that the inputs you can challenge are the inputs the run ate. Verifiability first, replayability second.

Two surfaces, one source of truth

The block lands in two places alongside every report. The first is a fenced ```yaml ahd-replay block at the top of the markdown report (human-readable, derived). The second is a <report>.replay.json sidecar next to the markdown (canonical, schema-validated against replay.schema.json).

The JSON sidecar is authoritative. The markdown is a derived view; redactions or formatting choices in the markdown never reflect back into the JSON.

Fields

Replay sidecar shape (fields shown with example values).

schema_version: 1                # bump on any breaking change
kind: eval-live | critique | eval-image
ahd_version: 0.11.0              # framework version at run time
ahd_commit: <40-hex>             # null when not in a git repo
git_dirty: true | false          # null when ahd_commit is null
node_version: v22.22.2
platform: darwin-arm64
invoked_at: <ISO-8601 UTC>
argv: [ ... ]                    # full process.argv as a list
token:
  path: tokens/swiss-editorial.yml
  hash: sha256:<64-hex>
brief:                           # null on critique runs
  path: briefs/landing.yml
  hash: sha256:<64-hex>
sampling:
  n: 30
  temperature: null              # null when not set; per-call default
  seed: null                     # global seed; null when seeds vary per sample
models:
  - id: cf:@cf/google/gemma-4-26b-a4b-it
    provider: cloudflare-workers-ai
    provider_request_ids: [ "req_..." ]  # one id per successful provider call.
                                         # Empty for CLI-spawned runners
                                         # (claude-code, gemini-cli, codex)
                                         # since there is no HTTP envelope.
conditions:
  requested: [ raw, compiled ]
  effective: [ raw, compiled ]
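
For consumers that read the sidecar from TypeScript, the listing above can be mirrored as a type. This is a hedged sketch, not the framework's own definition: replay.schema.json stays authoritative, and the names InputRef, ReplaySidecar, checkSidecar and example are hypothetical. The commit, hashes and timestamp in the example are placeholders.

```typescript
// Hypothetical type mirroring the field listing; replay.schema.json is authoritative.
type InputRef = { path: string; hash: string };

interface ReplaySidecar {
  schema_version: number;
  kind: "eval-live" | "critique" | "eval-image";
  ahd_version: string;
  ahd_commit: string | null;   // null when not in a git repo
  git_dirty: boolean | null;   // null when ahd_commit is null
  node_version: string;
  platform: string;
  invoked_at: string;          // ISO-8601 UTC
  argv: string[];
  token: InputRef;
  brief: InputRef | null;      // null on critique runs
  sampling: { n: number; temperature: number | null; seed: number | null };
  models: { id: string; provider: string; provider_request_ids: string[] }[];
  conditions: { requested: string[]; effective: string[] };
}

const HASH_RE = /^sha256:[0-9a-f]{64}$/;

// Checks only the cross-field rules stated in the listing's comments.
// Returns a list of problems; an empty list means those rules hold.
function checkSidecar(doc: ReplaySidecar): string[] {
  const problems: string[] = [];
  if (doc.ahd_commit === null && doc.git_dirty !== null) {
    problems.push("git_dirty must be null when ahd_commit is null");
  }
  if (doc.ahd_commit !== null && !/^[0-9a-f]{40}$/i.test(doc.ahd_commit)) {
    problems.push("ahd_commit must be 40 hex characters");
  }
  if (!HASH_RE.test(doc.token.hash)) problems.push("token.hash is malformed");
  if (doc.brief !== null && !HASH_RE.test(doc.brief.hash)) {
    problems.push("brief.hash is malformed");
  }
  if (doc.kind === "critique" && doc.brief !== null) {
    problems.push("brief must be null on critique runs");
  }
  return problems;
}

// Placeholder values only; not taken from a real run.
const example: ReplaySidecar = {
  schema_version: 1,
  kind: "eval-live",
  ahd_version: "0.11.0",
  ahd_commit: "0123456789abcdef0123456789abcdef01234567",
  git_dirty: false,
  node_version: "v22.22.2",
  platform: "darwin-arm64",
  invoked_at: "2026-04-22T10:00:00Z",
  argv: [],
  token: { path: "tokens/swiss-editorial.yml", hash: "sha256:" + "ab".repeat(32) },
  brief: { path: "briefs/landing.yml", hash: "sha256:" + "cd".repeat(32) },
  sampling: { n: 30, temperature: null, seed: null },
  models: [
    { id: "cf:@cf/google/gemma-4-26b-a4b-it", provider: "cloudflare-workers-ai", provider_request_ids: [] },
  ],
  conditions: { requested: ["raw", "compiled"], effective: ["raw", "compiled"] },
};
```

checkSidecar(example) returns an empty list; setting ahd_commit to null while leaving git_dirty as false yields one problem.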

Hash contract

Hashes use SHA-256 in the form sha256:<lower-hex 64>. Two hash modes, one per input shape.

Structured inputs (token, parsed-YAML brief)

The hash is taken over the canonical-JSON serialisation of the resolved object:

1. Parse YAML / JSON to a JS value.
2. Recursively sort object keys lexicographically. Arrays preserve order (their order is semantic).
3. JSON.stringify with no whitespace.
4. SHA-256 the resulting bytes.

Reordering keys, editing comments or changing whitespace in a YAML file leaves the hash unchanged as long as the parsed value is the same. The contract is the parsed value, not the file. Reference implementation: canonicalizeJson + hashJsonCanonical in the framework's src/eval/replay.ts.
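
The four steps can be sketched in a few lines of TypeScript. This is an independent illustration, not a copy of the reference implementation: the names canonicalJson and sha256Canonical are hypothetical, "lexicographic" is assumed to mean JavaScript's default string sort (UTF-16 code unit order), and the serialised string is assumed to be hashed as UTF-8. The serialiser builds the output by hand rather than relying on object insertion order, because JavaScript engines always enumerate integer-like keys first.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialise with sorted object keys and no whitespace. Arrays keep their order.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalJson).join(",") + "]";
  }
  const keys = Object.keys(value).sort(); // assumption: UTF-16 code unit order
  return (
    "{" +
    keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(value[k])).join(",") +
    "}"
  );
}

// SHA-256 over the canonical string (assumed UTF-8), in the sha256:<hex> form.
function sha256Canonical(value: Json): string {
  const digest = createHash("sha256").update(canonicalJson(value), "utf8").digest("hex");
  return "sha256:" + digest;
}
```

With this sketch, { b: 1, a: [2, 1] } and { a: [2, 1], b: 1 } produce the same hash, while swapping the array to [1, 2] produces a different one.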

Raw-bytes inputs (markdown briefs)

When the brief is plain markdown (no parser involved), the hash is taken over the exact file bytes. verify-replay tries the raw-bytes hash first; if that does not match, it tries the canonical-JSON hash in case the brief is structured. The two verification paths mirror the two hash modes the helper supports when writing the block; the second path is not a recovery mechanism for malformed input.
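
The order of the two checks can be sketched as follows. This is a minimal illustration under stated assumptions, not the verifier's actual code: matchRecordedHash, sha256Of and canonical are hypothetical names, and the parser is passed in as a parameter so the sketch stays dependency-free (a real verifier would presumably pass a YAML parser).

```typescript
import { createHash } from "node:crypto";

const sha256Of = (data: string | Uint8Array): string =>
  "sha256:" + createHash("sha256").update(data).digest("hex");

// Compact canonical form: sorted object keys, arrays in order, no whitespace.
function canonical(v: unknown): string {
  if (Array.isArray(v)) return "[" + v.map((item) => canonical(item)).join(",") + "]";
  if (v !== null && typeof v === "object") {
    const o = v as Record<string, unknown>;
    const keys = Object.keys(o).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(o[k])).join(",") + "}";
  }
  return JSON.stringify(v);
}

type Match = "raw-bytes" | "canonical-json" | "drift";

// Try the raw-bytes hash first, then the canonical-JSON hash of the parsed value.
function matchRecordedHash(
  bytes: Uint8Array,
  recorded: string,
  parse: (text: string) => unknown, // hypothetical hook, e.g. a YAML parser
): Match {
  if (sha256Of(bytes) === recorded) return "raw-bytes";
  try {
    const parsed = parse(Buffer.from(bytes).toString("utf8"));
    if (sha256Of(canonical(parsed)) === recorded) return "canonical-json";
  } catch {
    // Not parseable as a structured input; only the raw-bytes mode applies.
  }
  return "drift";
}
```

A markdown brief that fails to parse simply never reaches the second comparison, so a parse error is reported as drift only when the raw-bytes hash has already failed to match.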

What changes between runs

| Field | Stable across runs of the same command? |
| --- | --- |
| `token.hash` | Yes, until the token file is edited. |
| `brief.hash` | Yes, until the brief is edited. |
| `ahd_commit` | Yes, until the framework moves. |
| `invoked_at` | No (per-run wall clock). |
| `provider_request_ids` | No (provider-assigned per call). |
| `sampling.n` / `models` / `conditions` | Yes, command-controlled. |

When ahd verify-replay reports drift, one of the stable fields is no longer stable: the token or brief on disk hashes to something other than the recorded value. The verifier walks up from the report's path to find the framework's repo root (package.json with name @adastracomputing/ahd), so it works from any cwd or checkout.
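
The walk-up lookup can be sketched like this. It is an assumption-laden illustration rather than the verifier's code: findRepoRoot and readPackageName are hypothetical names, and the package-name reader is injectable so the logic can be exercised without touching the disk.

```typescript
import { readFileSync } from "node:fs";
import { dirname, join, resolve } from "node:path";

// Returns the "name" field of <dir>/package.json, or null if there is none.
function readPackageName(dir: string): string | null {
  try {
    const pkg: unknown = JSON.parse(readFileSync(join(dir, "package.json"), "utf8"));
    if (typeof pkg === "object" && pkg !== null && "name" in pkg) {
      const name = (pkg as { name: unknown }).name;
      return typeof name === "string" ? name : null;
    }
  } catch {
    // Missing, unreadable or malformed package.json: treat as "no package here".
  }
  return null;
}

// Walk from startPath towards the filesystem root until a matching package is found.
function findRepoRoot(
  startPath: string,
  packageName = "@adastracomputing/ahd",
  readName: (dir: string) => string | null = readPackageName,
): string | null {
  let dir = resolve(startPath);
  for (;;) {
    if (readName(dir) === packageName) return dir;
    const parent = dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}
```

Starting from the report file itself is harmless in this sketch: a file path has no package.json beneath it, so the first iteration falls through to the parent directory.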

What replay does not guarantee

Bit-for-bit reproduction: Frontier providers update models silently; running the same command at the same git commit may produce different samples a week later. The block is a verifiability contract first and a replayability contract second.
Provider-side audit: The provider_request_ids array holds one id per successful provider call: anthropic request-id, openai x-request-id, cloudflare cf-ray, google x-goog-request-id. With those ids you can ask the provider to verify a specific request existed at the recorded time. AHD does not save the provider's response payload, so the request id alone is not enough to recover the response. CLI-spawned runners (claude-code, gemini-cli, codex) leave the array empty by design; there is no HTTP envelope to read.
Determinism inside the runner: AHD's per-sample seed is i+1 today (incremental, not cryptographic). Different n yields different sets of seeds. Known limitation; future versions may capture per-sample seeds.
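
To make the seed limitation concrete, here is a tiny sketch of the i+1 rule. It assumes samples are indexed from i = 0, which the text does not state, and sampleSeeds is a hypothetical name.

```typescript
// Assumption: samples are indexed i = 0 .. n-1, so seeds run 1 .. n.
function sampleSeeds(n: number): number[] {
  return Array.from({ length: n }, (_, i) => i + 1);
}

const small = sampleSeeds(3); // [1, 2, 3]
const large = sampleSeeds(5); // [1, 2, 3, 4, 5]
```

Under that assumption the seed set depends only on n: a run with a smaller n uses a prefix of the seeds of a larger run, and nothing in the sidecar records the per-sample seeds individually.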

Markdown redactions

The markdown rendering surfaces only a count (provider_request_ids: 3 captured) rather than the values, until each provider's ids are confirmed safe to publish. The full ids live in the JSON sidecar; if a published report's .replay.json is committed to a public repo, the ids are public.

The argv field is rendered as a quoted shell command in the markdown (in the trailing "replay this run" block) but stored as an array in the JSON to avoid quoting ambiguity.
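
One way to derive the markdown command line from the stored array is POSIX-style single quoting. The sketch below is an assumption about how such a rendering could work, not the framework's renderer; shellQuote and renderArgv are hypothetical names, and the sample arguments are made up.

```typescript
// Quote one argument for a POSIX shell. Plain tokens pass through unchanged.
function shellQuote(arg: string): string {
  if (arg !== "" && /^[A-Za-z0-9_\/:=.,@%+-]+$/.test(arg)) return arg;
  // Close the quote, emit an escaped single quote, reopen the quote.
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Join an argv array into a single display string.
function renderArgv(argv: string[]): string {
  return argv.map(shellQuote).join(" ");
}

// Made-up arguments, for illustration only.
const rendered = renderArgv(["node", "cli.js", "my brief.yml"]);
// rendered is: node cli.js 'my brief.yml'
```

The array form needs no such rules: each element is one argument, whatever characters it contains, which is why the JSON keeps it unquoted.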

Backfilled sidecars

Reports published before the replay system existed carry a sidecar with backfilled: true. The hashes are reconstructed from the token + brief contents at the report's git commit, so ahd verify-replay works against them as long as nobody has rewritten history. What backfilled blocks lack:

argv is [] (the original command was not stored).
provider_request_ids are empty (lost; never made it into the markdown).
node_version and platform are "unknown".
temperature and seed are null.

A backfilled block describes a run you cannot replay verbatim, but it is still verifiable: the hashes pin the inputs to a specific git state.

When `schema_version` bumps

Bump only on breaking schema changes (renamed or removed fields, changed semantics of an existing field). Adding optional fields does not require a bump; consumers should ignore unknown fields. Removing optional fields is breaking from the consumer's perspective, so bump. The verifier refuses to parse a schema_version greater than the version it was built for, on the principle that a newer schema version may have changed semantics it cannot apply correctly.
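
A consumer following these rules might gate on the version before reading anything else. This is a minimal sketch under assumptions: assertReadable and SUPPORTED_SCHEMA_VERSION are hypothetical names, and the error messages are illustrative rather than the verifier's real output.

```typescript
// The highest schema_version this (hypothetical) consumer was built for.
const SUPPORTED_SCHEMA_VERSION = 1;

// Throws if the document's schema_version is missing, invalid or too new.
// Unknown fields are deliberately not inspected, so additive changes pass.
function assertReadable(doc: Record<string, unknown>): number {
  const version = doc.schema_version;
  if (typeof version !== "number" || !Number.isInteger(version) || version < 1) {
    throw new Error("schema_version must be a positive integer");
  }
  if (version > SUPPORTED_SCHEMA_VERSION) {
    throw new Error(
      `schema_version ${version} is newer than supported version ${SUPPORTED_SCHEMA_VERSION}`,
    );
  }
  return version;
}
```

A document with schema_version 1 and an unrecognised optional field is accepted; a document with schema_version 2 is refused outright rather than partially interpreted.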

Verifying a published report

Re-hash the inputs of a published run.

ahd verify-replay docs/evals/2026-04-22-swiss-n30.md

Output is a per-field ok / FAIL list. Exit code 1 on drift, 0 on clean. CI can use the exit code as a gate for merging changes that touch tokens or briefs.
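
A CI step could build on the exit code along these lines. The sketch assumes the ahd binary is on PATH and uses only the documented verify-replay invocation; failedReports, runVerifyReplay and the Runner type are hypothetical, and the runner is injectable so the gating logic can be tested without the CLI installed.

```typescript
import { spawnSync } from "node:child_process";

// Runs a verification for one report and returns its exit status (null if killed).
type Runner = (reportPath: string) => number | null;

// Default runner: shells out to the documented command.
const runVerifyReplay: Runner = (reportPath) =>
  spawnSync("ahd", ["verify-replay", reportPath], { stdio: "inherit" }).status;

// Returns the reports whose verification did not exit 0.
function failedReports(reports: string[], run: Runner = runVerifyReplay): string[] {
  return reports.filter((reportPath) => run(reportPath) !== 0);
}
```

A gate script would call failedReports with the reports touched by a change and set process.exitCode to 1 when the returned list is non-empty.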

Adjacent: how we measure, contribute a run, framework REPLAY.md.