Contribute a run.
Every row on the evals index is a measured
claim, not an opinion. If you have budget, keys, and a model
we haven't measured yet, you can add to the record. This page
is the short version. The full submission protocol, licence
terms and governance live in the framework repo's
CONTRIBUTING.md.
What a submittable run looks like
- Full manifest, not just rendered samples. The output directory from ahd eval-live or ahd eval-image is the unit of submission: canonical model identifier, serving path, exact compiled prompt bytes, seed, temperature, and per-sample JSON envelopes with attempted bytes, finish reason and token usage. HTML alone is not enough. The envelope is what proves the sample came from the model rather than from a text editor.
- Provider request-IDs, where the provider exposes one. Anthropic's request-id, OpenAI's x-request-id, Cloudflare's cf-ray. The runner captures these automatically. A maintainer can use them to verify the request actually occurred without needing to re-run the eval. Subscription-CLI cells are exempt, with a documented caveat, because the provider HTTP layer is hidden behind the CLI.
- At least three samples per cell, same prompt, different seeds. Single-sample cells are not submittable. A stochastic model returning identical output across three seeds is the canonical fabrication fingerprint; running at n ≥ 3 makes that fingerprint visible without expensive review.
- No post-processing on the samples. Don't reformat HTML, strip whitespace, pretty-print or re-wrap. The envelope should reflect the model's raw output. If the runner's extractor pulled HTML from a larger response, keep the full raw_response field.
- Negative results reported with the same detail as positive ones. A cell where compiled lost to raw, where extraction failed, or where the model produced zero-byte output is first-class data. Don't drop cells to make the report look cleaner. If a cell failed for a serving-layer reason, say so and link to the relevant entry in the serving-tells catalog.
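For orientation, a submitter's local sanity check on one envelope might look like the sketch below. The field names in REQUIRED are illustrative assumptions about the manifest layout, not the runner's actual schema; check the output of your own run for the authoritative shape.

```python
import json

# Illustrative required-field set; names are assumptions, not the
# runner's real schema.
REQUIRED = {
    "model_id",       # canonical model identifier
    "seed",           # sampling seed for this sample
    "temperature",
    "prompt_bytes",   # exact compiled prompt bytes (or a hash of them)
    "raw_response",   # full raw model output, pre-extraction
    "finish_reason",
    "token_usage",
}

def check_envelope(path: str) -> list[str]:
    """Return the required fields missing from one sample envelope."""
    with open(path, encoding="utf-8") as f:
        envelope = json.load(f)
    # dict.keys() is set-like, so set difference works directly.
    return sorted(REQUIRED - envelope.keys())
```

An empty return means the envelope at least has the fields a reviewer needs to trace the sample back to a real request.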
How review works
A maintainer spot-checks two or three randomly selected samples by re-running the manifest against the same provider. The goal is distributional agreement, not bit-exact reproduction; sampling is stochastic. If the re-run produces a tell-count meaningfully outside the submitted distribution, we ask for clarification. Where request-IDs are present, we may use them to confirm the requests occurred. We don't automate re-run-on-PR in CI; that would need maintainer-held keys and budget, and we'd rather spend both on new eval axes than on PR verification.
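A minimal sketch of what a distributional-agreement check could look like, assuming tell-counts have already been extracted per sample. The two-sigma band is an invented threshold for illustration, not the maintainers' actual rule.

```python
from statistics import mean, stdev

def agrees(submitted_tells: list[int], rerun_tells: list[int],
           sigmas: float = 2.0) -> bool:
    """True if the re-run's mean tell-count falls within `sigmas`
    standard deviations of the submitted distribution's mean.
    The threshold is an illustrative assumption."""
    mu, sd = mean(submitted_tells), stdev(submitted_tells)
    return abs(mean(rerun_tells) - mu) <= sigmas * sd

# Note: a submitted cell with identical tell-counts across seeds has
# sd == 0, so any re-run deviation at all fails the check -- which is
# consistent with identical output being the fabrication fingerprint.
```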
What we won't accept
- Rendered HTML or screenshots without the envelope manifest.
- Reports with n = 1 per cell.
- Reports that drop cells where compiled lost.
- Reports from a provider or model version our runner cannot address with a canonical identifier.
- "Summary" reports that elide attempted-vs-scored counts.
Where to submit
Open a pull request at github.com/Ad-Astra-Computing/ahd with the
report at docs/evals/<YYYY-MM-DD>-<token-id>.md and the manifest
directory committed alongside. The PR URL is the canonical
submission surface; include a link to the PR in the report's
header so reviewers can trace the submission. Sign off each
commit with git commit -s per the DCO requirement documented in
CONTRIBUTING.md.
Canonical source for this protocol is
CONTRIBUTING.md
in the framework repo. If this page drifts from that file,
the file wins. Adjacent reading:
the evals index,
methodology.