Contribute a run.
Every row on the evals index is a measured
claim, not an opinion. If you have budget, keys, and a model
we haven't measured yet, you can add to the record. This page
is the short version. The full submission protocol, licence
terms and governance live in the framework repo's
CONTRIBUTING.md.
What a submittable run looks like
- Full manifest, not just rendered samples. The output directory from ahd eval-live or ahd eval-image is the unit of submission: canonical model identifier, serving path, exact compiled prompt bytes, seed, temperature, and per-sample JSON envelopes with attempted bytes, finish reason and token usage. HTML alone is not enough. The envelope is what proves the sample came from the model rather than from a text editor.
- Provider request-IDs, where the provider exposes one. Anthropic's request-id, OpenAI's x-request-id, Cloudflare's cf-ray. The runner captures these automatically. A maintainer can use them to verify the request actually occurred without needing to re-run the eval. Subscription-CLI cells are exempt, with a documented caveat, because the provider HTTP layer is hidden behind the CLI.
- At least three samples per cell, same prompt, different seeds. Single-sample cells are not submittable. A stochastic model returning identical output across three seeds is the canonical fabrication fingerprint; running at n ≥ 3 makes that fingerprint visible without expensive review.
- No post-processing on the samples. Don't reformat HTML, strip whitespace, pretty-print or re-wrap. The envelope should reflect the model's raw output. If the runner's extractor pulled HTML from a larger response, keep the full raw_response field.
- Negative results reported with the same detail as positive ones. A cell where compiled lost to raw, where extraction failed, or where the model produced zero-byte output is first-class data. Don't drop cells to make the report look cleaner. If a cell failed for a serving-layer reason, say so and link to the relevant entry in the serving-tells catalog.
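For orientation, a submitter's local sanity check on one envelope might look like the sketch below. The field names in REQUIRED are illustrative assumptions about the manifest layout, not the runner's actual schema; check the output of your own run for the authoritative shape.

```python
import json

# Illustrative required-field set; names are assumptions, not the
# runner's real schema.
REQUIRED = {
    "model_id",       # canonical model identifier
    "seed",           # sampling seed for this sample
    "temperature",
    "prompt_bytes",   # exact compiled prompt bytes (or a hash of them)
    "raw_response",   # full raw model output, pre-extraction
    "finish_reason",
    "token_usage",
}

def check_envelope(path: str) -> list[str]:
    """Return the required fields missing from one sample envelope."""
    with open(path, encoding="utf-8") as f:
        envelope = json.load(f)
    # dict.keys() is set-like, so set difference works directly.
    return sorted(REQUIRED - envelope.keys())
```

An empty return means the envelope at least has the fields a reviewer needs to trace the sample back to a real request.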
How review works
A maintainer spot-checks two or three randomly selected samples by re-running the manifest against the same provider. The goal is distributional agreement, not bit-exact reproduction; sampling is stochastic. If the re-run produces a tell-count meaningfully outside the submitted distribution, we ask for clarification. Where request-IDs are present, we may use them to confirm the requests occurred. We don't automate re-run-on-PR in CI; that would need maintainer-held keys and budget, and we'd rather spend both on new eval axes than on PR verification.
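A minimal sketch of what a distributional-agreement check could look like, assuming tell-counts have already been extracted per sample. The two-sigma band is an invented threshold for illustration, not the maintainers' actual rule.

```python
from statistics import mean, stdev

def agrees(submitted_tells: list[int], rerun_tells: list[int],
           sigmas: float = 2.0) -> bool:
    """True if the re-run's mean tell-count falls within `sigmas`
    standard deviations of the submitted distribution's mean.
    The threshold is an illustrative assumption."""
    mu, sd = mean(submitted_tells), stdev(submitted_tells)
    return abs(mean(rerun_tells) - mu) <= sigmas * sd

# Note: a submitted cell with identical tell-counts across seeds has
# sd == 0, so any re-run deviation at all fails the check -- which is
# consistent with identical output being the fabrication fingerprint.
```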
What we won't accept
- Rendered HTML or screenshots without the envelope manifest.
- Reports with n = 1 per cell.
- Reports that drop cells where compiled lost.
- Reports from a provider or model version our runner cannot address with a canonical identifier.
- "Summary" reports that elide attempted-vs-scored counts.
Where to submit
Open a pull request at github.com/Ad-Astra-Computing/ahd with the
report at docs/evals/<YYYY-MM-DD>-<token-id>.md and the manifest
directory committed alongside. The PR URL is the canonical
submission surface; include a link to the PR in the report's
header so reviewers can trace the submission. Sign off each
commit with git commit -s per the DCO requirement documented in
CONTRIBUTING.md.
Canonical source for this protocol is
CONTRIBUTING.md
in the framework repo. If this page drifts from that file,
the file wins. Adjacent reading:
the evals index,
methodology.