AHD Artificial Human Design


Contribute a run.

Every row on the evals index is a measured claim, not an opinion. If you have budget, keys, and a model we haven't measured yet, you can add to the record. This page is the short version. The full submission protocol, licence terms and governance live in the framework repo's CONTRIBUTING.md.

What a submittable run looks like

  1. Full manifest, not just rendered samples.

    The output directory from ahd eval-live or ahd eval-image is the unit of submission. It must include the canonical model identifier, serving path, exact compiled prompt bytes, seed, temperature, and per-sample JSON envelopes with attempted bytes, finish reason and token usage. HTML alone is not enough: the envelope is what proves the sample came from the model rather than from a text editor.

  2. Provider request-IDs, where the provider exposes one.

    Anthropic's request-id, OpenAI's x-request-id, Cloudflare's cf-ray. The runner captures these automatically. A maintainer can use them to verify the request actually occurred without needing to re-run the eval. Subscription-CLI cells are exempt with a documented caveat, because the provider HTTP layer is hidden behind the CLI.

  3. At least three samples per cell, same prompt, different seeds.

    Single-sample cells are not submittable. A stochastic model returning identical output across three seeds is the canonical fabrication fingerprint; requiring n ≥ 3 makes that fingerprint visible without expensive review.

  4. No post-processing on the samples.

    Don't reformat HTML, strip whitespace, pretty-print or re-wrap. The envelope should reflect the model's raw output. If the runner's extractor pulled HTML from a larger response, keep the full raw_response field.

  5. Negative results report with the same detail as positive ones.

    A cell where compiled lost to raw, or where extraction failed, or where the model produced zero-byte output, is first-class data. Don't drop cells to make the report look cleaner. If a cell failed for a serving-layer reason, say so and link to the relevant entry in the serving-tells catalog.
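The envelope described in item 1 can be sketched as a plain dictionary. The field names below are illustrative assumptions mirroring what the protocol calls out, not the runner's actual schema, and the self-check is a hypothetical helper, not part of the ahd CLI.

```python
# Hypothetical per-sample envelope shape; the real schema is defined by
# the runner. This sketch only mirrors the fields the protocol names.
sample_envelope = {
    "model": "provider/model-id",          # canonical model identifier
    "serving_path": "api",                 # how the model was reached
    "prompt_sha256": "deadbeef",           # hash of the exact compiled prompt bytes
    "seed": 7,
    "temperature": 1.0,
    "attempted_bytes": 14212,
    "finish_reason": "stop",
    "usage": {"input_tokens": 1830, "output_tokens": 3544},
    "request_id": "req_abc123",            # provider request-ID, where exposed
    "raw_response": "<!doctype html>...",  # full raw output, no post-processing
}

def missing_fields(env: dict) -> list:
    """Return the required fields a submission envelope is missing."""
    required = ["model", "seed", "temperature",
                "finish_reason", "usage", "raw_response"]
    return [k for k in required if k not in env]

print(missing_fields(sample_envelope))  # [] when the envelope is complete
```

A check like this belongs in the submitter's own pre-PR routine; it catches the "HTML alone" failure mode before a maintainer has to.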
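The n ≥ 3 rule in item 3 can be enforced mechanically. A minimal sketch, assuming each sample's raw output is available as a string:

```python
def fingerprint_check(samples: list) -> str:
    """Flag the two cell-level problems the protocol names:
    too few samples, or identical output across distinct seeds."""
    if len(samples) < 3:
        return "under-sampled: need n >= 3 per cell"
    if len(set(samples)) == 1:
        return "fabrication fingerprint: identical output across all seeds"
    return "ok"

print(fingerprint_check(["<html>a", "<html>b", "<html>c"]))  # ok
print(fingerprint_check(["<html>a", "<html>a", "<html>a"]))  # fingerprint flagged
```

Note this only detects byte-identical repeats; near-duplicates from a low-temperature run would pass, which is why the review step below still re-runs the manifest.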

How review works

A maintainer spot-checks two or three randomly selected samples by re-running the manifest against the same provider. The goal is distributional agreement, not bit-exact reproduction; sampling is stochastic. If the re-run produces a tell-count meaningfully outside the submitted distribution, we ask for clarification. Where request-IDs are present, we may use them to confirm the requests occurred. We don't automate re-run-on-PR in CI; that would need maintainer-held keys and budget, and we'd rather spend both on new eval axes than on PR verification.
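"Distributional agreement" could be checked along these lines. The tolerance, the k multiplier, and the idea of comparing mean tell-counts are all assumptions for illustration, not the maintainers' documented procedure:

```python
from statistics import mean, stdev

def agrees(submitted: list, rerun: list, k: float = 3.0) -> bool:
    """Accept when the re-run's mean tell-count lies within k standard
    deviations of the submitted distribution (loose, never bit-exact)."""
    mu, sigma = mean(submitted), stdev(submitted)
    # Floor sigma at 1.0 so a tightly clustered submission doesn't
    # produce a zero-width acceptance band.
    return abs(mean(rerun) - mu) <= k * max(sigma, 1.0)

print(agrees([4, 5, 6], [5, 6, 7]))      # True: overlapping distributions
print(agrees([0, 1, 0], [12, 14, 13]))   # False: re-run far outside submission
```

The asymmetry is deliberate: a noisy-but-overlapping re-run passes, while a re-run whose tell-counts sit well outside the submitted range triggers the clarification request described above.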

What we won't accept

In short, the inverse of the rules above: rendered HTML without the manifest and per-sample envelopes behind it; single-sample cells; samples that have been reformatted, pretty-printed or otherwise post-processed; and reports that drop failed or negative cells to look cleaner. This list is a summary, not new policy; each item is already ruled out in the section above.

Where to submit

Open a pull request at github.com/Ad-Astra-Computing/ahd with the report at docs/evals/<YYYY-MM-DD>-<token-id>.md and the manifest directory committed alongside. The PR URL is the canonical submission surface; include a link to the PR in the report's header so reviewers can trace the submission. Sign off each commit with git commit -s per the DCO requirement documented in CONTRIBUTING.md.

Canonical source for this protocol is CONTRIBUTING.md in the framework repo. If this page drifts from that file, the file wins. Adjacent reading: the evals index, methodology.