Vela


AI-for-science benchmark state

constellation seal · derived from vfr_efc649fd772a1ff1
id
vfr_efc649fd772a1ff1
license
CC-BY-4.0
findings
12
accepted core
12
contested
0
links
0
sources
1
evidence
12
avg conf
0.30

used by 0 · replayed by 1 producer · second seat open

e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f

Reviewable change


finding.note

verified: A frozen deterministic verifier re-checked the claim and passed.

accepted

BENCHMARK META (MiniF2F). MiniF2F is ~488 olympiad/textbook formal-math problems (AMC/AIME/IMO + MATH), ported to Lean/Isabelle/HOL-Light/Metamath, split valid/test. KNOWN TRUST ISSUE: multiple incompatible versions exist (original 2021, miniF2F-v2, and the 'miniF2F Revisited' cleanup with corrected/changed statements), so pass-rates across papers are version-ambiguous unless the exact split is pinned. STATE: dataset-version hazard, not a model claim.
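
The version hazard described in this finding can be made concrete with a minimal sketch of what "pinning the exact split" might look like. Everything here is an illustrative assumption rather than part of the record or of any MiniF2F tooling: the `BenchmarkPin` and `PassRate` types, their field names, the variant labels, and the numbers in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkPin:
    """Hypothetical record of exactly which benchmark data a result refers to."""
    variant: str   # illustrative label, e.g. "original-2021" or "v2"
    language: str  # formal system, e.g. "lean" or "isabelle"
    split: str     # "valid" or "test"
    revision: str  # commit hash or content digest of the statement files


@dataclass(frozen=True)
class PassRate:
    """A reported result together with the pin it was measured against."""
    pin: BenchmarkPin
    solved: int
    total: int

    @property
    def rate(self) -> float:
        return self.solved / self.total


def comparable(a: PassRate, b: PassRate) -> bool:
    """Two results are comparable only if every pinned field matches."""
    return a.pin == b.pin


def rate_difference(a: PassRate, b: PassRate) -> float:
    """Return b.rate - a.rate, refusing to compare across different pins."""
    if not comparable(a, b):
        raise ValueError(f"version-ambiguous comparison: {a.pin} vs {b.pin}")
    return b.rate - a.rate
```

As a usage illustration with made-up values: two results that share `BenchmarkPin("original-2021", "lean", "test", "abc123")` can be passed to `rate_difference`, while a result pinned to a different variant or revision raises `ValueError` instead of producing a misleading delta.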

id
vpr_2ad76a3dce783d96
frontier
AI-for-science benchmark state
kind
finding.note
created
2026-06-10
state
7bfbae3a → 3c19d8dd

accept gate

2 of 4 on record
signature
reviewer:will-blair · key 4892f938
chain
7bfbae3a → 3c19d8dd
witness
no verifier attachment on record for this target
grade
in state · unreviewed

timeline

  1. 2026-06-10 · propose · proposed · finding.note
     actor: agent:hardening-2026-06-10 (agent: machine actor, no signing key)
     proposal: vpr_2ad76a3dce783d96
     reason: HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
  2. 2026-06-10 · accept · finding.noted
     actor: reviewer:will-blair
     state: 7bfbae3a → 3c19d8dd
     event: vev_f032a45ff0886024
     reason: same text as step 1.

proposed

reason

HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
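
The capping rule in this reason can be sketched as a small function. This is only an illustration under stated assumptions: the record names `attested` provenance and a `verified` grade, but the two-rung `Trust` enum, its ordering, and the `effective_grade` helper are hypothetical and are not Vela's actual trust ladder or API.

```python
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical two-rung ladder; the real ladder may have more rungs."""
    ATTESTED = 1  # ground truth taken from an answer key (records, not reruns)
    VERIFIED = 2  # claim rederived by a frozen deterministic verifier


def effective_grade(requested: Trust,
                    label_provenance: str,
                    has_rederivation: bool) -> Trust:
    """Cap the grade below VERIFIED while provenance is only attested."""
    if label_provenance == "attested" and not has_rederivation:
        # Attested labels cannot exceed the ATTESTED rung on their own.
        return min(requested, Trust.ATTESTED)
    return requested
```

Under these assumptions, a request for `Trust.VERIFIED` with `label_provenance="attested"` and no rederivation resolves to `Trust.ATTESTED`, and the cap lifts once a deterministic rederivation is on record.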

provenance

proposed by

agent:hardening-2026-06-10 (agent: machine actor, no signing key)

actor type

human

created at

2026-06-10

target type

finding

vf_cf89ac0f36e62089

Diff

Read-only frontier; diff not recomputed.

Review chain

  1. 01 request

    Change request

    AI-for-science benchmark state receives a reviewable source, finding, caveat, replication, evaluation, or proof-affecting edit.
  2. 02 packet

    Diff packet

    The packet names affected record objects, evidence, rationale, reviewer-facing fields, and expected proof impact.
  3. 03 checks

    Check output

    Schema, provenance, benchmark, contradiction, and proof checks decide whether the request is ready to read.
  4. 04 review

    Reviewer decision

    A steward accepts, rejects, caveats, revises, or retracts the request under an inspectable identity.
  5. 05 accepted

    Accepted event

    Only the accepted event mutates frontier state. Atlases, constellations, and search update from that record state.
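
The rule that only an accepted event mutates frontier state resembles an event-sourced reducer. The sketch below illustrates that idea only; the `Event` type, its fields, the `"propose"`/`"accept"` kind strings, and the dictionary-shaped state are assumptions made for the example, not Vela's data model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape: what happened, to which record, with what note."""
    kind: str    # assumed values: "propose" or "accept"
    target: str  # id of the record the event refers to
    note: str    # text attached to the record if the event is accepted


State = Dict[str, List[str]]  # record id -> accepted notes, in order


def apply_event(state: State, event: Event) -> State:
    """Return the next state; every kind other than "accept" is a no-op."""
    if event.kind != "accept":
        return state
    # Copy before writing so earlier states stay replayable.
    next_state = {key: list(notes) for key, notes in state.items()}
    next_state.setdefault(event.target, []).append(event.note)
    return next_state


def replay(events: Iterable[Event]) -> State:
    """Rebuild the state from scratch by folding over the event log."""
    state: State = {}
    for event in events:
        state = apply_event(state, event)
    return state
```

In this sketch, replaying a log that contains only a proposal yields an empty state, and the note appears only after the matching accept event, which is the property the final step of the chain describes.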

finding.noted · reviewer:will-blair · 1 day

renders the record as of vev_d199cb2e · 1,338 events · hub
