AI-for-science benchmark state

id: vfr_efc649fd772a1ff1
license: CC-BY-4.0
findings: 12
accepted core: 12
contested: 0
links: 0
sources: 1
evidence: 12
avg conf: 0.30

used by 0 · replayed by 1 producer · second seat open

e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f

Evidence atom

BENCHMARK CLAIM (miniF2F): Draft-Sketch-Prove (DSP) REPORTS an improved miniF2F-test pass rate from a three-stage pipeline: draft an informal proof, sketch a formal skeleton, then close the remaining gaps with an automated theorem prover (ATP). VERIFICATION STATE: author-reported; pipeline described; the result depends on the underlying ATP and the autoformalizer, both of which drift across versions. NOT re-run here. Open obligation: reproduce with pinned ATP + LLM versions.
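The three stages named in the claim, and the open obligation to pin versions, can be sketched as below. This is a minimal illustration and not the DSP authors' code: `PinnedVersions`, `FormalSketch`, `draft_sketch_prove`, `pass_rate`, and the `toy_*` stand-ins are all hypothetical names introduced here, and no real LLM or prover is called.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PinnedVersions:
    """Hypothetical record of what a reproduction would pin."""
    llm: str  # exact model identifier used for drafting and sketching
    atp: str  # exact prover build used to close gaps


@dataclass
class FormalSketch:
    skeleton: str         # formal proof outline with holes
    open_gaps: List[str]  # subgoals left for the ATP


def draft_sketch_prove(
    statement: str,
    draft: Callable[[str], str],
    sketch: Callable[[str, str], FormalSketch],
    close_gap: Callable[[str], bool],
) -> bool:
    """Return True only if every gap in the sketch is closed by the ATP."""
    informal_proof = draft(statement)                   # stage 1: draft
    formal_sketch = sketch(statement, informal_proof)   # stage 2: sketch
    return all(close_gap(g) for g in formal_sketch.open_gaps)  # stage 3: prove


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of problems solved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy stand-ins so the sketch runs end to end (assumptions, not real tools).
def toy_draft(statement: str) -> str:
    return f"informal argument for: {statement}"


def toy_sketch(statement: str, informal: str) -> FormalSketch:
    return FormalSketch(skeleton=f"lemma: {statement}", open_gaps=["gap_1", "gap_2"])


def toy_atp(gap: str) -> bool:
    return gap != "gap_2"  # pretend the prover fails on one gap


pins = PinnedVersions(llm="<pinned-llm-id>", atp="<pinned-atp-build>")
outcomes = [draft_sketch_prove("a + 0 = a", toy_draft, toy_sketch, toy_atp)]
```

Because a problem counts as solved only when every gap closes, a change in either the sketching model or the prover can flip individual outcomes, which is why a re-run would need both recorded in something like `pins`.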

id: vea_1c96c2bca6cabe6d
frontier: AI-for-science benchmark state
source: vs_066123dd29a9c5b4
finding: vf_368ec6ffb5747092

evidence boundary

unknown

computational

finding binding

bound

computational

BENCHMARK CLAIM (miniF2F): Draft-Sketch-Prove (DSP) REPORTS an improved miniF2F-test pass rate from a three-stage pipeline: draft an informal proof, sketch a formal skeleton, then close the remaining gaps with an automated theorem prover (ATP). VERIFICATION STATE: author-reported; pipeline described; the result depends on the underlying ATP and the autoformalizer, both of which drift across versions. NOT re-run here. Open obligation: reproduce with pinned ATP + LLM versions.

source binding

source-bound

manual finding

vs_066123dd29a9c5b4

review context

unverified

2 events

2 reviewable changes and 0 evaluation records target this atom or its bound objects.

statement

extraction method

manual_curation

support relation

unknown

condition refs

vcnd_1e6c4622a75133ac

caveats

missing evidence locator

Review, event, and evaluation records

events

vev_2e1f4d109d1a1f73 · finding.noted
HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_e5d45a5605897295 · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
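The capping rule in the hardening note can be illustrated with a short sketch. The rung names and their order in `LADDER` are assumptions made for illustration (the record only confirms that attested provenance sits below 'verified'), and `capped_status` is a hypothetical helper, not part of any real Vela API.

```python
# Hypothetical rung ordering, lowest trust first. Only the top rung,
# 'verified', is named in the trust-ladder note itself.
LADDER = ("unverified", "author-reported", "verified")


def capped_status(requested: str, label_provenance: str,
                  has_deterministic_rederivation: bool) -> str:
    """Apply the attested-provenance cap to a requested status.

    With label_provenance == "attested" and no deterministic rederivation,
    the result can never reach 'verified'; otherwise the request stands.
    """
    if label_provenance == "attested" and not has_deterministic_rederivation:
        ceiling = LADDER.index("verified") - 1  # highest rung below 'verified'
        return LADDER[min(LADDER.index(requested), ceiling)]
    return requested


# The record above: attested labels, no rederivation yet.
status_now = capped_status("verified", "attested", False)
# The same record once a deterministic rederivation exists.
status_later = capped_status("verified", "attested", True)
```

Under these assumptions the cap only lowers a status and never raises one, so a record already at 'unverified' is left where it is.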

reviewable changes

vpr_6a1b9f61788f93f0 · finding.note
HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_f08818383df2a902 · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10

evaluations

No evaluation rows are attached.
