AI-for-science benchmark state

id: vfr_efc649fd772a1ff1
license: CC-BY-4.0
findings: 12
accepted core: 12
contested: 0
links: 0
sources: 1
evidence: 12
avg conf: 0.30

used by 0 · replayed by 1 producer · second seat open

e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f

Reviewable change

add a finding

accepted

BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.
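The contamination audit named in the open obligation could start with an exact-match check between test and training statements. The sketch below is an illustration only, not the team's harness: the function names are hypothetical, statements are assumed to be available as plain strings, and matching is done after stripping Lean line comments and collapsing whitespace. It will miss renamed variables and reordered hypotheses, so a clean result here would not by itself discharge the obligation.

```python
import hashlib
import re


def normalize(statement: str) -> str:
    """Drop Lean line comments ("-- ...") and collapse runs of whitespace."""
    no_comments = re.sub(r"--.*", "", statement)
    return re.sub(r"\s+", " ", no_comments).strip()


def fingerprint(statement: str) -> str:
    """Stable hash of the normalized statement text."""
    return hashlib.sha256(normalize(statement).encode("utf-8")).hexdigest()


def contaminated(test_statements, train_statements):
    """Return the test statements whose normalized text also appears in training."""
    train = {fingerprint(s) for s in train_statements}
    return [s for s in test_statements if fingerprint(s) in train]
```

Running `contaminated(test_statements, train_statements)` on the pinned split returns the overlapping test items, which a reviewer could then attach as evidence.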

id: vpr_f3a3a73919f9eb51
frontier: AI-for-science benchmark state
kind: finding.add
created: 2026-06-10
findings: +1
state: null → b3817600

accept gate

2 of 4 on record

✓ signature: reviewer:will-blair · key 4892f938
✓ chain: null → b3817600
— witness: no verifier attachment on record for this target
— grade: in state · unreviewed
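The "2 of 4 on record" line reads as a tally over the four gate checks listed here. A minimal sketch of that tally follows; the `GateCheck` type and `gate_summary` function are hypothetical names introduced for illustration, and the actual gate logic is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCheck:
    name: str
    on_record: bool  # True for checks marked ✓ in the record
    detail: str


def gate_summary(checks):
    """Summarize how many gate checks are on record."""
    passed = sum(1 for c in checks if c.on_record)
    return f"{passed} of {len(checks)} on record"


# Values copied from the accept gate of this record.
CHECKS = [
    GateCheck("signature", True, "reviewer:will-blair · key 4892f938"),
    GateCheck("chain", True, "null → b3817600"),
    GateCheck("witness", False, "no verifier attachment on record for this target"),
    GateCheck("grade", False, "in state · unreviewed"),
]
```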

timeline

2026-06-10 · propose · proposed · finding.add · reviewer:will-blair · vpr_f3a3a73919f9eb51 · Manual finding added to frontier state
2026-06-10 · accept · finding.asserted · reviewer:will-blair · null→b3817600 · vev_b396d3a2727ae019 · Manual finding added to frontier state

proposed

reason

Manual finding added to frontier state

finding type

computational

proposed confidence

0.30

confidence basis

operator-supplied frontier prior; review required

provenance

proposed by

reviewer:will-blair

actor type

human

created at

2026-06-10

target type

finding

affected

vf_55068262f49df0ab

Diff

Read-only frontier; diff not recomputed.

Review chain

01 request
Change request
AI-for-science benchmark state receives a reviewable source, finding, caveat, replication, evaluation, or proof-affecting edit.

02 packet
Diff packet
The packet names affected record objects, evidence, rationale, reviewer-facing fields, and expected proof impact.

03 checks
Check output
Schema, provenance, benchmark, contradiction, and proof checks decide whether the request is ready to read.

04 review
Reviewer decision
A steward accepts, rejects, caveats, revises, or retracts the request under an inspectable identity.

05 accepted
Accepted event
Only the accepted event mutates frontier state. Atlases, constellations, and search update from that record state.
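The rule that only the accepted event mutates frontier state can be sketched as a small event-sourced model. Everything below is an assumption made for illustration: the `Frontier` class, the stage names taken from the step labels, and the state identifier derived as a truncated SHA-256 over the previous state and the payload. The page does not say how identifiers such as b3817600 are actually computed.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional

# Stage names follow the five steps of the review chain.
STAGES = ("request", "packet", "checks", "review", "accepted")


@dataclass
class Frontier:
    findings: list = field(default_factory=list)
    state: Optional[str] = None  # None mirrors the "null" starting state

    def apply(self, stage: str, payload: dict) -> Optional[str]:
        """Record a stage; only "accepted" changes findings and state."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage != "accepted":
            return self.state  # earlier stages never mutate the frontier
        self.findings.append(payload)
        # Hypothetical chaining: hash the previous state together with the payload.
        blob = json.dumps([self.state, payload], sort_keys=True).encode("utf-8")
        self.state = hashlib.sha256(blob).hexdigest()[:8]
        return self.state
```

Replaying the same accepted events in the same order from an empty `Frontier` reproduces the same state identifier, which is the property a second producer would rely on when replaying the record.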
