evidence boundary
AI-for-science benchmark state
- id: vfr_efc649fd772a1ff1
- license: CC-BY-4.0
- findings: 12
- accepted core: 12
- contested: 0
- links: 0
- sources: 1
- evidence: 12
- avg conf: 0.30
e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f
Evidence atom
BENCHMARK CLAIM (MiniF2F): DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.
- id: vea_2ac0e43858a68cb9
- frontier: AI-for-science benchmark state
- source: vs_066123dd29a9c5b4
- finding: vf_55068262f49df0ab
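Two of the open obligations in the statement above (pinning the split and auditing train/test contamination) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `normalize`, `split_fingerprint`, and `exact_overlap` are hypothetical, the statements are assumed to be available as plain Lean source strings, and exact matching after normalization catches only verbatim leakage, not paraphrased or semantically equivalent statements.

```python
import hashlib
import re


def normalize(statement: str) -> str:
    """Strip Lean line comments and collapse whitespace.

    Assumption: trivial formatting differences should not count as
    different statements.
    """
    no_comments = re.sub(r"--.*", "", statement)
    return " ".join(no_comments.split())


def split_fingerprint(statements: list[str]) -> str:
    """Order-independent SHA-256 fingerprint of a split.

    Recording this value alongside a result pins which statements
    were evaluated, independent of file order.
    """
    digests = sorted(
        hashlib.sha256(normalize(s).encode("utf-8")).hexdigest()
        for s in statements
    )
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


def exact_overlap(train: list[str], test: list[str]) -> set[str]:
    """Normalized test statements that also occur in the training set."""
    train_set = {normalize(s) for s in train}
    return {normalize(s) for s in test} & train_set
```

A non-empty result from `exact_overlap` would flag statements for manual review; an empty result does not by itself establish that the split is uncontaminated.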
finding binding
bound · computational
source binding
source-bound · manual finding
vs_066123dd29a9c5b4
review context
unverified · 2 events
2 reviewable changes and 0 evaluation records target this atom or its bound objects.
extraction method
manual_curation
support relation
unknown
condition refs
vcnd_80603ce3c91d9931
caveats
- missing evidence locator
Review, event, and evaluation records
4 events
vev_804497a5a8fbe4a0 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_b396d3a2727ae019 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
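The capping rule in the hardening note can be sketched as follows. This is an illustration under stated assumptions, not the tracker's actual logic: the record names only 'attested' and 'verified', so the three-rung ladder, the `Trust` enum, and the `effective_trust` function are hypothetical.

```python
from enum import IntEnum


class Trust(IntEnum):
    # Hypothetical rungs; the record itself names only 'attested'
    # and 'verified'.
    UNVERIFIED = 0
    ATTESTED = 1
    VERIFIED = 2


def effective_trust(
    claimed: Trust,
    label_provenance: str,
    has_deterministic_rederivation: bool,
) -> Trust:
    """Cap a record below VERIFIED while its labels are only attested."""
    if label_provenance == "attested" and not has_deterministic_rederivation:
        return min(claimed, Trust.ATTESTED)
    return claimed
```

Under this sketch, a record claimed as `Trust.VERIFIED` with `label_provenance="attested"` stays at `Trust.ATTESTED` until a deterministic rederivation is recorded.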
reviewable changes
vpr_f3a3a73919f9eb51 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_fb5a71c197133639 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
evaluations
No evaluation rows are attached.