record state
frontier-owned
AI-for-science benchmark state
- id: vfr_efc649fd772a1ff1
- license: CC-BY-4.0
- findings: 12
- accepted core: 12
- contested: 0
- links: 0
- sources: 1
- evidence: 12
- avg conf: 0.30
e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f
Finding bundle
BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.
- id: vf_55068262f49df0ab
- frontier: AI-for-science benchmark state
- version: 1
- confidence: 0.30
no incoming links yet
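Two of the open obligations above (pinning the split and auditing train/test contamination) can be sketched in a few lines. This is a minimal illustration, not the frontier's or the paper's method: the function names and the toy statements are hypothetical, and exact matching after whitespace normalization only catches verbatim leaks, not paraphrased or renamed statements.

```python
import hashlib
import re


def normalize_statement(stmt: str) -> str:
    """Collapse whitespace so trivially reformatted statements compare equal."""
    return re.sub(r"\s+", " ", stmt).strip()


def fingerprint(statements: list[str]) -> str:
    """Order-independent SHA-256 over normalized statements (pins a split)."""
    digests = sorted(
        hashlib.sha256(normalize_statement(s).encode("utf-8")).hexdigest()
        for s in statements
    )
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


def exact_overlap(train: list[str], test: list[str]) -> set[str]:
    """Normalized test statements that also appear verbatim in train."""
    train_set = {normalize_statement(s) for s in train}
    return {normalize_statement(s) for s in test} & train_set


# Hypothetical toy statements, for illustration only.
train = ["theorem a (x : Nat) :  x + 0 = x", "theorem b : 1 + 1 = 2"]
test = ["theorem a (x : Nat) : x + 0 = x", "theorem c : 2 + 2 = 4"]

leaked = exact_overlap(train, test)   # statements shared by both splits
pinned = fingerprint(test)            # record this hash next to the result
```

Recording the fingerprint alongside a reported pass rate would let a later re-run confirm that it evaluated the same set of statements.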
file
/frontier/benchmark-state#e=17
scrub position · after_hash afc19db3f0499f21…
vf_55068262f49df0ab · benchmark-state · https://vela-site-next.fly.dev/frontier/benchmark-state#e=17
raw json · vf_55068262f49df0ab (2.7 KB)
{
"annotations": [
{
"author": "reviewer:will-blair",
"id": "ann_a90a9a26863b2dc8",
"text": "HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.",
"timestamp": "2026-06-10T23:01:45.084566+00:00"
}
],
"assertion": {
"direction": null,
"entities": [],
"relation": null,
"text": "BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.",
"type": "computational"
},
"conditions": {
"age_group": null,
"cell_type": null,
"clinical_trial": false,
"concentration_range": null,
"duration": null,
"human_data": false,
"in_vitro": false,
"in_vivo": false,
"species_unverified": [],
"species_verified": [],
"text": "Manually added finding; requires evidence review before scientific use."
},
"confidence": {
"basis": "operator-supplied frontier prior; review required",
"extraction_confidence": 1,
"kind": "frontier_epistemic",
"method": "expert_judgment",
"score": 0.3
},
"created": "2026-06-10T06:50:55.829210+00:00",
"evidence": {
"effect_size": null,
"evidence_spans": [],
"method": "manual state transition",
"model_system": "",
"p_value": null,
"replicated": false,
"replication_count": null,
"sample_size": null,
"species": null,
"type": "computational"
},
"flags": {
"contested": false,
"declining": false,
"gap": true,
"gravity_well": false,
"negative_space": false,
"retracted": false
},
"id": "vf_55068262f49df0ab",
"links": [],
"previous_version": null,
"provenance": {
"authors": [
{
"name": "reviewer:will-blair",
"orcid": null
}
],
"citation_count": null,
"doi": null,
"extraction": {
"extracted_at": "2026-06-10T06:50:55.829198+00:00",
"extractor_version": "vela/0.691.0",
"method": "manual_curation",
"model": null,
"model_version": null
},
"journal": null,
"openalex_id": null,
"pmc": null,
"pmid": null,
"review": {
"corrections": [],
"reviewed": false,
"reviewed_at": null,
"reviewer": null
},
"source_type": "expert_assertion",
"title": "manual finding",
"year": null
},
"updated": null,
"version": 1
}
Unsealed — 0 attachment(s) on record, awaiting independent verification.
0 attachments · 0 distinct checker actors · 0 methods
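The raw JSON above already carries the fields that explain why the record is still unsealed. As a rough sketch, the check below reads a trimmed excerpt of that record and lists the gaps. The function name `verification_gaps` and the wording of each gap are hypothetical; only the field names and values come from the record.

```python
import json


def verification_gaps(record: dict) -> list[str]:
    """List reasons a finding record falls short of independent verification."""
    evidence = record.get("evidence", {})
    review = record.get("provenance", {}).get("review", {})
    gaps = []
    if not evidence.get("replicated", False):
        gaps.append("not replicated")
    if not review.get("reviewed", False):
        gaps.append("not reviewed")
    if not evidence.get("evidence_spans"):
        gaps.append("no evidence spans")
    if record.get("flags", {}).get("gap", False):
        gaps.append("flagged as gap")
    return gaps


# Trimmed excerpt of the record shown above; other fields omitted.
record = json.loads("""
{
  "id": "vf_55068262f49df0ab",
  "evidence": {"evidence_spans": [], "replicated": false},
  "flags": {"gap": true},
  "provenance": {"review": {"reviewed": false}}
}
""")

gaps = verification_gaps(record)
```

For this record every check fires, which matches the page's own summary of zero attachments and zero checker actors.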
blame · custody trail
vev_b396d3a2727ae019 · history · 2 events
finding statement
finding type: computational
No entity list is declared.
evidence
source-bound · 1 atom
computational · manual state transition
proof impact
packet context · 2 events
2 reviewable changes and 0 evaluation records are attached to this finding id.
evidence
method: manual state transition
evidence type: computational
conditions
- species_unverified: none
- species_verified: none
- text: Manually added finding; requires evidence review before scientific use.
provenance
source title: manual finding
authors: reviewer:will-blair
Source records: 1
Evidence atoms: 1
- vea_2ac0e43858a68cb9 · computational · unknown
BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.
vs_066123dd29a9c5b4 · manual_curation
Typed links: 0
outgoing
No outgoing links.
incoming
No incoming links.
Review, event, and evaluation records: 4
events
vev_804497a5a8fbe4a0 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_b396d3a2727ae019 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
reviewable changes
vpr_f3a3a73919f9eb51 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_fb5a71c197133639 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
evaluations
No evaluation record targets this finding id.
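The HARDENING note recorded in the events above packs its fields as `key=value` pairs inside free text. The sketch below shows one way to pull those pairs out and apply the stated cap. Everything beyond the note's own wording is an assumption: the helper names are hypothetical, the trust ladder's other rungs are not described on this page, and treating a missing `label_provenance` as capped is a conservative guess.

```python
import re


def parse_hardening_note(text: str) -> dict[str, str]:
    """Extract key=value pairs from a free-text hardening note."""
    pairs = re.findall(r"(\w+)=([^\s,()]+)", text)
    # A value at the end of a sentence picks up the period; strip it.
    return {key: value.rstrip(".") for key, value in pairs}


def may_be_marked_verified(fields: dict[str, str]) -> bool:
    """Assumption: 'attested' (or missing) label provenance caps the record."""
    provenance = fields.get("label_provenance")
    return provenance is not None and provenance != "attested"


note = (
    "HARDENING (benchmark-state): label_provenance=attested "
    "(records-not-reruns; ground truth is an answer key, not a "
    "frozen-verifier rederivation), valid_as_of=2026-06-10, "
    "model_cutoff=unknown. Under the trust ladder, attested label "
    "provenance caps this record below 'verified' until a deterministic "
    "rederivation exists."
)

fields = parse_hardening_note(note)
allowed = may_be_marked_verified(fields)
```

With the note on this record, `allowed` is false, consistent with the record remaining below 'verified' until a deterministic rederivation exists.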