AI-for-science benchmark state

id: vfr_efc649fd772a1ff1
license: CC-BY-4.0
findings: 12
accepted core: 12
contested: 0
links: 0
sources: 1
evidence: 12
avg conf: 0.30

used by 0 · replayed by 1 producer · second seat open

e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f

Finding bundle

BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.

id: vf_55068262f49df0ab
frontier: AI-for-science benchmark state
version: 1
confidence: 0.30
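
The open obligation in the finding statement names two mechanical steps: pinning the split and auditing train/test contamination of the formal statements. A minimal sketch of both follows, assuming the statements are available as plain strings. The function names (`normalize`, `split_fingerprint`, `exact_overlap`) and the toy theorem strings are hypothetical and are not part of Vela or of the DeepSeek-Prover release. Exact matching only catches verbatim leakage; renamed or paraphrased statements would need a stronger check.

```python
import hashlib
import re


def normalize(statement: str) -> str:
    """Collapse whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", statement).strip()


def split_fingerprint(statements: list[str]) -> str:
    """Order-independent SHA-256 over normalized statements, used to pin a split."""
    digests = sorted(
        hashlib.sha256(normalize(s).encode("utf-8")).hexdigest() for s in statements
    )
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


def exact_overlap(train: list[str], test: list[str]) -> list[str]:
    """Return normalized test statements that also appear verbatim in training data."""
    seen = {normalize(s) for s in train}
    return [normalize(s) for s in test if normalize(s) in seen]


# Toy data (hypothetical, for illustration only).
test_split = ["theorem t1 : 1 + 1 = 2", "theorem t2 :  2 + 2 = 4"]
train_split = ["theorem t2 : 2 + 2 = 4", "theorem t3 : 3 + 3 = 6"]

print(split_fingerprint(test_split))           # record this digest next to the result
print(exact_overlap(train_split, test_split))  # statements that leaked verbatim
```

Recording the fingerprint alongside a re-run would let a later checker confirm that the same split was evaluated.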

no incoming links yet

file

/frontier/benchmark-state#e=17 · scrub position · after_hash afc19db3f0499f21…

vf_55068262f49df0ab · benchmark-state · https://vela-site-next.fly.dev/frontier/benchmark-state#e=17

raw json · vf_55068262f49df0ab (2.7 KB)

machine twin · index.json

{
 "annotations": [
  {
   "author": "reviewer:will-blair",
   "id": "ann_a90a9a26863b2dc8",
   "text": "HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.",
   "timestamp": "2026-06-10T23:01:45.084566+00:00"
  }
 ],
 "assertion": {
  "direction": null,
  "entities": [],
  "relation": null,
  "text": "BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.",
  "type": "computational"
 },
 "conditions": {
  "age_group": null,
  "cell_type": null,
  "clinical_trial": false,
  "concentration_range": null,
  "duration": null,
  "human_data": false,
  "in_vitro": false,
  "in_vivo": false,
  "species_unverified": [],
  "species_verified": [],
  "text": "Manually added finding; requires evidence review before scientific use."
 },
 "confidence": {
  "basis": "operator-supplied frontier prior; review required",
  "extraction_confidence": 1,
  "kind": "frontier_epistemic",
  "method": "expert_judgment",
  "score": 0.3
 },
 "created": "2026-06-10T06:50:55.829210+00:00",
 "evidence": {
  "effect_size": null,
  "evidence_spans": [],
  "method": "manual state transition",
  "model_system": "",
  "p_value": null,
  "replicated": false,
  "replication_count": null,
  "sample_size": null,
  "species": null,
  "type": "computational"
 },
 "flags": {
  "contested": false,
  "declining": false,
  "gap": true,
  "gravity_well": false,
  "negative_space": false,
  "retracted": false
 },
 "id": "vf_55068262f49df0ab",
 "links": [],
 "previous_version": null,
 "provenance": {
  "authors": [
   {
    "name": "reviewer:will-blair",
    "orcid": null
   }
  ],
  "citation_count": null,
  "doi": null,
  "extraction": {
   "extracted_at": "2026-06-10T06:50:55.829198+00:00",
   "extractor_version": "vela/0.691.0",
   "method": "manual_curation",
   "model": null,
   "model_version": null
  },
  "journal": null,
  "openalex_id": null,
  "pmc": null,
  "pmid": null,
  "review": {
   "corrections": [],
   "reviewed": false,
   "reviewed_at": null,
   "reviewer": null
  },
  "source_type": "expert_assertion",
  "title": "manual finding",
  "year": null
 },
 "updated": null,
 "version": 1
}
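
A consumer of this record would likely want to check the review and evidence fields before relying on the claim. The sketch below reads an excerpt of the JSON above and lists reasons the finding is not yet fit for use. The field names come from the record; the function `blocking_reasons` and the `min_score` threshold of 0.5 are assumptions made for illustration, not Vela policy.

```python
import json

# Excerpt of the record above; only the fields the check reads are kept.
RECORD_EXCERPT = json.loads("""
{
  "confidence": {"score": 0.3},
  "evidence": {"replicated": false, "evidence_spans": []},
  "flags": {"gap": true, "retracted": false},
  "provenance": {"review": {"reviewed": false}}
}
""")


def blocking_reasons(record: dict, min_score: float = 0.5) -> list[str]:
    """Collect reasons a finding should not yet be relied on (hypothetical policy)."""
    reasons = []
    if not record["provenance"]["review"]["reviewed"]:
        reasons.append("unreviewed")
    if record["flags"]["gap"]:
        reasons.append("gap flag set")
    if record["flags"]["retracted"]:
        reasons.append("retracted")
    if not record["evidence"]["replicated"]:
        reasons.append("not replicated")
    if record["confidence"]["score"] < min_score:
        reasons.append("confidence below threshold")
    return reasons


print(blocking_reasons(RECORD_EXCERPT))
# ['unreviewed', 'gap flag set', 'not replicated', 'confidence below threshold']
```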

Unsealed — 0 attachment(s) on record, awaiting independent verification.

0 attachments · 0 distinct checker actors · 0 methods

blame · custody trail

produced by: reviewer:will-blair · finding.asserted · 2026-06-10 · vev_b396d3a2727ae019

checked by: no verifier attachment on record

accepted by: no accept signed

history · 2 events

e2/24 · finding.asserted · reviewer:will-blair · 2026-06-10 · →b381760051
e17/24 · finding.noted · reviewer:will-blair · 2026-06-10 · →afc19db3f0

record state

frontier-owned

Review status

unreviewed

finding statement

finding type

computational

No entity list is declared.

evidence

source-bound

1 atom

computational · manual state transition

proof impact

packet context

2 events

2 reviewable changes and 0 evaluation records are attached to this finding id.

evidence

method

manual state transition

evidence type

computational

conditions

species_unverified: []
species_verified: []
text: Manually added finding; requires evidence review before scientific use.

provenance

source title

manual finding

authors

reviewer:will-blair

Source records

source record

declared

manual finding

vs_066123dd29a9c5b4

title:manual finding

manual_curation

Evidence atoms

vea_2ac0e43858a68cb9 · computational · unknown
BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.
vs_066123dd29a9c5b4 · manual_curation

Typed links

outgoing

No outgoing links.

incoming

No incoming links.

Review, event, and evaluation records

events

vev_804497a5a8fbe4a0 · finding.noted
HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_b396d3a2727ae019 · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
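
The hardening note above states a rule: attested label provenance caps a record below 'verified' until a deterministic rederivation exists. One way to read that rule is sketched here. Only the names 'attested' and 'verified' appear in the record; the three-rung ladder, the `Rung` enum, and `capped_rung` are assumptions made for illustration and may not match the actual trust ladder.

```python
from enum import IntEnum


class Rung(IntEnum):
    """Hypothetical trust ladder; higher values mean stronger verification."""
    ASSERTED = 0
    ATTESTED = 1
    VERIFIED = 2


def capped_rung(requested: Rung, label_provenance: str,
                has_deterministic_rederivation: bool) -> Rung:
    """Apply the cap: attested labels cannot reach VERIFIED without a rederivation."""
    if label_provenance == "attested" and not has_deterministic_rederivation:
        return min(requested, Rung.ATTESTED)
    return requested


# This finding: attested labels, no rederivation on record.
print(capped_rung(Rung.VERIFIED, "attested", False).name)  # ATTESTED
```

Under this reading, an independent re-run would not lift the cap by itself; a deterministic rederivation of the labels would.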

reviewable changes

vpr_f3a3a73919f9eb51 · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_fb5a71c197133639 · finding.note
HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10

evaluations

No evaluation record targets this finding id.
