source boundary: frontier-owned
AI-for-science benchmark state
- id: vfr_efc649fd772a1ff1
- license: CC-BY-4.0
- findings: 12
- accepted core: 12
- contested: 0
- links: 0
- sources: 1
- evidence: 12
- avg conf: 0.30
e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f
Source record
manual finding
- id: vs_066123dd29a9c5b4
- frontier: AI-for-science benchmark state
- type: paper
- finding bindings: record context · 12 findings
- evidence atoms: materialized · 12 atoms
- review context: inspectable · 24 events
24 reviewable changes and 0 evaluations are attached through this source or its findings.
citation
- locator: title:manual finding
- imported: 2026-06-10T06:50:55.899492+00:00
- extraction mode: manual_curation
- authors: reviewer:will-blair
- caveats: No caveats recorded.
Bound findings
12 findings
- BENCHMARK CLAIM (ProteinGym) — ESM-1v REPORTS strong zero-shot substitution performance from masked-marginal scoring of a protein language model. VERIFICATION STATE: author-reported; weights public; depends on the scoring convention (masked-marginal vs wt-marginal) and the ProteinGym version. NOT re-run here. Open obligation: re-score the released model on the pinned v1.1 zero-shot substitution set.
  computational · vf_03776d6cd3e0801b
- BENCHMARK CLAIM (ProteinGym) — Tranception + retrieval (and TranceptEVE, combining Tranception with the EVE family model) REPORT leading zero-shot Spearman by mixing an autoregressive PLM with MSA-derived statistics. VERIFICATION STATE: author-reported; MSA-dependent, so the number moves with the alignment pipeline. NOT re-run here. Open obligation: reproduce with the stated MSAs and depth.
  computational · vf_170c9a0e01a9b1d3
- BENCHMARK CLAIM (MiniF2F) — Draft-Sketch-Prove (DSP) REPORTS improved miniF2F-test pass by drafting an informal proof, sketching a formal skeleton, then closing gaps with an ATP. VERIFICATION STATE: author-reported; pipeline described; depends on the underlying ATP and the autoformalizer, both of which drift. NOT re-run here. Open obligation: reproduce with pinned ATP + LLM versions.
  computational · vf_368ec6ffb5747092
- BENCHMARK CLAIM (ProteinGym) — ProteinNPT (non-parametric transformer, supervised track) REPORTS gains by attending across labelled neighbours. VERIFICATION STATE: author-reported; SUPERVISED — not comparable to zero-shot numbers; depends on the cross-validation split. NOT re-run here. Open obligation: re-run under the official supervised CV split; never compare against zero-shot rows.
  computational · vf_41030d44f59eae22
- BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.
  computational · vf_55068262f49df0ab
- LEAKAGE HAZARD (ProteinGym). PLMs trained on UniProt may have seen sequences related to the assay proteins; 'zero-shot' is zero-shot on the LABELS, not necessarily on the SEQUENCES. VERIFICATION STATE: training-sequence overlap with assay proteins is under-audited. Open obligation: a sequence-similarity leakage audit between each model's training set and the assay proteins before banking any SOTA claim.
  computational · vf_8212daf3d7034a93
- BENCHMARK CLAIM (MiniF2F) — HyperTree Proof Search (HTPS, Lample et al.) REPORTS a miniF2F pass rate via learned best-first proof search. VERIFICATION STATE: author-reported; search budget and version-specific. NOT re-run here. Open obligation: re-run at the stated budget on a pinned split.
  computational · vf_9a454a597ddee070
- BENCHMARK CLAIM (ProteinGym) — EVE (evolutionary VAE over an MSA) REPORTS strong variant-effect prediction, especially for clinical variants. VERIFICATION STATE: author-reported; fully MSA-dependent; per-protein model fitting. NOT re-run here. Open obligation: re-fit on pinned MSAs and confirm the held-out assay Spearman.
  computational · vf_cc50639072ba1867
- BENCHMARK META (MiniF2F). MiniF2F is ~488 olympiad/textbook formal-math problems (AMC/AIME/IMO + MATH), ported to Lean/Isabelle/HOL-Light/Metamath, split valid/test. KNOWN TRUST ISSUE: multiple incompatible versions exist (original 2021, miniF2F-v2, and the 'miniF2F Revisited' cleanup with corrected/changed statements), so pass-rates across papers are version-ambiguous unless the exact split is pinned. STATE: dataset-version hazard, not a model claim.
  computational · vf_cf89ac0f36e62089
- FAITHFULNESS HAZARD (MiniF2F). A reported 'solve' is only as good as the autoformalized statement matching the intended problem; the miniF2F Revisited effort found statements that were mis-stated or trivially true. VERIFICATION STATE: faithfulness of the FORMAL statement to the INFORMAL problem is the under-checked axis. Open obligation: every banked miniF2F solve needs a statement-faithfulness attestation (vela attest --scope formalism-fidelity).
  computational · vf_dce7a34adf2878f2
- BENCHMARK META (ProteinGym). ProteinGym benchmarks variant-effect prediction against deep mutational scanning (DMS) assays: a substitution benchmark (~217 assays) and an indel benchmark, with zero-shot and supervised tracks, scored by Spearman correlation (and AUC/MCC). KNOWN TRUST ISSUE: v1.0 vs v1.1 differ in assay set and splits; zero-shot vs supervised numbers are not comparable; MSA-dependent methods vary with the MSA pipeline. STATE: dataset-version + track-conflation hazard.
  computational · vf_ec4bb8feca206bf2
- BENCHMARK CLAIM (MiniF2F) — AlphaProof (DeepMind) solved several IMO-2024 problems formally in Lean; this is sometimes conflated with a miniF2F number. VERIFICATION STATE: the IMO result is a separate, time-bounded competition claim, NOT a miniF2F-test pass rate; no public checkpoint. Open obligation: do not record an AlphaProof miniF2F figure without a cited, pinned evaluation.
  computational · vf_fec6f956d525e753
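The version and track hazards recorded in these findings suggest a simple guard before any two reported numbers are placed side by side. The following is a minimal sketch under stated assumptions, not part of this frontier's tooling: the `ReportedResult` type, its fields, and the `comparable` function are hypothetical names introduced for illustration. No metric values appear because none were re-run here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReportedResult:
    """Context needed to decide whether two reported rows can be compared.

    All field names are hypothetical; they are not a schema from this record.
    """
    model: str
    benchmark: str            # e.g. "ProteinGym" or "miniF2F"
    version: Optional[str]    # pinned dataset version; None if the source does not pin it
    track: Optional[str]      # e.g. "zero-shot" or "supervised"; None where tracks do not apply


def comparable(a: ReportedResult, b: ReportedResult) -> Tuple[bool, str]:
    """Return (ok, reason); rows compare only if benchmark, pinned version, and track all match."""
    if a.benchmark != b.benchmark:
        return False, "different benchmarks"
    if a.version is None or b.version is None:
        return False, "dataset version not pinned"
    if a.version != b.version:
        return False, "dataset versions differ"
    if a.track != b.track:
        return False, "tracks differ"
    return True, "ok"


if __name__ == "__main__":
    zero_shot = ReportedResult("ESM-1v", "ProteinGym", "v1.1", "zero-shot")
    supervised = ReportedResult("ProteinNPT", "ProteinGym", "v1.1", "supervised")
    print(comparable(zero_shot, supervised))
```

Returning a reason alongside the boolean keeps the refusal inspectable, which matches the record's habit of stating why a number is not yet banked.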
Evidence atoms
12 atoms
- vea_107364fe31419d2d · computational · unknown
BENCHMARK META (MiniF2F). MiniF2F is ~488 olympiad/textbook formal-math problems (AMC/AIME/IMO + MATH), ported to Lean/Isabelle/HOL-Light/Metamath, split valid/test. KNOWN TRUST ISSUE: multiple incompatible versions exist (original 2021, miniF2F-v2, and the 'miniF2F Revisited' cleanup with corrected/changed statements), so pass-rates across papers are version-ambiguous unless the exact split is pinned. STATE: dataset-version hazard, not a model claim.
- vea_19a1954be8602f06 · computational · unknown
BENCHMARK CLAIM (ProteinGym) — Tranception + retrieval (and TranceptEVE, combining Tranception with the EVE family model) REPORT leading zero-shot Spearman by mixing an autoregressive PLM with MSA-derived statistics. VERIFICATION STATE: author-reported; MSA-dependent, so the number moves with the alignment pipeline. NOT re-run here. Open obligation: reproduce with the stated MSAs and depth.
- vea_1c96c2bca6cabe6d · computational · unknown
BENCHMARK CLAIM (MiniF2F) — Draft-Sketch-Prove (DSP) REPORTS improved miniF2F-test pass by drafting an informal proof, sketching a formal skeleton, then closing gaps with an ATP. VERIFICATION STATE: author-reported; pipeline described; depends on the underlying ATP and the autoformalizer, both of which drift. NOT re-run here. Open obligation: reproduce with pinned ATP + LLM versions.
- vea_2ac0e43858a68cb9 · computational · unknown
BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.
- vea_3818a2b502e64c42 · computational · unknown
BENCHMARK CLAIM (ProteinGym) — ProteinNPT (non-parametric transformer, supervised track) REPORTS gains by attending across labelled neighbours. VERIFICATION STATE: author-reported; SUPERVISED — not comparable to zero-shot numbers; depends on the cross-validation split. NOT re-run here. Open obligation: re-run under the official supervised CV split; never compare against zero-shot rows.
- vea_4ac62ba55a4b8dbc · computational · unknown
BENCHMARK CLAIM (MiniF2F) — HyperTree Proof Search (HTPS, Lample et al.) REPORTS a miniF2F pass rate via learned best-first proof search. VERIFICATION STATE: author-reported; search budget and version-specific. NOT re-run here. Open obligation: re-run at the stated budget on a pinned split.
- vea_701b8b3ab51f97af · computational · unknown
BENCHMARK CLAIM (ProteinGym) — EVE (evolutionary VAE over an MSA) REPORTS strong variant-effect prediction, especially for clinical variants. VERIFICATION STATE: author-reported; fully MSA-dependent; per-protein model fitting. NOT re-run here. Open obligation: re-fit on pinned MSAs and confirm the held-out assay Spearman.
- vea_78b5ea08545ad462 · computational · unknown
BENCHMARK CLAIM (MiniF2F) — AlphaProof (DeepMind) solved several IMO-2024 problems formally in Lean; this is sometimes conflated with a miniF2F number. VERIFICATION STATE: the IMO result is a separate, time-bounded competition claim, NOT a miniF2F-test pass rate; no public checkpoint. Open obligation: do not record an AlphaProof miniF2F figure without a cited, pinned evaluation.
- vea_874a2c26d4e672f2 · computational · unknown
LEAKAGE HAZARD (ProteinGym). PLMs trained on UniProt may have seen sequences related to the assay proteins; 'zero-shot' is zero-shot on the LABELS, not necessarily on the SEQUENCES. VERIFICATION STATE: training-sequence overlap with assay proteins is under-audited. Open obligation: a sequence-similarity leakage audit between each model's training set and the assay proteins before banking any SOTA claim.
- vea_a9c4e7b494465d60 · computational · unknown
BENCHMARK CLAIM (ProteinGym) — ESM-1v REPORTS strong zero-shot substitution performance from masked-marginal scoring of a protein language model. VERIFICATION STATE: author-reported; weights public; depends on the scoring convention (masked-marginal vs wt-marginal) and the ProteinGym version. NOT re-run here. Open obligation: re-score the released model on the pinned v1.1 zero-shot substitution set.
- vea_ad36ad1c4b0f546f · computational · unknown
BENCHMARK META (ProteinGym). ProteinGym benchmarks variant-effect prediction against deep mutational scanning (DMS) assays: a substitution benchmark (~217 assays) and an indel benchmark, with zero-shot and supervised tracks, scored by Spearman correlation (and AUC/MCC). KNOWN TRUST ISSUE: v1.0 vs v1.1 differ in assay set and splits; zero-shot vs supervised numbers are not comparable; MSA-dependent methods vary with the MSA pipeline. STATE: dataset-version + track-conflation hazard.
- vea_c7b4329c7e40cedc · computational · unknown
FAITHFULNESS HAZARD (MiniF2F). A reported 'solve' is only as good as the autoformalized statement matching the intended problem; the miniF2F Revisited effort found statements that were mis-stated or trivially true. VERIFICATION STATE: faithfulness of the FORMAL statement to the INFORMAL problem is the under-checked axis. Open obligation: every banked miniF2F solve needs a statement-faithfulness attestation (vela attest --scope formalism-fidelity).
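Several atoms above rest on Spearman correlation between model scores and DMS measurements. As an illustration of what a re-run would compute for a single assay, here is a small dependency-free sketch: Spearman's coefficient is the Pearson correlation of the ranks, with tied values given their average rank. The function names are illustrative, and how per-assay values are aggregated into a headline number is not specified in this record, so it is left out.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(model_scores: Sequence[float], assay_values: Sequence[float]) -> float:
    """Spearman rank correlation between predicted scores and measured assay values."""
    if len(model_scores) != len(assay_values) or len(model_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(model_scores), average_ranks(assay_values)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    # Toy numbers only; these are not benchmark data.
    print(spearman([0.1, 0.5, 0.2], [1.0, 100.0, 2.0]))
```

Because only ranks enter the computation, any monotone rescaling of the model scores leaves the coefficient unchanged. A change in which variants are scored, for example between dataset versions, does not.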
Review, event, and evaluation records
48 events
vev_03b2b7f5e7e0be96 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_07e40e45981061ad · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_270eaf05963c65df · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_2e1f4d109d1a1f73 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_31b40ef5e25c88b6 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_39ad7234d713069a · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_4869af225af70848 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_4e2e2a5f25a8e28f · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_5064c841d30e8aaf · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_5a33eaff97407ac8 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_804497a5a8fbe4a0 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_a73023eb43fa7387 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_b11de7b18f8b9f24 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_b396d3a2727ae019 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_bb98a228e1f4a5a7 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_bd0ec86a1be50d66 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_c4da9db8be63634e · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_c75d1f9984ddd3ab · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_d199cb2e417c4f42 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_e5d45a5605897295 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_eb4221b9d34b54cd · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_f032a45ff0886024 · finding.noted · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
reviewer:will-blair · 2026-06-10
vev_f17f5a864754e2a0 · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
vev_fc4e6c758136cecd · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-06-10
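The hardening notes in these events describe a ceiling rule: attested label provenance holds a record below 'verified' until a deterministic rederivation exists. The sketch below shows one way such a cap could be expressed. It is an assumption-heavy illustration: the record names only 'verified' and the 'attested' provenance value, so the ladder rungs `unreviewed` and `reviewed`, the function name, and its parameters are all hypothetical.

```python
# Hypothetical ladder, lowest to highest. Only 'verified' is named in the record;
# the lower rungs are placeholders for illustration.
LADDER = ["unreviewed", "reviewed", "verified"]


def cap_trust(requested: str, label_provenance: str, has_rederivation: bool) -> str:
    """Return the highest trust level the record may hold, given its label provenance.

    Attested provenance without a deterministic rederivation caps the record
    one rung below 'verified'; otherwise the requested level is returned as-is.
    """
    if requested not in LADDER:
        raise ValueError(f"unknown trust level: {requested!r}")
    if label_provenance == "attested" and not has_rederivation:
        ceiling = LADDER.index("verified") - 1
        return LADDER[min(LADDER.index(requested), ceiling)]
    return requested


if __name__ == "__main__":
    print(cap_trust("verified", "attested", has_rederivation=False))
```

Expressing the rule as a ceiling rather than a fixed assignment means that a record already sitting lower on the ladder is left where it is.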
reviewable changes
vpr_01ce4a9a77a73640 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_212f7910ba3adf91 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_2ad76a3dce783d96 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_66758152772dd461 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_684dfb8e796321d2 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_6a1b9f61788f93f0 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_74c27456b2783c8d · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_7d99f50cb897566d · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_8ebb01be4aedad3b · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_9496dacae43645bc · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_965bbdde5ff53044 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_adea2a2f9e4ba533 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_be9c7dcdf52b3be5 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_c555bef607043399 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_ce433e03bf245f79 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_cf4939a974d7a904 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_d1d96c52036f153d · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_d3f3228bb463c2d9 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_e4bebe83c4ec21eb · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_edef714318aa82be · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_f08818383df2a902 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_f3a3a73919f9eb51 · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-06-10
vpr_fb5a71c197133639 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
vpr_fd307d5d15c2cef7 · finding.note · HARDENING (benchmark-state): label_provenance=attested (records-not-reruns; ground truth is an answer key, not a frozen-verifier rederivation), valid_as_of=2026-06-10, model_cutoff=unknown. Under the trust ladder, attested label provenance caps this record below 'verified' until a deterministic rederivation exists.
applied · agent:hardening-2026-06-10 · 2026-06-10
evaluations
No evaluation rows are attached.