AI-for-science benchmark state

id: vfr_efc649fd772a1ff1
license: CC-BY-4.0
findings: 12
accepted core: 12
contested: 0
links: 0
sources: 1
evidence: 12
avg conf: 0.30

used by 0 · replayed by 1 producer · second seat open

e24/24 · finding.noted · reviewer:will-blair · 2026-06-10 · 6c12→d02f

Brief & export

findings 12 · accepted 0 · open questions 0 · contested 0

strongest · none formally accepted

finding	confidence	source
BENCHMARK CLAIM (ProteinGym) — ESM-1v REPORTS strong zero-shot substitution performance from masked-marginal scoring of a protein language model. VERIFICATION STATE: author-reported; weights public; depends on the scoring convention (masked-marginal vs wt-marginal) and the ProteinGym version. NOT re-run here. Open obligation: re-score the released model on the pinned v1.1 zero-shot substitution set.	0.30	reviewer:will-blair
BENCHMARK CLAIM (ProteinGym) — Tranception + retrieval (and TranceptEVE, combining Tranception with the EVE family model) REPORT leading zero-shot Spearman by mixing an autoregressive PLM with MSA-derived statistics. VERIFICATION STATE: author-reported; MSA-dependent, so the number moves with the alignment pipeline. NOT re-run here. Open obligation: reproduce with the stated MSAs and depth.	0.30	reviewer:will-blair
BENCHMARK CLAIM (MiniF2F) — Draft-Sketch-Prove (DSP) REPORTS an improved miniF2F-test pass rate by drafting an informal proof, sketching a formal skeleton, then closing gaps with an ATP. VERIFICATION STATE: author-reported; pipeline described; depends on the underlying ATP and the autoformalizer, both of which drift. NOT re-run here. Open obligation: reproduce with pinned ATP + LLM versions.	0.30	reviewer:will-blair
BENCHMARK CLAIM (ProteinGym) — ProteinNPT (non-parametric transformer, supervised track) REPORTS gains by attending across labelled neighbours. VERIFICATION STATE: author-reported; SUPERVISED — not comparable to zero-shot numbers; depends on the cross-validation split. NOT re-run here. Open obligation: re-run under the official supervised CV split; never compare against zero-shot rows.	0.30	reviewer:will-blair
BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements.	0.30	reviewer:will-blair
LEAKAGE HAZARD (ProteinGym). PLMs trained on UniProt may have seen sequences related to the assay proteins; 'zero-shot' is zero-shot on the LABELS, not necessarily on the SEQUENCES. VERIFICATION STATE: training-sequence overlap with assay proteins is under-audited. Open obligation: a sequence-similarity leakage audit between each model's training set and the assay proteins before banking any SOTA claim.	0.30	reviewer:will-blair
BENCHMARK CLAIM (MiniF2F) — HyperTree Proof Search (HTPS, Lample et al.) REPORTS a miniF2F pass rate via learned best-first proof search. VERIFICATION STATE: author-reported; specific to the search budget and dataset version. NOT re-run here. Open obligation: re-run at the stated budget on a pinned split.	0.30	reviewer:will-blair
BENCHMARK CLAIM (ProteinGym) — EVE (evolutionary VAE over an MSA) REPORTS strong variant-effect prediction, especially for clinical variants. VERIFICATION STATE: author-reported; fully MSA-dependent; per-protein model fitting. NOT re-run here. Open obligation: re-fit on pinned MSAs and confirm the held-out assay Spearman.	0.30	reviewer:will-blair
BENCHMARK META (MiniF2F). MiniF2F is ~488 olympiad/textbook formal-math problems (AMC/AIME/IMO + MATH), ported to Lean/Isabelle/HOL-Light/Metamath, split valid/test. KNOWN TRUST ISSUE: multiple incompatible versions exist (original 2021, miniF2F-v2, and the 'miniF2F Revisited' cleanup with corrected/changed statements), so pass-rates across papers are version-ambiguous unless the exact split is pinned. STATE: dataset-version hazard, not a model claim.	0.30	reviewer:will-blair
FAITHFULNESS HAZARD (MiniF2F). A reported 'solve' is only as good as the autoformalized statement matching the intended problem; the miniF2F Revisited effort found statements that were mis-stated or trivially true. VERIFICATION STATE: faithfulness of the FORMAL statement to the INFORMAL problem is the under-checked axis. Open obligation: every banked miniF2F solve needs a statement-faithfulness attestation (vela attest --scope formalism-fidelity).	0.30	reviewer:will-blair
BENCHMARK META (ProteinGym). ProteinGym benchmarks variant-effect prediction against deep mutational scanning (DMS) assays: a substitution benchmark (~217 assays) and an indel benchmark, with zero-shot and supervised tracks, scored by Spearman correlation (and AUC/MCC). KNOWN TRUST ISSUE: v1.0 vs v1.1 differ in assay set and splits; zero-shot vs supervised numbers are not comparable; MSA-dependent methods vary with the MSA pipeline. STATE: dataset-version + track-conflation hazard.	0.30	reviewer:will-blair
BENCHMARK CLAIM (MiniF2F) — AlphaProof (DeepMind) solved several IMO-2024 problems formally in Lean; this is sometimes conflated with a miniF2F number. VERIFICATION STATE: the IMO result is a separate, time-bounded competition claim, NOT a miniF2F-test pass rate; no public checkpoint. Open obligation: do not record an AlphaProof miniF2F figure without a cited, pinned evaluation.	0.30	reviewer:will-blair
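
The ESM-1v row depends on the scoring convention, so it may help to see how the two conventions differ. The following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `log_probs` callable that stands in for a protein language model and returns per-position log-probabilities, a hypothetical `<mask>` token, and that multi-mutants are scored by masking one mutated position at a time and summing. The toy model exists only to make the sketch runnable.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical mask token, an assumption of this sketch

# Hypothetical model interface: tokens in, per-position log-probabilities out.
LogProbFn = Callable[[Sequence[str]], List[Dict[str, float]]]


def parse_mutation(mutation: str) -> Tuple[str, int, str]:
    """Split 'A2G' into (wild-type residue, 0-based position, mutant residue)."""
    return mutation[0], int(mutation[1:-1]) - 1, mutation[-1]


def wt_marginal_score(seq: str, mutations: Sequence[str], log_probs: LogProbFn) -> float:
    """One forward pass on the unmasked wild type; sum log p(mt) - log p(wt)."""
    lp = log_probs(list(seq))
    score = 0.0
    for wt, pos, mt in map(parse_mutation, mutations):
        if seq[pos] != wt:
            raise ValueError(f"wild-type mismatch at position {pos + 1}")
        score += lp[pos][mt] - lp[pos][wt]
    return score


def masked_marginal_score(seq: str, mutations: Sequence[str], log_probs: LogProbFn) -> float:
    """Mask each mutated position in turn; sum log p(mt) - log p(wt) at the mask."""
    score = 0.0
    for wt, pos, mt in map(parse_mutation, mutations):
        if seq[pos] != wt:
            raise ValueError(f"wild-type mismatch at position {pos + 1}")
        masked = list(seq)
        masked[pos] = MASK
        lp = log_probs(masked)
        score += lp[pos][mt] - lp[pos][wt]
    return score


def toy_log_probs(tokens: Sequence[str]) -> List[Dict[str, float]]:
    """Toy two-letter 'model': confident in visible residues, uniform at a mask."""
    out = []
    for tok in tokens:
        if tok == MASK:
            out.append({"A": math.log(0.5), "G": math.log(0.5)})
        else:
            other = "G" if tok == "A" else "A"
            out.append({tok: math.log(0.9), other: math.log(0.1)})
    return out


if __name__ == "__main__":
    print("wt-marginal:    ", wt_marginal_score("AA", ["A1G"], toy_log_probs))
    print("masked-marginal:", masked_marginal_score("AA", ["A1G"], toy_log_probs))
```

On the toy model the two conventions give different scores for the same variant, which is why a re-score should record which convention was used alongside the ProteinGym version.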

bibliography · 1

manual finding

export

# AI-for-science benchmark state

This frontier holds 12 findings (0 accepted) from 1 source.

## Significance

- BENCHMARK CLAIM (ProteinGym) — ESM-1v REPORTS strong zero-shot substitution performance from masked-marginal scoring of a protein language model. VERIFICATION STATE: author-reported; weights public; depends on the scoring convention (masked-marginal vs wt-marginal) and the ProteinGym version. NOT re-run here. Open obligation: re-score the released model on the pinned v1.1 zero-shot substitution set. (reviewer:will-blair)
- BENCHMARK CLAIM (ProteinGym) — Tranception + retrieval (and TranceptEVE, combining Tranception with the EVE family model) REPORT leading zero-shot Spearman by mixing an autoregressive PLM with MSA-derived statistics. VERIFICATION STATE: author-reported; MSA-dependent, so the number moves with the alignment pipeline. NOT re-run here. Open obligation: reproduce with the stated MSAs and depth. (reviewer:will-blair)
- BENCHMARK CLAIM (MiniF2F) — Draft-Sketch-Prove (DSP) REPORTS an improved miniF2F-test pass rate by drafting an informal proof, sketching a formal skeleton, then closing gaps with an ATP. VERIFICATION STATE: author-reported; pipeline described; depends on the underlying ATP and the autoformalizer, both of which drift. NOT re-run here. Open obligation: reproduce with pinned ATP + LLM versions. (reviewer:will-blair)
- BENCHMARK CLAIM (ProteinGym) — ProteinNPT (non-parametric transformer, supervised track) REPORTS gains by attending across labelled neighbours. VERIFICATION STATE: author-reported; SUPERVISED — not comparable to zero-shot numbers; depends on the cross-validation split. NOT re-run here. Open obligation: re-run under the official supervised CV split; never compare against zero-shot rows. (reviewer:will-blair)
- BENCHMARK CLAIM (MiniF2F) — DeepSeek-Prover-V1.5 REPORTS a leading miniF2F-test pass rate under a large sampling budget (RMaxTS). VERIFICATION STATE: author-reported; model weights public; eval harness in the paper; dataset version = the team's stated split. NOT independently re-run in this frontier. Open obligation: pin the split, re-run the released checkpoint, audit train/test contamination of the formal statements. (reviewer:will-blair)
- LEAKAGE HAZARD (ProteinGym). PLMs trained on UniProt may have seen sequences related to the assay proteins; 'zero-shot' is zero-shot on the LABELS, not necessarily on the SEQUENCES. VERIFICATION STATE: training-sequence overlap with assay proteins is under-audited. Open obligation: a sequence-similarity leakage audit between each model's training set and the assay proteins before banking any SOTA claim. (reviewer:will-blair)
- BENCHMARK CLAIM (MiniF2F) — HyperTree Proof Search (HTPS, Lample et al.) REPORTS a miniF2F pass rate via learned best-first proof search. VERIFICATION STATE: author-reported; specific to the search budget and dataset version. NOT re-run here. Open obligation: re-run at the stated budget on a pinned split. (reviewer:will-blair)
- BENCHMARK CLAIM (ProteinGym) — EVE (evolutionary VAE over an MSA) REPORTS strong variant-effect prediction, especially for clinical variants. VERIFICATION STATE: author-reported; fully MSA-dependent; per-protein model fitting. NOT re-run here. Open obligation: re-fit on pinned MSAs and confirm the held-out assay Spearman. (reviewer:will-blair)
- BENCHMARK META (MiniF2F). MiniF2F is ~488 olympiad/textbook formal-math problems (AMC/AIME/IMO + MATH), ported to Lean/Isabelle/HOL-Light/Metamath, split valid/test. KNOWN TRUST ISSUE: multiple incompatible versions exist (original 2021, miniF2F-v2, and the 'miniF2F Revisited' cleanup with corrected/changed statements), so pass-rates across papers are version-ambiguous unless the exact split is pinned. STATE: dataset-version hazard, not a model claim. (reviewer:will-blair)
- FAITHFULNESS HAZARD (MiniF2F). A reported 'solve' is only as good as the autoformalized statement matching the intended problem; the miniF2F Revisited effort found statements that were mis-stated or trivially true. VERIFICATION STATE: faithfulness of the FORMAL statement to the INFORMAL problem is the under-checked axis. Open obligation: every banked miniF2F solve needs a statement-faithfulness attestation (vela attest --scope formalism-fidelity). (reviewer:will-blair)
- BENCHMARK META (ProteinGym). ProteinGym benchmarks variant-effect prediction against deep mutational scanning (DMS) assays: a substitution benchmark (~217 assays) and an indel benchmark, with zero-shot and supervised tracks, scored by Spearman correlation (and AUC/MCC). KNOWN TRUST ISSUE: v1.0 vs v1.1 differ in assay set and splits; zero-shot vs supervised numbers are not comparable; MSA-dependent methods vary with the MSA pipeline. STATE: dataset-version + track-conflation hazard. (reviewer:will-blair)
- BENCHMARK CLAIM (MiniF2F) — AlphaProof (DeepMind) solved several IMO-2024 problems formally in Lean; this is sometimes conflated with a miniF2F number. VERIFICATION STATE: the IMO result is a separate, time-bounded competition claim, NOT a miniF2F-test pass rate; no public checkpoint. Open obligation: do not record an AlphaProof miniF2F figure without a cited, pinned evaluation. (reviewer:will-blair)
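
The version and track hazards above amount to a rule: two reported numbers should be ranked against each other only when benchmark, dataset version, track (or split), and metric are all pinned and equal. A minimal sketch of such a guard follows. The `ReportedResult` record and `comparability_issues` helper are hypothetical names introduced for illustration and are not part of Vela or either benchmark, and the example records are placeholders rather than reported results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReportedResult:
    """Hypothetical record of an author-reported benchmark number (value omitted)."""
    model: str
    benchmark: str
    version: Optional[str]  # None means the paper's dataset version is not pinned
    track: Optional[str]    # e.g. "zero-shot" / "supervised", or a split name
    metric: str


def comparability_issues(a: ReportedResult, b: ReportedResult) -> List[str]:
    """Return the reasons two results must not be compared; empty list if none."""
    issues = []
    if a.benchmark != b.benchmark:
        issues.append(f"benchmark mismatch: {a.benchmark} vs {b.benchmark}")
    for field in ("version", "track"):
        va, vb = getattr(a, field), getattr(b, field)
        if va is None or vb is None:
            issues.append(f"{field} not pinned")
        elif va != vb:
            issues.append(f"{field} mismatch: {va} vs {vb}")
    if a.metric != b.metric:
        issues.append(f"metric mismatch: {a.metric} vs {b.metric}")
    return issues


if __name__ == "__main__":
    # Placeholder records for illustration only.
    zero_shot = ReportedResult("model-a", "ProteinGym", "v1.1", "zero-shot", "spearman")
    supervised = ReportedResult("model-b", "ProteinGym", "v1.1", "supervised", "spearman")
    unpinned = ReportedResult("model-c", "miniF2F", None, "test", "pass rate")
    pinned = ReportedResult("model-d", "miniF2F", "original-2021", "test", "pass rate")
    print(comparability_issues(zero_shot, supervised))
    print(comparability_issues(unpinned, pinned))
```

Returning the list of reasons, rather than a bare boolean, keeps the refusal auditable: the reason can be stored next to the finding that blocked the comparison.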
