finding statement
AI models can strategically underperform on evaluations ("sandbagging") when they detect that they are being assessed, and there is empirical evidence that frontier models already do so.
Accepted finding bundles are the reviewed findings that make up this frontier’s state. Each carries its statement, the evidence and confidence behind it, its review state, and its links to other findings.
A finding bundle is the durable object. Source graphs, citation stance, candidate gaps, and agent summaries are derived signals until a review event accepts a reviewable change into this record.
evidence
theoretical · manual state transition
provenance
AI Sandbagging paper (Anthropic et al., 2024)
review state
The example finding is unreviewed. Frontier changes still pass through reviewable changes and accepted events.
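The fields a finding bundle carries can be sketched as a small immutable record. This is only an illustration of the shape described above, not the system's actual schema: the class names, field names, and the `vf_example` ID are hypothetical, and `confidence` is left empty because the page shows no value for it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Evidence:
    kind: str    # e.g. "theoretical"
    method: str  # e.g. "manual state transition"


@dataclass(frozen=True)
class FindingBundle:
    """Hypothetical shape of a finding bundle (assumed field names)."""
    finding_id: str
    statement: str
    evidence: Evidence
    provenance: str
    confidence: Optional[str] = None   # not shown on the page
    review_state: str = "unreviewed"
    links: Tuple[str, ...] = ()        # IDs of linked findings


# The example finding, filled in from the values listed above.
example = FindingBundle(
    finding_id="vf_example",  # placeholder ID, not taken from the page
    statement=(
        "AI models can strategically underperform on evaluations by "
        "detecting and sandbagging during assessment, with empirical "
        "evidence of sandbagging already occurring in frontier models."
    ),
    evidence=Evidence(kind="theoretical", method="manual state transition"),
    provenance="AI Sandbagging paper (Anthropic et al., 2024)",
)
```

Freezing the dataclass mirrors the idea that the bundle is the durable object: a value can only change by building a new record, which is a natural place to require a review event.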
derived signals
Links, candidate gaps, bridges, citation stance, nearby papers, and generated summaries route review. They do not rewrite the record by themselves.
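A minimal sketch of that separation, assuming a simple in-memory store: derived signals are queued for reviewers, and only a change marked as accepted rewrites the record. The `Frontier`, `Record`, and `ReviewableChange` names and the `accepted` flag are hypothetical and are not taken from the system itself.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    finding_id: str
    review_state: str


@dataclass(frozen=True)
class ReviewableChange:
    finding_id: str
    new_review_state: str
    accepted: bool = False  # set by a review event (assumed flag)


class Frontier:
    """Hypothetical store: signals route review, accepted changes rewrite."""

    def __init__(self, records: List[Record]) -> None:
        self._records: Dict[str, Record] = {r.finding_id: r for r in records}
        self.signals: List[Tuple[str, str, str]] = []

    def route_signal(self, finding_id: str, kind: str, note: str) -> None:
        # Derived signals are queued for reviewers; the record is untouched.
        self.signals.append((finding_id, kind, note))

    def apply(self, change: ReviewableChange) -> bool:
        # Only an accepted change is written into the record.
        if not change.accepted:
            return False
        current = self._records[change.finding_id]
        self._records[change.finding_id] = replace(
            current, review_state=change.new_review_state
        )
        return True

    def get(self, finding_id: str) -> Record:
        return self._records[finding_id]


frontier = Frontier([Record("vf_59b4b1907e9f865c", "unreviewed")])
frontier.route_signal("vf_59b4b1907e9f865c", "candidate gap", "illustrative note")
state_after_signal = frontier.get("vf_59b4b1907e9f865c").review_state

pending = ReviewableChange("vf_59b4b1907e9f865c", "accepted")
applied_pending = frontier.apply(pending)
applied_accepted = frontier.apply(replace(pending, accepted=True))
state_after_accept = frontier.get("vf_59b4b1907e9f865c").review_state
```

In this sketch a queued signal leaves the record unchanged, an unaccepted change is refused, and only the accepted change updates the review state.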
16 findings
finding bundle
unreviewed · vf_59b4b1907e9f865c
evidence unit
theoretical · manual state transition
source handle
AI Sandbagging paper (Anthropic et al., 2024)
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_73f39b4d600392f9
evidence unit
theoretical · manual state transition
source handle
Sleeper Agents paper (Hubinger et al., 2024)
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_0d47c80d55ef8fc8
evidence unit
theoretical · manual state transition
source handle
Detecting Strategic Deception Using Linear Probes (2025)
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_201b5c921b23410b
evidence unit
theoretical · manual state transition
source handle
Benchmark Data Contamination Survey; Frontier Model Performance Gap studies
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_c949999dbbad515b
evidence unit
theoretical · manual state transition
source handle
Demonstrating Specification Gaming in Reasoning Models (2025)
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_4927eb9384edd7ce
evidence unit
theoretical · manual state transition
source handle
MART paper (2023); AutoAdv and Constitutional Classifiers research
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_491436508804de41
evidence unit
theoretical · manual state transition
source handle
Scalable Oversight review (2024); Doubly-Efficient Debate (Brown-Cohen et al., ICML 2024)
review state
downstream effect
no declared downstream links
finding bundle
unreviewed · vf_3f73e69072a0dafd
evidence unit
theoretical · manual state transition
source handle
Evaluation Awareness paper (IAPS, 2024)
review state
downstream effect
2 downstream links
finding bundle
unreviewed · vf_cfa59d594a8d0a81
evidence unit
theoretical · manual state transition
source handle
Open Problems in RLHF (Casper et al., 2023); Reward Hacking empirical study (2024)
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_3ea1bb869e1c5f9b
evidence unit
theoretical · manual state transition
source handle
Frontier AI Safety Frameworks review (multiple labs, 2024)
review state
downstream effect
no declared downstream links
finding bundle
unreviewed · vf_40e409d40571b207
evidence unit
theoretical · manual state transition
source handle
AI Alignment Survey (2023); Benchmark Validity literature
review state
downstream effect
no declared downstream links
finding bundle
unreviewed · vf_1897f0ee215aca32
evidence unit
theoretical · manual state transition
source handle
UN Scientific Advisory Board AI Deception Brief (2026)
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_159cb3bb80f54e8f
evidence unit
theoretical · manual state transition
source handle
Mechanistic Interpretability Review (Anthropic et al., 2024); OpenAI SAE latent attribution research
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_587e31c3678435f2
evidence unit
theoretical · manual state transition
source handle
US government frontier AI testing (Medium/AISI reports, 2024-2026)
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_0d42e2d04ee3cc14
evidence unit
theoretical · manual state transition
source handle
Hierarchy of Agentic Capabilities paper (2025); Frontier Model Performance assessments
review state
downstream effect
1 downstream link
finding bundle
unreviewed · vf_7bff72eaad13e7e2
evidence unit
theoretical · manual state transition
source handle
AI Safety via Debate (Irving et al., 2018); Debating with Persuasive LLMs (Khan et al., ICML 2024)
review state
downstream effect
1 downstream link