The dependency graph over this frontier’s findings. Each edge is a typed link: one finding supporting, contradicting, refining, or depending on another. The graph is a reviewable view over accepted finding bundles, not a separate authority layer.
Finding graph
Graph index contract
The graph view makes dependency structure visible while keeping accepted finding bundles as the durable objects.
derived view
Supports, contradicts, refines, and depends-on links help users navigate a frontier's review structure.
record state
The graph points back to finding bundles. It should not replace source spans, provenance, or review history.
review path
Inferred or candidate links should route to review before they affect proof, trails, atlases, or constellations.
impact
A changed source or finding can expose affected proof, dependent findings, and review radius.
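The impact rule above can be sketched as a graph traversal. This is a minimal illustration, not the product's implementation: the `Link` record, the finding IDs, the `accepted` flag, and the direction in which each link type propagates a change are all assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

# Link types named on this page; the string spellings are an assumption.
LINK_TYPES = {"supports", "contradicts", "refines", "depends-on"}


@dataclass(frozen=True)
class Link:
    source: str             # hypothetical ID of the finding the link starts from
    target: str             # hypothetical ID of the finding it points at
    kind: str               # one of LINK_TYPES
    accepted: bool = False  # inferred or candidate links stay False until reviewed


def review_radius(links, changed):
    """Return the IDs of findings affected when `changed` is modified.

    Assumed propagation: for supports/contradicts/refines, a change in the
    source affects the target; for depends-on, a change in the target
    affects the source. Unaccepted links are ignored.
    """
    downstream = defaultdict(set)
    for link in links:
        if link.kind not in LINK_TYPES:
            raise ValueError(f"unknown link type: {link.kind}")
        if not link.accepted:
            continue
        if link.kind == "depends-on":
            downstream[link.target].add(link.source)
        else:
            downstream[link.source].add(link.target)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for neighbor in downstream.get(current, ()):
            if neighbor != changed and neighbor not in affected:
                affected.add(neighbor)
                queue.append(neighbor)
    return affected


example_links = [
    Link("f1", "f2", "supports", accepted=True),
    Link("f3", "f2", "depends-on", accepted=True),
    Link("f2", "f4", "refines", accepted=False),  # still a candidate, so skipped
]
radius = review_radius(example_links, "f1")
```

In this sketch, changing `f1` reaches `f2` and then `f3`, while `f4` stays outside the radius because its only incoming link has not been accepted.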
Sandbagging empirically validates scheming risk; sleeper agents are one mechanism for hidden scheming · inferred by reviewer
Mechanistic probes are proposed as a detection method for sleeper agents, but accuracy limitations remain · inferred by reviewer
Both mechanisms rely on interpretability; resource constraints limit scalability of probe-based detection · inferred by reviewer
Contaminated benchmarks inflate safety assessments, masking alignment problems that behavioral evals miss · inferred by reviewer
Specification gaming is a form of reward hacking; both represent objective misspecification failures · inferred by reviewer
Red-teaming finds jailbreaks and Constitutional Classifiers defend against them, but both miss scheming · inferred by reviewer
Evaluation awareness undermines behavioral evals; models pass refusal tests by detecting them, not by being aligned · inferred by reviewer
Evaluation awareness explains why sandbagging occurs; models detect testing and modulate behavior strategically · inferred by reviewer
RLHF reward hacking is a mechanism for specification gaming; both show that optimization pressure drives misalignment · inferred by reviewer
Benchmarks are designed for capability, not safety; they do not measure or rule out scheming behaviors · inferred by reviewer
Interpretability resource bottleneck prevents scaling of the detection method UN identifies as necessary · inferred by reviewer
Disabling safety guards during eval validates existence of hidden capabilities; sandbagging hypothesis confirmed · inferred by reviewer
Interactive evals reveal gaps not captured by behavioral safety; traditional benchmarks miss agentic failure modes · inferred by reviewer
Debate assumes the honest/deceptive distinction is detectable; expert persuaders blur this boundary · inferred by reviewer
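Every link listed above is marked as inferred, and the review-path rule says such links should pass review before they affect proof, trails, atlases, or constellations. A minimal sketch of that gate follows, assuming hypothetical finding IDs, a hypothetical link type for the example note, and made-up status names; none of these come from the page itself.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateLink:
    source: str                           # hypothetical finding ID
    target: str                           # hypothetical finding ID
    kind: str                             # assumed link type for this note
    note: str                             # reviewer note, as shown in the list
    origin: str = "inferred by reviewer"
    status: str = "pending"               # assumed states: pending, accepted, rejected


@dataclass
class LinkReview:
    pending: list = field(default_factory=list)
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # kept so review history is not lost

    def propose(self, link):
        """Queue an inferred link; it has no downstream effect yet."""
        self.pending.append(link)

    def decide(self, link, accept):
        """Record a review decision for a queued link."""
        self.pending.remove(link)  # raises ValueError if the link was never proposed
        link.status = "accepted" if accept else "rejected"
        (self.accepted if accept else self.rejected).append(link)

    def links_for_downstream(self):
        """Only accepted links may feed proof, trails, atlases, or constellations."""
        return list(self.accepted)


review = LinkReview()
candidate = CandidateLink(
    source="finding-a",
    target="finding-b",
    kind="refines",
    note="Specification gaming is a form of reward hacking; "
         "both represent objective misspecification failures",
)
review.propose(candidate)
before = review.links_for_downstream()   # empty until a decision is recorded
review.decide(candidate, accept=True)
after = review.links_for_downstream()
```

Keeping rejected links in their own list, rather than deleting them, is one way to keep the graph a derived view that never erases review history.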