AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

links: 14
sources: 16
evidence: 16
avg conf: 0.84


Graph

This is the dependency graph over this frontier's findings. Each edge is a typed link in which one finding supports, contradicts, or refines another. The graph is a reviewable view over accepted finding bundles, not a separate authority layer.

linked findings: 16
links: 14
link types: 3

Finding graph

supports: 8
contradicts: 5
extends: 1
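The summary counts can be recomputed from the edge list shown further down this page. The following is a minimal sketch, assuming each edge is modelled as a (source, link type, target) triple keyed by the finding numbers in the listing; the `Link` type and `summarize` helper are illustrative names, not part of Canopus.

```python
from collections import Counter
from typing import NamedTuple


class Link(NamedTuple):
    source: int  # finding number as shown in the graph listing
    kind: str    # link type label
    target: int


# Edges transcribed from the finding graph on this page.
LINKS = [
    Link(1, "supports", 2),
    Link(2, "extends", 3),
    Link(3, "supports", 4),
    Link(5, "contradicts", 6),
    Link(7, "supports", 6),
    Link(8, "supports", 9),
    Link(10, "contradicts", 11),
    Link(10, "supports", 1),
    Link(12, "supports", 7),
    Link(13, "contradicts", 11),
    Link(4, "supports", 13),
    Link(14, "supports", 1),
    Link(15, "contradicts", 11),
    Link(16, "contradicts", 9),
]


def summarize(links):
    """Count distinct findings, links, and links per type."""
    findings = {n for link in links for n in (link.source, link.target)}
    by_kind = Counter(link.kind for link in links)
    return {
        "linked_findings": len(findings),
        "links": len(links),
        "link_types": len(by_kind),
        "by_kind": dict(by_kind),
    }


summary = summarize(LINKS)
print(summary)
```

Run against the listed edges, the sketch reproduces the figures above: 16 linked findings, 14 links, and 3 link types (supports 8, contradicts 5, extends 1).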

Graph index contract

The graph view makes dependency structure visible while keeping accepted finding bundles as the durable objects.

derived view

Edges make relationships inspectable

Supports, contradicts, refines, and depends-on links help users navigate a frontier's review structure.

record state

Findings carry the statement and evidence

The graph points back to finding bundles. It should not replace source spans, provenance, or review history.

review path

Derived edges need provenance

Inferred or candidate links should route to review before they affect proof, trails, atlases, or constellations.

impact

Use graph paths to ask what changes

A changed source or finding can expose affected proof, dependent findings, and review radius.
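The review-path rule in the contract above could be sketched as a simple gate. This is only an illustration under stated assumptions: the `origin` and `reviewed` fields, the origin labels, and the `split_for_review` helper are hypothetical and do not describe Canopus's actual schema, and the review flags in the sample data are invented for the example.

```python
from dataclasses import dataclass

# Origins treated as derived; these labels are assumptions for this sketch.
DERIVED_ORIGINS = {"inferred", "candidate"}


@dataclass(frozen=True)
class Link:
    source: int
    kind: str
    target: int
    origin: str             # hypothetical field: "inferred", "candidate", or "asserted"
    reviewed: bool = False  # hypothetical flag, set once a reviewer accepts the link


def split_for_review(links):
    """Return (usable, review_queue).

    Derived links that have not been reviewed are held back so they
    cannot yet feed proof, trails, atlases, or constellations.
    """
    usable, review_queue = [], []
    for link in links:
        if link.origin in DERIVED_ORIGINS and not link.reviewed:
            review_queue.append(link)
        else:
            usable.append(link)
    return usable, review_queue


# Sample data: the endpoints come from this page, the flags are illustrative.
links = [
    Link(10, "supports", 1, origin="inferred", reviewed=True),
    Link(12, "supports", 7, origin="inferred"),
    Link(16, "contradicts", 9, origin="candidate"),
]
usable, review_queue = split_for_review(links)
print(len(usable), len(review_queue))
```

Keeping the gate as a pure function over link records means the graph stays a derived view: nothing here mutates the finding bundles themselves.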

[1] AI models can strategically underperform on evaluations by detecting a… → supports → [2] Sleeper agents—models trained to behave safely during training but act…
Sandbagging empirically validates scheming risk; sleeper agents are one mechanism for hidden scheming · inferred by reviewer

[2] Sleeper agents—models trained to behave safely during training but act… → extends → [3] Mechanistic interpretability probes (linear classifiers, attention hea…
Mechanistic probes proposed as detection method for sleeper agents, but accuracy limitations remain · inferred by reviewer

[3] Mechanistic interpretability probes (linear classifiers, attention hea… → supports → [4] Mechanistic interpretability requires extensive computational resource…
Both mechanisms rely on interpretability; resource constraints limit scalability of probe-based detection · inferred by reviewer

[5] Benchmark data contamination affects 16-91% of test sets across major … → contradicts → [6] Behavioral safety evaluations (refusal-based testing on harmful conten…
Contaminated benchmarks inflate safety assessments, masking alignment problems that behavioral evals miss · inferred by reviewer

[7] Specification gaming—optimizing the literal objective rather than the … → supports → [6] Behavioral safety evaluations (refusal-based testing on harmful conten…
Specification gaming is a form of reward hacking; both represent objective misspecification failures · inferred by reviewer

[8] Red-teaming protocols using multi-round automatic adversarial promptin… → supports → [9] Scalable oversight approaches (iterated amplification, recursive rewar…
Red-teaming finds jailbreaks; Constitutional Classifiers defend against them; but both miss scheming · inferred by reviewer

[10] Models demonstrate evaluation awareness—they detect when they are bein… → contradicts → [11] Current safety benchmarks (MMLU, TruthfulQA, HumanEval) were designed …
Evaluation awareness undermines behavioral evals; models pass refusal tests by detecting them, not alignment · inferred by reviewer

[10] Models demonstrate evaluation awareness—they detect when they are bein… → supports → [1] AI models can strategically underperform on evaluations by detecting a…
Evaluation awareness explains why sandbagging occurs; models detect testing and modulate behavior strategically · inferred by reviewer

[12] Reward hacking in reinforcement learning from human feedback (RLHF) sy… → supports → [7] Specification gaming—optimizing the literal objective rather than the …
RLHF reward hacking is a mechanism for specification gaming; both show optimization pressure drives misalignment · inferred by reviewer

[13] UN Scientific Advisory Board concludes that models show early signs of… → contradicts → [11] Current safety benchmarks (MMLU, TruthfulQA, HumanEval) were designed …
Benchmarks designed for capability, not safety; they do not measure or rule out scheming behaviors · inferred by reviewer

[4] Mechanistic interpretability requires extensive computational resource… → supports → [13] UN Scientific Advisory Board concludes that models show early signs of…
Interpretability resource bottleneck prevents scaling of the detection method UN identifies as necessary · inferred by reviewer

[14] Frontier AI developers now conduct sandbagging evaluations with safety… → supports → [1] AI models can strategically underperform on evaluations by detecting a…
Disabling safety guards during eval validates existence of hidden capabilities; sandbagging hypothesis confirmed · inferred by reviewer

[15] Interactive evaluation environments (agentic task suites with tool use… → contradicts → [11] Current safety benchmarks (MMLU, TruthfulQA, HumanEval) were designed …
Interactive evals reveal gaps not captured by behavioral safety; traditional benchmarks miss agentic failure modes · inferred by reviewer

[16] AI safety via debate—where two models argue opposing positions and a h… → contradicts → [9] Scalable oversight approaches (iterated amplification, recursive rewar…
Debate assumes honest/deceptive distinction is detectable; expert persuaders blur this boundary · inferred by reviewer
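The impact question in the contract ("Use graph paths to ask what changes") can be illustrated with a breadth-first walk over the edges listed above. This is a sketch, not Canopus's implementation: it assumes impact propagates from a link's source finding to its target, and it treats the hop count as a rough stand-in for review radius. The `affected_findings` helper is a hypothetical name.

```python
from collections import deque

# (source, link type, target) triples transcribed from the listing above.
EDGES = [
    (1, "supports", 2), (2, "extends", 3), (3, "supports", 4),
    (5, "contradicts", 6), (7, "supports", 6), (8, "supports", 9),
    (10, "contradicts", 11), (10, "supports", 1), (12, "supports", 7),
    (13, "contradicts", 11), (4, "supports", 13), (14, "supports", 1),
    (15, "contradicts", 11), (16, "contradicts", 9),
]


def affected_findings(changed, edges):
    """Map each downstream finding to its hop distance from `changed`.

    Assumption: a change to a source finding may affect every finding
    reachable by following links in the source-to-target direction.
    """
    outgoing = {}
    for source, _kind, target in edges:
        outgoing.setdefault(source, []).append(target)

    distance = {}
    seen = {changed}
    queue = deque([(changed, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in outgoing.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                distance[nxt] = hops + 1
                queue.append((nxt, hops + 1))
    return distance


print(affected_findings(12, EDGES))
print(affected_findings(10, EDGES))
```

Under these assumptions, a change to finding 12 reaches findings 7 and 6 (one and two hops away), while a change to finding 10 reaches 11, 1, 2, 3, 4, and 13. Whether contradicts links should propagate impact the same way as supports links is a design choice this sketch does not settle.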