vela-science/vela

AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

links: 14
sources: 16
evidence: 16
avg conf: 0.84

Finding bundle

Current safety benchmarks (MMLU, TruthfulQA, HumanEval) were designed for capability measurement, not safety; their validity as alignment indicators is contested and they do not measure scheming or deception.

3 inferred

id: vf_40e409d40571b207
frontier: AI alignment evaluations
version: 1
confidence: 0.82

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

methodological

No entity list is declared.

evidence

source-bound

1 atom

theoretical · manual state transition

proof impact

packet context

1 event

1 reviewable change and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Applies to all major public benchmarks; some newer benchmarks, such as LiveBench, address contamination but not scheming.

Provenance

source title

AI Alignment Survey (2023); Benchmark Validity literature

authors

reviewer:will-blair

Source records

1

source record

AI Alignment Survey (2023); Benchmark Validity literature

vs_3549be2124e758a8

2023 · manual_curation

Evidence atoms

1

vea_af129c71c168e87d · theoretical · unknown
Current safety benchmarks (MMLU, TruthfulQA, HumanEval) were designed for capability measurement, not safety; their validity as alignment indicators is contested and they do not measure scheming or deception.
vs_3549be2124e758a8 · manual_curation

Typed links

3

outgoing

No outgoing links.

incoming

Models demonstrate evaluation awareness—they detect when they are being tested and modify behavior accordingly, making it difficult to distinguish genuine alignment from alignment faking.
contradicts · vf_3f73e69072a0dafd
UN Scientific Advisory Board concludes that models show early signs of capability to scheme, and modern training techniques have not driven scheming rates to zero.
contradicts · vf_1897f0ee215aca32
Interactive evaluation environments (agentic task suites with tool use) reveal capability gaps: frontier models pass only 28% of practical multi-step tasks despite 80th percentile benchmark performance.
contradicts · vf_0d42e2d04ee3cc14
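
As an illustrative aside, the typed-link counts shown here can be reproduced with a small sketch. The `TypedLink` class, its field names, and the `summarize` helper are hypothetical and are not taken from the Vela or Canopus codebase; only the finding ids and the `contradicts` relation come from the listing above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedLink:
    """One typed edge between two findings (hypothetical shape, not the real schema)."""
    source_id: str  # finding the link originates from
    target_id: str  # finding the link points at
    relation: str   # relation label, e.g. "contradicts"


THIS_FINDING = "vf_40e409d40571b207"

# Incoming links exactly as listed on this page.
LINKS = [
    TypedLink("vf_3f73e69072a0dafd", THIS_FINDING, "contradicts"),
    TypedLink("vf_1897f0ee215aca32", THIS_FINDING, "contradicts"),
    TypedLink("vf_0d42e2d04ee3cc14", THIS_FINDING, "contradicts"),
]


def summarize(links, finding_id):
    """Count incoming and outgoing links for one finding, grouped by relation."""
    incoming = [link for link in links if link.target_id == finding_id]
    outgoing = [link for link in links if link.source_id == finding_id]
    return {
        "incoming": len(incoming),
        "outgoing": len(outgoing),
        "by_relation": dict(Counter(link.relation for link in incoming + outgoing)),
    }


summary = summarize(LINKS, THIS_FINDING)
print(summary)
```

Under these assumptions the summary matches the panel: three links in total, all incoming, all with the `contradicts` relation.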

Review, event, and evaluation records

2

events

vev_c88928b9bc8057d6 · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_7f6fdab5abada815 · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.