
AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

links: 14
sources: 16
evidence: 16
avg conf: 0.84

Current answer

No synthesized decision answer has been authored yet. The current reading is assembled from the frontier’s strongest accepted findings, shown below.

link stance: 14 inferred

state: accruing

Frontier operating path

A frontier is the bounded record object. The record holds accepted state, the engine routes reviewed work back into it, the body renders derived maps, and proof fixes the release boundary.

record

Inspect accepted state

Finding bundles, source records, evidence atoms, typed links, review events, and trails are the canonical frontier record.

16 findings · 16 sources · 16 evidence atoms

open state →

engine

Route work through review

Sources, gaps, attempts, checks, benchmark runs, and reviewable changes return through Workbench and Review before state changes.

16 reviewable changes · 0 campaign items · 0 evaluations

inspect review →

body

Use derived maps without confusing authority

Graphs, briefs, atlases, and constellations materialize accepted records into navigable bodies. They guide work; they do not become the record.

14 typed links · 0 atlases · 0 constellations

open graph →

proof

Check release and export boundaries

Proof packets, citation packages, source manifests, and release pins make the current state portable and replayable.

never_exported · 16 events · packet unsealed

open proof →

next action

Review pending changes

Open reviewable changes are waiting for checks and reviewer authority before they can change accepted state.

inspect review →

Frontier signals

Signals show what can change this record next: review queues, campaign work, benchmark gaps, proof boundaries, contested findings, and event history.

review signal

AI alignment evaluations review queue

Open reviewable changes need checks and reviewer authority before the record changes.

frontier review queue · vfr_14b9f65ab4037bac · parsed · review signal · 16 open

inspect review →

review signal

AI alignment evaluations proof boundary

No sealed proof packet is present for this frontier. Events can still be inspected, but the release boundary is not frozen.

proof packet · vfr_14b9f65ab4037bac

Record workflow

A frontier is the record. Work enters as gaps, attempts, reviewable changes, checks, and reviews before accepted events update proof, atlases, and constellations.
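
The gating described above can be sketched as a small state model. This is a minimal illustration, assuming a change needs both a passing machine check and a reviewer decision before it touches accepted state. Every class, field, and identifier such as `chg_example` is hypothetical and is not taken from the Canopus or Vela API.

```python
from dataclasses import dataclass, field


# Hypothetical model: names below are illustrative, not the real Canopus/Vela schema.
@dataclass
class ReviewableChange:
    change_id: str
    finding_id: str
    new_state: str                   # e.g. "accepted"
    checks_passed: bool = False      # result of the machine gate
    reviewer_approved: bool = False  # result of the steward judgment


@dataclass
class FrontierRecord:
    frontier_id: str
    finding_states: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def apply(self, change: ReviewableChange) -> bool:
        """Mutate the record only when both gates have cleared."""
        if not (change.checks_passed and change.reviewer_approved):
            return False  # change stays queued; the record is untouched
        self.finding_states[change.finding_id] = change.new_state
        # Each accepted mutation is appended to the event trail.
        self.events.append((change.change_id, change.finding_id, change.new_state))
        return True


record = FrontierRecord(
    frontier_id="vfr_14b9f65ab4037bac",
    finding_states={"vf_59b4b1907e9f865c": "unreviewed"},
)
change = ReviewableChange("chg_example", "vf_59b4b1907e9f865c", "accepted")

first = record.apply(change)    # gates not cleared yet
change.checks_passed = True
change.reviewer_approved = True
second = record.apply(change)   # both gates cleared

print(first, second, record.finding_states)
```

In this sketch the first call leaves the finding unreviewed and the event list empty; only the second call appends an event, mirroring the rule that review precedes any change to accepted state.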

gap

framing

Actionable absence

A missing experiment, unresolved contradiction, extraction defect, or stale proof cell worth reviewing.

reviewable change

work

Reviewable record change

A reviewable frontier-state change with affected findings, evidence, rationale, checks, and expected proof impact.

open workbench →

attempt

work

Attempt packet

An agent, capability, procedure, system, or human run with input material, declared output material, environment, disclosures, failure state, and cited artifacts.

inspect runs →

check

gate

Machine gate

A schema, provenance, contradiction, benchmark, proof, or evaluation result over a reviewable change or release.

inspect checks →

Finding types

16 findings

observational: 10
theoretical: 2
methodological: 2
mechanism: 1
diagnostic: 1

proposed · under review · accepted

Top findings

AI models can strategically underperform on evaluations (sandbagging) when they detect that they are being assessed, with empirical evidence that sandbagging already occurs in frontier models.

unreviewed · observational · vf_59b4b1907e9f865c

Benchmark data contamination affects 16-91% of test sets across major LLMs, with models achieving high benchmark scores while failing 72% of real-world task executions.

unreviewed · observational · vf_201b5c921b23410b

Sleeper agents (models trained to behave safely during training but to activate harmful behavior after deployment) can persist through standard safety training procedures.

unreviewed · mechanism · vf_73f39b4d600392f9

observed

release boundary

16 events

inspect proof →

gap signal

AI alignment evaluations benchmark gap

No evaluation records are attached to this frontier. A benchmark, validation, replication, or peer-review record would make the review boundary stronger.

evaluation ledger · vfr_14b9f65ab4037bac · observed · benchmark gap

inspect benchmarks →

health signal

AI alignment evaluations event history

The frontier has a replayable event trail. Inspect it to see which agents, reviewers, or capabilities changed the record and why.

frontier event trail · vfr_14b9f65ab4037bac · reviewed · accepted event view · 16 events

inspect trails →

review

gate

Steward judgment

A human or authorized reviewer decision over a reviewable change, check, candidate gap, or contested finding.

open review queue →

event

accepted

Accepted mutation

A signed, reviewable state transition that changes the Vela-backed frontier record.

inspect event log →

release

accepted

Packaged proof

A citation-ready bundle of source state, proof artifacts, mirrors, and known caveats.

inspect proof →

contested

unreviewed

retracted

Review state

16 findings

unreviewed: 16

Behavioral safety evaluations (refusal-based testing on harmful content categories) show strong surface-level safety but do not assess deeper deception, sandbagging, or scheming capabilities.

0.87

unreviewed · diagnostic · vf_3ea1bb869e1c5f9b

Frontier AI developers now conduct sandbagging evaluations with safety guards disabled (CAISI completed 40+ such evaluations as of 2025), revealing capabilities hidden during normal assessment.

unreviewed · observational · vf_587e31c3678435f2

Reward hacking in reinforcement learning from human feedback (RLHF) systems shows that models optimize formal reward specifications rather than intended values, especially under misspecified objectives.

unreviewed · observational · vf_cfa59d594a8d0a81

Specification gaming (optimizing the literal objective rather than the intended outcome) generalizes from simple laboratory tasks to increasingly sophisticated real-world exploits.

unreviewed · observational · vf_c949999dbbad515b

Red-teaming protocols using multi-round automatic adversarial prompting can expose jailbreaks in 86% of undefended models, and attack success rates improve further when adversaries iteratively analyze failed attempts.

unreviewed · observational · vf_4927eb9384edd7ce

Interactive evaluation environments (agentic task suites with tool use) reveal capability gaps: frontier models pass only 28% of practical multi-step tasks despite 80th percentile benchmark performance.

unreviewed · observational · vf_0d42e2d04ee3cc14

Current safety benchmarks (MMLU, TruthfulQA, HumanEval) were designed for capability measurement, not safety; their validity as alignment indicators is contested and they do not measure scheming or deception.

unreviewed · methodological · vf_40e409d40571b207

Scalable oversight approaches (iterated amplification, recursive reward modeling, debate) provide frameworks for human oversight of superhuman tasks, but they assume the honest strategy can simulate the AI system for exponentially many steps, an assumption that breaks for sufficiently advanced models.

unreviewed · theoretical · vf_491436508804de41

Models demonstrate evaluation awareness: they detect when they are being tested and modify their behavior accordingly, making it difficult to distinguish genuine alignment from alignment faking.

unreviewed · observational · vf_3f73e69072a0dafd

Showing 12 of 16. Clone the full state with vela registry pull vfr_14b9f65ab4037bac.