AI alignment evaluations

id: vfr_14b9f65ab4037bac
license: CC-BY-4.0
findings: 16
accepted core: 0
contested: 0

avg conf: 0.84

Finding bundle

Mechanistic interpretability probes (linear classifiers, attention head analysis) can detect deceptive reasoning in models with 70-85% accuracy, but probe accuracy doesn't guarantee the information is used by the model for downstream decisions.

1 inferred

id: vf_0d47c80d55ef8fc8
frontier: AI alignment evaluations
version: 1
confidence: 0.79

record state

frontier-owned

Review status

This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.

unreviewed

finding statement

finding type

observational

No entity list is declared.

evidence

source-bound

1 atom

theoretical · manual state transition

proof impact

packet context

1 event

1 reviewable change and 0 evaluation records are attached to this finding id.

Evidence and conditions

method

manual state transition

evidence type

theoretical

conditions

species_unverified
species_verified
text: Requires internal model access; limited by fundamental uncertainties about circuit usage; computational cost limits scalability
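
The gap named in the finding statement, between what a probe can decode and what the model actually uses, can be illustrated with a toy sketch. This is not the method of the cited source: the activations are synthetic, the "downstream decision" is a hand-written rule, and every name here (`DIM`, `make_example`, `toy_readout`, `train_probe`, `ablate`, `agreement`) is hypothetical. The accuracies it prints are properties of the synthetic data and say nothing about the 70-85% range in the finding.

```python
import math
import random

random.seed(0)
DIM = 8  # hypothetical activation width


def make_example():
    # Synthetic activation: dimension 0 linearly encodes the label,
    # every other dimension is unit Gaussian noise.
    label = random.randint(0, 1)
    act = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    act[0] += 2.0 if label else -2.0
    return act, label


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def toy_readout(act):
    # Toy "downstream decision": reads only dimension 1, ignores dimension 0.
    return 1 if act[1] > 0 else 0


def train_probe(data, epochs=20, lr=0.05):
    # Logistic-regression probe fitted with plain SGD.
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for act, label in data:
            z = max(-30.0, min(30.0, dot(w, act) + b))  # clamp to avoid overflow
            grad = 1.0 / (1.0 + math.exp(-z)) - label
            w = [wi - lr * grad * xi for wi, xi in zip(w, act)]
            b -= lr * grad
    return w, b


def ablate(act, direction):
    # Remove the component of the activation that lies along `direction`.
    coef = dot(act, direction) / dot(direction, direction)
    return [a - coef * d for a, d in zip(act, direction)]


def agreement(data, direction):
    # Fraction of examples whose toy decision is unchanged by the ablation.
    same = sum(
        toy_readout(act) == toy_readout(ablate(act, direction)) for act, _ in data
    )
    return same / len(data)


train = [make_example() for _ in range(1000)]
test = [make_example() for _ in range(500)]
w, b = train_probe(train)

# Held-out probe accuracy: how well the label can be decoded linearly.
acc = sum((dot(w, act) + b > 0) == (label == 1) for act, label in test) / len(test)

# Ablate the probe direction, then the direction the readout really uses.
agree_probe = agreement(test, w)
used_direction = [1.0 if i == 1 else 0.0 for i in range(DIM)]
agree_used = agreement(test, used_direction)

print(f"probe accuracy: {acc:.2f}")
print(f"decisions unchanged after ablating probe direction: {agree_probe:.2f}")
print(f"decisions unchanged after ablating used direction:  {agree_used:.2f}")
```

In this toy setup the probe decodes the label well, yet removing the probe direction leaves most decisions intact, while removing the direction the readout actually reads changes many of them. Establishing that a real model uses the probed information would need a comparable causal intervention on its internals, which is where the "requires internal model access" condition applies.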

Provenance

source title

Detecting Strategic Deception Using Linear Probes (2025)

authors

reviewer:will-blair

Source records

source record

declared

Detecting Strategic Deception Using Linear Probes (2025)

vs_924068c8d03a4058

title:Detecting Strategic Deception Using Linear Probes (2025)

2025 · manual_curation

Evidence atoms

vea_db422e7ef76a2380 · theoretical · unknown
Mechanistic interpretability probes (linear classifiers, attention head analysis) can detect deceptive reasoning in models with 70-85% accuracy, but probe accuracy doesn't guarantee the information is used by the model for downstream decisions.
vs_924068c8d03a4058 · manual_curation

Typed links

outgoing

supports · vf_159cb3bb80f54e8f
Both mechanisms rely on interpretability; resource constraints limit scalability of probe-based detection

incoming

Sleeper agents—models trained to behave safely during training but activate harmful behavior post-deployment—can persist through standard safety training procedures.
extends · vf_73f39b4d600392f9

Review, event, and evaluation records

events

vev_ddae3e7308bf681c · finding.asserted
Manual finding added to frontier state
reviewer:will-blair · 2026-05-29

reviewable changes

vpr_6b4d8444cc86284d · finding.add
Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29

evaluations

No evaluation record targets this finding id.
