record state
frontier-owned
Review status
This finding is part of accepted frontier state. Review events, reviewable changes, and proof state explain how it can change.
Finding bundle
finding statement
finding type
No entity list is declared.
evidence
source-bound
theoretical · manual state transition
proof impact
packet context
1 reviewable change and 0 evaluation records are attached to this finding id.
Evidence and conditions
method
manual state transition
evidence type
theoretical
conditions
Provenance
source title
Detecting Strategic Deception Using Linear Probes (2025)
authors
reviewer:will-blair
Mechanistic interpretability probes (linear classifiers, attention head analysis) can detect deceptive reasoning in models with 70-85% accuracy, but probe accuracy doesn't guarantee the information is used by the model for downstream decisions.
vs_924068c8d03a4058 · manual_curation
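The finding statement above separates two questions: whether a linear probe can decode a label from activations, and whether the model actually uses that information downstream. The following is a minimal sketch of that distinction, assuming synthetic Gaussian vectors as a stand-in for real hidden activations. The names `make_activations`, `train_probe`, and `toy_downstream` are hypothetical, and nothing here reproduces the cited source's method or its reported accuracy range.

```python
import math
import random

DIM = 8  # assumed activation width for this toy example


def make_activations(n, rng, signal=1.0):
    """Synthetic stand-in for activations: the label shifts dimension 0 only."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        x[0] += signal if label == 1 else -signal
        data.append((x, label))
    return data


def train_probe(data, lr=0.05, epochs=30):
    """Fit a logistic-regression probe (a linear classifier) with plain SGD."""
    w = [0.0] * DIM
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples where the probe's sign matches the label."""
    hits = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(data)


def toy_downstream(x):
    """Hypothetical downstream decision that reads dimension 1 only."""
    return x[1] > 0


rng = random.Random(0)
train = make_activations(400, rng)
test = make_activations(200, rng)

w, b = train_probe(train)
probe_acc = probe_accuracy(w, b, test)

# Ablation check: erase the label-carrying dimension and count how many
# downstream decisions change. Here the answer is zero by construction,
# because toy_downstream never reads dimension 0.
changed = 0
for x, _ in test:
    ablated = [0.0] + x[1:]
    changed += int(toy_downstream(x) != toy_downstream(ablated))
changed_fraction = changed / len(test)

print(f"probe accuracy: {probe_acc:.2f}")
print(f"downstream decisions changed by ablation: {changed_fraction:.2f}")
```

In this toy setup the probe decodes the label well above chance, yet ablating the label-carrying dimension leaves every downstream decision unchanged. On a real model the ablation would target a learned direction rather than a known axis, so the intervention itself becomes part of what needs validating.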
outgoing
vf_159cb3bb80f54e8f · Both mechanisms rely on interpretability; resource constraints limit scalability of probe-based detection
events
vev_ddae3e7308bf681c · finding.asserted · Manual finding added to frontier state
reviewer:will-blair · 2026-05-29
reviewable changes
vpr_6b4d8444cc86284d · finding.add · Manual finding added to frontier state
applied · reviewer:will-blair · 2026-05-29
evaluations
No evaluation record targets this finding id.