vela-science/vela

frontier

open-transformer-circuit-evidence

Open-transformer circuit evidence

id: vfr_c94f4f79b4e341ad
license: CC-BY-4.0
findings: 42
accepted core: 14
contested: 0

links: 6
sources: 22
evidence: 42
avg conf: 0.73

State

The accepted finding bundles: the reviewed findings that make up this frontier’s state. Each carries its statement, the evidence and confidence behind it, its review state, and its links to other findings.

by type

computational: 22
observational: 9
theoretical: 9
methodological: 2

by review state

unreviewed: 42

finding bundle anatomy

A finding bundle is the durable object. Source graphs, citation stance, candidate gaps, and agent summaries are derived signals until a review event accepts a reviewable change into this record.

finding statement

In GPT-2 small, attention head L5H1 acts as an induction head (induction score > 0.9, replicated across five held-out random-repeat prompts and confirmed in distilgpt2).

computational

evidence

theoretical · manual state transition

42 atoms

provenance

manual finding

review state

The example finding is unreviewed. Frontier changes still pass through reviewable changes and accepted events.

167 accepted events

derived signals

Links, candidate gaps, bridges, citation stance, nearby papers, and generated summaries route review. They do not rewrite the record by themselves.

5 linked findings · 186 reviewable changes
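The anatomy above can be sketched as a small record type. This is only an illustration: the class name, field names, and the `is_accepted` helper are hypothetical and are not taken from the Vela schema. The example values are copied from the first finding listed on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FindingBundle:
    """Hypothetical shape of a finding bundle; field names are assumptions."""
    finding_id: str
    statement: str
    finding_type: str          # e.g. "computational", "observational"
    confidence: float          # 0.0 to 1.0
    evidence_unit: str
    source_handle: str
    review_state: str = "unreviewed"
    downstream_links: List[str] = field(default_factory=list)

    def is_accepted(self) -> bool:
        # Derived signals do not change this; only an accepted review event would.
        return self.review_state == "accepted"


example = FindingBundle(
    finding_id="vf_41d92edaba755cd1",
    statement=(
        "In GPT-2 small, attention head L5H1 acts as an induction head "
        "(induction score > 0.9, replicated across five held-out "
        "random-repeat prompts and confirmed in distilgpt2)."
    ),
    finding_type="computational",
    confidence=0.90,
    evidence_unit="theoretical · manual state transition",
    source_handle="manual finding",
)
print(example.review_state, example.is_accepted())
```

Keeping `review_state` as a plain field with a default mirrors the page's point that a bundle stays unreviewed until a review event says otherwise.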

42 findings

finding bundle
unreviewed
In GPT-2 small, attention head L5H1 acts as an induction head (induction score > 0.9, replicated across five held-out random-repeat prompts and confirmed in distilgpt2).
vf_41d92edaba755cd1
computational · 0.90 confidence
evidence unit
theoretical · manual state transition
source handle
manual finding
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L0H5 acts as a duplicate token head (min held-out score 0.6371 across 5 prompts; role also present in distilgpt2).
vf_4bca49d448bf6d36
computational · 0.64 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L3H0 acts as a duplicate token head (min held-out score 0.6205 across 5 prompts; role also present in distilgpt2).
vf_61257c9696ec7855
computational · 0.62 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L5H1 acts as an induction head (min held-out score 0.9004 across 5 prompts; role also present in distilgpt2).
vf_76ddacf8a33ee0cd
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L5H5 acts as an induction head (min held-out score 0.919 across 5 prompts; role also present in distilgpt2).
vf_23bfd31465df9981
computational · 0.92 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L6H9 acts as an induction head (min held-out score 0.8665 across 5 prompts; role also present in distilgpt2).
vf_a75996be47f4909d
computational · 0.87 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L7H10 acts as an induction head (min held-out score 0.9004 across 5 prompts; role also present in distilgpt2).
vf_78e0a960c131c2be
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L2H2 acts as a previous token head (min held-out score 0.475 across 5 prompts; role also present in distilgpt2).
vf_7c015dca286a122a
computational · 0.47 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L3H7 acts as a previous token head (min held-out score 0.5084 across 5 prompts; role also present in distilgpt2).
vf_f4c4057af4f9c42e
computational · 0.51 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L4H11 acts as a previous token head (min held-out score 0.9823 across 5 prompts; role also present in distilgpt2).
vf_6384b864e00c7479
computational · 0.98 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
In gpt2, attention head L6H8 acts as a previous token head (min held-out score 0.4766 across 5 prompts; role also present in distilgpt2).
vf_11f63ac26bc557f4
computational · 0.48 confidence
evidence unit
computational · manual state transition
source handle
mechinterp circuit harness
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
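The role findings above report a per-head score on repeated-sequence prompts, taking the minimum over five held-out prompts. The page does not give the harness's exact score definitions, so the sketch below assumes the common ones for a prompt made of a random block of `half` tokens repeated twice: a previous-token head attends from position i to i - 1, a duplicate-token head attends from the second copy of a token to its first copy, and an induction head attends from the second copy to the token that followed the first copy. The function names are hypothetical, and the attention patterns are synthetic toys, not model output.

```python
import numpy as np


def previous_token_score(attn: np.ndarray) -> float:
    """Mean attention from each query position i to key position i - 1."""
    n = attn.shape[0]
    return float(np.mean([attn[i, i - 1] for i in range(1, n)]))


def duplicate_token_score(attn: np.ndarray, half: int) -> float:
    """Mean attention from second-copy position half + j to first-copy position j."""
    return float(np.mean([attn[half + j, j] for j in range(half)]))


def induction_score(attn: np.ndarray, half: int) -> float:
    """Mean attention from second-copy position half + j to position j + 1,
    the token that followed the earlier occurrence of the current token."""
    return float(np.mean([attn[half + j, j + 1] for j in range(half)]))


def synthetic_induction_pattern(half: int, weight: float) -> np.ndarray:
    """Toy causal attention pattern (NOT model output): second-copy rows put
    `weight` on the induction target and the remainder on position 0."""
    n = 2 * half
    attn = np.zeros((n, n))
    attn[:half, 0] = 1.0  # first-copy rows just attend to position 0
    for j in range(half):
        attn[half + j, 0] += 1.0 - weight
        attn[half + j, j + 1] += weight
    return attn


# Illustrative weights only; these are made up, not measured scores.
toy_weights = [0.95, 0.90, 0.92, 0.97, 0.91]
patterns = [synthetic_induction_pattern(8, w) for w in toy_weights]
scores = [induction_score(p, half=8) for p in patterns]
min_held_out = min(scores)  # the "min held-out score" style of aggregate
print(round(min_held_out, 4))
```

Reporting the minimum rather than the mean is the conservative choice: a head only keeps its role label if it scores well on every held-out prompt.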
finding bundle
unreviewed
The set of 15 induction heads in gpt2 (top heads L7H10, L5H1, L5H5, L6H9, L7H2) is causally necessary for in-context repetition: group mean-ablation raises repeat-prediction loss by Delta=8.354 nats (per-seed deltas 8.56, 8.40, 8.96, 8.13, 7.72, all positive), versus ~0 for a control head set.
vf_59285c747f083c24
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
The set of 6 induction heads in distilgpt2 (top heads L3H10, L3H2, L4H6, L4H9, L3H11, L5H9) is causally necessary for in-context repetition: group mean-ablation raises repeat-prediction loss by Delta=8.401 nats (per-seed deltas 9.30, 9.00, 8.48, 7.55, 7.67, all positive), versus ~0 for a control head set.
vf_d6c317a8ced07e36
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
The set of 5 induction heads in pythia-70m (top heads L3H6, L3H5, L3H1, L3H0, L4H7) is causally necessary for in-context repetition: group mean-ablation raises repeat-prediction loss by Delta=4.949 nats (per-seed deltas 4.03, 3.85, 5.22, 5.54, 6.11, all positive), versus ~0 for a control head set.
vf_3595f5e61d02769d
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
The set of 18 induction heads in pythia-160m (top heads L4H6, L8H2, L4H10, L4H8, L5H0) is causally necessary for in-context repetition: group mean-ablation raises repeat-prediction loss by Delta=11.520 nats (per-seed deltas 8.81, 8.55, 13.17, 10.99, 16.08, all positive), versus ~0 for a control head set.
vf_da0f52be04ee54bc
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
The set of 29 induction heads in gpt2-medium (top heads L9H9, L11H1, L18H5, L7H2, L6H1) is causally necessary for in-context repetition: group mean-ablation raises repeat-prediction loss by Delta=8.586 nats (per-seed deltas 8.53, 8.61, 8.84, 8.55, 8.41, all positive), versus ~0 for a control head set.
vf_1565bc7a58ce0108
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
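The five causal-sweep findings above each quote a group delta alongside five per-seed deltas. A quick check, assuming the group delta is the plain mean of the per-seed values (the page does not state the aggregation rule), reproduces the reported figures to within rounding. All numbers below are copied from the findings; the variable names are illustrative.

```python
from statistics import mean

# Per-seed loss increases (nats) under group mean-ablation, as listed above.
per_seed_deltas = {
    "gpt2": [8.56, 8.40, 8.96, 8.13, 7.72],
    "distilgpt2": [9.30, 9.00, 8.48, 7.55, 7.67],
    "pythia-70m": [4.03, 3.85, 5.22, 5.54, 6.11],
    "pythia-160m": [8.81, 8.55, 13.17, 10.99, 16.08],
    "gpt2-medium": [8.53, 8.61, 8.84, 8.55, 8.41],
}

# Group deltas as reported in the finding statements.
reported_group_delta = {
    "gpt2": 8.354,
    "distilgpt2": 8.401,
    "pythia-70m": 4.949,
    "pythia-160m": 11.520,
    "gpt2-medium": 8.586,
}

summary = {}
for model, deltas in per_seed_deltas.items():
    summary[model] = {
        "mean_delta": mean(deltas),
        "all_positive": all(d > 0 for d in deltas),
        "gap_to_reported": abs(mean(deltas) - reported_group_delta[model]),
    }
    print(model, round(summary[model]["mean_delta"], 3), summary[model]["all_positive"])
```

The small gaps are consistent with the per-seed values having been rounded to two decimals before display.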
finding bundle
unreviewed
Individual induction heads in gpt2 are highly redundant: single-head mean-ablation deltas are tiny relative to the group delta of 8.354 (L7H10 = 0.0011, ~0%; L5H5 = 0.053, 0.6%), with the sole partial exception L5H1 = 0.714 (8.5%). Group/single amplification is 11.7x, so causal necessity is a property of the ensemble, not any single head.
vf_3e68a029a451f01f
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Individual induction heads in gpt2-medium are highly redundant: the top-3 attention-scoring heads contribute single-ablation deltas of only 0.048 (L9H9), 0.011 (L11H1), and 0.024 (L18H5) against a 29-head group delta of 8.586 — each under 0.6% of the group effect — indicating dense distributed representation across the ensemble.
vf_4547fdc4c89640a3
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Induction-head redundancy is not universal across scale: in pythia-70m individual heads remain load-bearing, with single-ablation deltas of 0.61 (L3H6, 12% of group), 0.58 (L3H5, 12%), and 1.22 (L3H1, 25%) against a group delta of 4.949 — a cooperative rather than purely-redundant circuit, contrasting the heavy redundancy seen in the larger GPT2-family and pythia-160m models.
vf_b9b16eba11ee2668
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
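The redundancy comparison in the three findings above comes down to two ratios: each head's single-ablation delta as a share of the group delta, and the group delta divided by the largest single-head delta. The sketch below recomputes both from the quoted numbers; the helper names `share` and `amplification` are illustrative, and reading "group/single amplification" as group delta over the largest single delta is an assumption that happens to reproduce the reported 11.7x.

```python
# (group delta, {head: single-head mean-ablation delta}), copied from the findings above.
ablation = {
    "gpt2": (8.354, {"L7H10": 0.0011, "L5H5": 0.053, "L5H1": 0.714}),
    "gpt2-medium": (8.586, {"L9H9": 0.048, "L11H1": 0.011, "L18H5": 0.024}),
    "pythia-70m": (4.949, {"L3H6": 0.61, "L3H5": 0.58, "L3H1": 1.22}),
}


def share(single_delta: float, group_delta: float) -> float:
    """Fraction of the group effect reproduced by ablating one head alone."""
    return single_delta / group_delta


def amplification(group_delta: float, single_deltas: dict) -> float:
    """Group delta divided by the largest single-head delta (assumed definition)."""
    return group_delta / max(single_deltas.values())


shares = {
    model: {head: share(d, group) for head, d in singles.items()}
    for model, (group, singles) in ablation.items()
}
gpt2_amplification = amplification(*ablation["gpt2"])

for model, per_head in shares.items():
    print(model, {h: f"{100 * s:.1f}%" for h, s in per_head.items()})
print("gpt2 amplification:", round(gpt2_amplification, 1))
```

On these numbers no gpt2-medium head exceeds 0.6% of the group effect, while the pythia-70m heads each account for roughly 12% to 25%, which is the contrast the findings draw.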
finding bundle
unreviewed
Across all five transformers (gpt2, distilgpt2, pythia-70m, pythia-160m, gpt2-medium), the in-context-repetition circuit is universal and stereotyped by depth: duplicate-token heads emerge shallowest (layers 0-1), previous-token heads at early-to-mid depth, and induction heads in the middle-to-deep band (~30-90% relative depth); the induction-head group is causally necessary in every model (group-ablation delta 4.95-11.52 nats, all per-seed deltas positive).
vf_f41d996b7c3aea03
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Induction head circuits for pattern matching scale consistently from 13B to 70B models with a preserved copy-match mechanism, but task-specific semantic variants (n-gram, concept-level) emerge, and there is no evidence that the base mechanism remains universally identical in larger models.
vf_de8d1e8b8646db6a
observational · 0.78 confidence
evidence unit
theoretical · manual state transition
source handle
Beyond Induction Heads (2025); Induction Heads & In-Context Learning (emergentmind.com)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Attention heads in transformers exhibit bimodal stability: middle layers show <40% seed-consistent head mappings despite identical optimizer/architecture, while early and late layers exceed 70% stability, directly contradicting universality claims for circuits derived from single model instances.
vf_182cb4f97418dc5d
observational · 0.82 confidence · 1 link
evidence unit
theoretical · manual state transition
source handle
Quantifying LLM Attention-Head Stability (2026)
review state
unreviewed
downstream effect
1 downstream link
inspect finding →
finding bundle
unreviewed
Sparse Autoencoders achieve a 68% improvement in concept separability (0.405→0.680) and reduce polysemantic neuron overlap from ~30% to ~10%, yet increasing sparsity beyond the optimal threshold degrades monosemanticity by up to 18%, indicating a non-monotonic relationship between sparsity and interpretability.
vf_ec692e5e87df4298
observational · 0.85 confidence
evidence unit
theoretical · manual state transition
source handle
Evaluating Sparse Autoencoders for Monosemantic Representation (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
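The 68% figure in the finding above is a relative improvement, not a difference in separability points. A short check using only the two endpoint values quoted in the statement (the variable names are illustrative):

```python
# Concept separability before and after, as quoted in the finding statement.
separability_before = 0.405
separability_after = 0.680

absolute_gain = separability_after - separability_before
relative_gain = absolute_gain / separability_before

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {100 * relative_gain:.0f}%")
```

The absolute gain is 0.275 separability points; dividing by the 0.405 baseline gives the roughly 68% relative improvement cited.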
finding bundle
unreviewed
Feature-level mechanistic similarity across Transformer and Mamba architectures averages 0.74 Pearson correlation for simple features (tokens, parts-of-speech) but drops to 0.3-0.8 for complex semantic features, with an 'off-by-one' position shift emerging as an irreducible architectural difference.
vf_3b8b7f0eb1dd2305
observational · 0.79 confidence · 1 link
evidence unit
theoretical · manual state transition
source handle
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures (2024)
review state
unreviewed
downstream effect
1 downstream link
inspect finding →
finding bundle
unreviewed
Hybrid circuit discovery (combining Edge Attribution Patching with Edge Pruning) achieves 46% speedup over pure pruning while maintaining faithfulness, yet misses cooperative multi-head inhibition patterns that EAP alone cannot detect due to zero individual attribution scores.
vf_3cbd8304240a3219
observational · 0.81 confidence · 2 links
evidence unit
theoretical · manual state transition
source handle
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework (2024)
review state
unreviewed
downstream effect
2 downstream links
inspect finding →
finding bundle
unreviewed
Residual stream activations show 30-50% higher seed consistency than individual attention head activations across model refits, suggesting residual-level features rather than head-specific circuits provide more robust ground truth for circuit discovery and universality claims.
vf_841cfc4181456ee4
theoretical · 0.72 confidence · 1 link
evidence unit
theoretical · manual state transition
source handle
Quantifying LLM Attention-Head Stability (2026)
review state
unreviewed
downstream effect
1 downstream link
inspect finding →
finding bundle
unreviewed
In-context meta-learning circuits emerge in three sequential phases (non-context, semi-context, full-context) with distinct attention patterns, predicting that phase transitions occur at different model scales and that phase timing correlates with downstream task generalization on out-of-distribution prompts.
vf_39e5f05fb44e2760
theoretical · 0.68 confidence
evidence unit
theoretical · manual state transition
source handle
Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence (2025)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Circuit-Aware Reward Training methodology identifies specialized neural circuits in RLHF reward models responsible for longtail distribution failures and reward hacking, predicting that mechanistic oversight via circuit ablation reduces spurious reward alignment by >40% on adversarial examples.
vf_9e8edcb419fd0229
theoretical · 0.65 confidence · 1 link
evidence unit
theoretical · manual state transition
source handle
Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF (2025)
review state
unreviewed
downstream effect
1 downstream link
inspect finding →
finding bundle
unreviewed
Polysemantic features in SAEs persist at ~10%+ of salient neurons despite 68% separability improvement, predicting that complete monosemanticity requires either higher expansion ratios (>32×) or algorithmic changes to enforce orthogonal concept bases.
vf_05353ab782524863
theoretical · 0.71 confidence
evidence unit
theoretical · manual state transition
source handle
Evaluating Sparse Autoencoders for Monosemantic Representation (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Weight-sparse transformers (<40% parameter density) develop interpretable, disentangled circuits that require fewer and more modular causal intervention points compared to dense models, predicting that sparse models allow subcircuit discovery 3-5× faster than dense equivalents.
vf_7a690fac11e87c30
theoretical · 0.64 confidence
evidence unit
theoretical · manual state transition
source handle
Weight-sparse transformers have interpretable circuits (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Contextual decomposition for transformers (CD-T) explains the contribution of input feature combinations without model modification, predicting that CD-T scales to attribution of sub-circuit interactions (e.g., cross-layer attention) with <20% computational overhead compared to activation patching.
vf_efe9ddeab6b12e54
theoretical · 0.62 confidence
evidence unit
theoretical · manual state transition
source handle
Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Mechanistic interpretability of transformers at multiple scales demonstrates consistent algorithmic motifs (e.g., induction heads, S-inhibition patterns), predicting that these motifs are necessary solutions to universal attention-like computation and will appear in all scale regimes and diverse architectural variants.
vf_777e7fc8759edf7e
theoretical · 0.70 confidence
evidence unit
theoretical · manual state transition
source handle
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Can circuits discovered on dense 13B models via seed-independent SAE extraction transfer to sparse 70B models without requiring new circuit discovery, and if so, does the transfer fidelity remain >75% on adversarial test cases?
vf_f23ae921f4703f37
observational · 0.45 confidence
evidence unit
theoretical · manual state transition
source handle
Open Problems in Mechanistic Interpretability (Jan 2025); Quantifying LLM Attention-Head Stability (2026)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
What determines the layer depth at which specific circuit motifs (e.g., composition, memorization, in-context learning) emerge during training, and does this emergence order remain invariant across model scales (7B→70B) and architectures (Transformer→Mamba)?
vf_b10bdbb1f34c381e
observational · 0.40 confidence
evidence unit
theoretical · manual state transition
source handle
Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence (2025); Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Why do middle layers (l=8-12 in 24-layer models) show 40-60% lower attention head seed consistency despite being most performant for downstream tasks, and does this instability reflect genuine functional redundancy or suboptimal training dynamics?
vf_672bfc44704d40fe
observational · 0.55 confidence
evidence unit
theoretical · manual state transition
source handle
Quantifying LLM Attention-Head Stability: Implications for Circuit Universality (2026)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Can automated circuit discovery (ACDC, EAP, HAP) scale beyond 13B parameters to GPT-4 scale (100B+) without exponential growth in compute, and what is the theoretical lower bound on circuit faithfulness for any pruning-based method?
vf_7945671662762af1
methodological · 0.35 confidence
evidence unit
theoretical · manual state transition
source handle
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework (2024); Towards Automated Circuit Discovery for Mechanistic Interpretability (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
What is the minimal SAE expansion factor needed to eliminate the remaining ~10% polysemantic neurons in LLM features, and does this threshold depend on model size, dataset, or semantic domain?
vf_6d5cc01c380c814c
observational · 0.50 confidence
evidence unit
theoretical · manual state transition
source handle
Evaluating Sparse Autoencoders for Monosemantic Representation (2024); Sparse Autoencoders Find Highly Interpretable Features in Language Models (2023)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Are induction head generalizations (n-gram copying, concept-level matching) implemented by parametric variations of a single universal circuit motif, or do they represent fundamentally distinct computational patterns that coincidentally perform similar input-output transformations?
vf_22ce0bb4da3c4146
theoretical · 0.48 confidence
evidence unit
theoretical · manual state transition
source handle
Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence (2025); Induction Heads in Transformers (emergentmind.com)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Why do circuit faithfulness metrics (KL divergence, logit difference) fail to detect cooperative inhibition heads that individually score near-zero on attribution but prove critical for behavior, and what principled attribution metric would catch such non-additive interactions?
vf_d3dd34cd06e3d5ce
methodological · 0.60 confidence
evidence unit
theoretical · manual state transition
source handle
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework (2024); Transformer Circuit Faithfulness Metrics are not Robust (2024)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Can mechanistic interpretability principles (circuit universality, causal interventions, sparse activation) scale to explain and steer multi-agent emergent behaviors in transformer ensembles or mixture-of-experts, or do system-level interactions exceed individual circuit analysis?
vf_9d8e0c6b076d22e3
theoretical · 0.30 confidence
evidence unit
theoretical · manual state transition
source handle
Open Problems in Mechanistic Interpretability (Jan 2025)
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
finding bundle
unreviewed
Across the 9 successfully-swept models spanning 70M-1B params and four architecture families (Pythia 70m/160m/1b, GPT-2 distil/base/medium/large, OPT-125m, GPT-Neo-125M), induction-head COUNT rises with model size only within a fixed architecture (GPT-2: 6->15->29->34; Pythia: 5->18 from 70m to 160m) but NOT globally: Spearman(params, head-count)=+0.60 and the largest model pythia-1b regresses to just 10 heads. The group-ablation EFFECT does not scale with size at all (Spearman(params, group-delta)=-0.12; range 4.87-11.52 nats with no size trend). The robust scaling claim is therefore: detecte
vf_1251bfd72b49c1ef
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep wave2
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →
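The within-family versus global contrast in the finding above can be illustrated with a rank correlation on the head counts it quotes. The page does not list counts for OPT-125m or GPT-Neo-125M, so the global Spearman of +0.60 cannot be recomputed here; the sketch covers only the two families whose counts are given. Model sizes are nominal parameter counts in millions and are an assumption of this example (only their ordering affects the result). The `ranks` and `spearman` helpers are a minimal implementation that is valid only when there are no ties.

```python
def ranks(values):
    """1-based ranks of `values`; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# (nominal params in millions, induction-head count from the finding above)
gpt2_family = [(82, 6), (124, 15), (355, 29), (774, 34)]   # distil, base, medium, large
pythia_family = [(70, 5), (160, 18), (1000, 10)]           # 70m, 160m, 1b

rho_gpt2 = spearman(*zip(*gpt2_family))
rho_pythia = spearman(*zip(*pythia_family))
print("GPT-2 family:", rho_gpt2)
print("Pythia family:", rho_pythia)
```

The GPT-2 family is perfectly monotone, while including pythia-1b's 10 heads pulls the Pythia correlation down, which matches the finding's note that the largest model regresses.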
finding bundle
unreviewed
The induction circuit and its component-role depth ordering (duplicate-token heads shallowest, then previous-token heads, then induction heads deepest) is universal across the newly-swept architectures OPT-125m, GPT-Neo-125M, GPT-2-large, and Pythia-1b, extending the prior 5-model result to four additional/new architecture families. Concretely: OPT-125m duplicate at L0-L2, previous-token L0-L7, induction L5-L11; GPT-Neo duplicate at L0/L6, previous-token L3-L5, induction L6-L8; gpt2-large duplicate at L0-L12, previous-token concentrated L4-L20, induction L11-L34; pythia-1b duplicate L1/L4/L5,
vf_47c5956978a83e65
computational · 0.90 confidence
evidence unit
computational · manual state transition
source handle
mechinterp causal sweep wave2
review state
unreviewed
downstream effect
no declared downstream links
inspect finding →