transformer_lens.tools.analysis.direct_logit_attribution module
Direct Logit Attribution (DLA).
Direct Logit Attribution decomposes a model’s output logit (or a logit difference between a correct and an incorrect token) into the additive contributions of upstream components — the embedding, each attention and MLP sublayer, or each individual attention head. Because the unembedding is linear and the residual stream is a sum of component outputs, the final logit is (up to the final LayerNorm) a sum of per-component dot products with the unembedding direction of the token of interest. DLA reads off those dot products. See the logit lens and Interpretability in the Wild for the canonical uses.
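As a rough illustration of the additivity described above, the toy sketch below uses plain Python lists in place of tensors and ignores the final LayerNorm. The component vectors and the `unembed_direction` values are invented for the example and do not come from any model.

```python
# Toy residual stream: the final residual vector is the sum of component outputs.
# All labels and numbers are made up for illustration.
components = {
    "embed": [0.5, -1.0, 2.0],
    "0_attn_out": [1.0, 0.0, -0.5],
    "0_mlp_out": [-0.25, 2.0, 1.0],
}

# Hypothetical unembedding column for the token of interest.
unembed_direction = [2.0, 1.0, -1.0]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Per-component attribution: dot product with the unembedding direction.
attribution = {
    label: dot(vec, unembed_direction) for label, vec in components.items()
}

# The logit computed the usual way, from the summed residual stream.
residual = [sum(vals) for vals in zip(*components.values())]
logit = dot(residual, unembed_direction)

for label, value in attribution.items():
    print(f"{label:>12}: {value:+.3f}")
print(f"{'sum':>12}: {sum(attribution.values()):+.3f}  (logit = {logit:+.3f})")
```

Because the dot product is linear, the per-component values add up to the logit of the summed residual vector, which is the property DLA relies on.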
This module exposes a single entry point, direct_logit_attribution(), that wraps the lower-level ActivationCache primitives (decompose_resid(), accumulated_resid(), stack_head_results() and logit_attrs()) into one call. It works unchanged with both HookedTransformer and TransformerBridge because they share the cache API.
Example:
from transformer_lens import HookedTransformer
from transformer_lens.tools.analysis import direct_logit_attribution

model = HookedTransformer.from_pretrained("gpt2", device="cpu")
result = direct_logit_attribution(
    model,
    "The Eiffel Tower is in the city of",
    answer_tokens=" Paris",
    incorrect_tokens=" London",
    unit="component",
)
# Print one line per component: its label and its logit-difference attribution.
for label, value in zip(result.labels, result.attribution.squeeze()):
    print(f"{label:>12}: {value.item():+.3f}")
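The example above attributes a logit difference rather than a single logit. A minimal sketch of what that projection amounts to is given below, assuming made-up unembedding columns in a hypothetical `w_u` dictionary and invented component vectors: projecting onto the difference of two unembedding columns gives the same result as subtracting the two per-token attributions.

```python
# Hypothetical unembedding columns for two tokens (values are made up).
w_u = {
    " Paris": [1.0, 2.0, 0.0],
    " London": [0.5, -1.0, 1.0],
}

# Invented component outputs at the final position.
components = {
    "embed": [2.0, 0.0, 1.0],
    "0_attn_out": [0.0, 1.5, -1.0],
}


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Direction for the logit difference: answer column minus incorrect column.
diff_direction = [a - b for a, b in zip(w_u[" Paris"], w_u[" London"])]

# Attribution along the difference direction.
diff_attr = {label: dot(vec, diff_direction) for label, vec in components.items()}

# Same quantity computed as two separate attributions, then subtracted.
separate = {
    label: dot(vec, w_u[" Paris"]) - dot(vec, w_u[" London"])
    for label, vec in components.items()
}

print(diff_attr)
print(separate)
```

A positive value means the component pushes the model toward the answer token relative to the incorrect one; a value near zero means it does not distinguish the two.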
- class transformer_lens.tools.analysis.direct_logit_attribution.DirectLogitAttribution(attribution: Float[Tensor, 'component *batch_and_pos'], labels: List[str], unit: str)
Bases: object
Result of a direct_logit_attribution() call.
- attribution
Tensor of logit (or logit-difference) attributions with shape [component, *batch_and_pos]. The leading axis is aligned with labels. When pos selects a single position (the default), the position axis is dropped, leaving [component, batch], or [component] if the cache had its batch dimension removed.
- Type:
jaxtyping.Float[Tensor, 'component *batch_and_pos']
- labels
Human-readable name for each component, aligned with the leading axis of attribution (e.g. "embed", "0_attn_out", "L3H7").
- Type:
List[str]
- unit
The decomposition unit used ("component", "layer", or "head").
- Type:
str
- attribution: Float[Tensor, 'component *batch_and_pos']
- labels: List[str]
- top(k: int = 5) → List[tuple]
Return the k highest-attribution (label, value) pairs.
Attribution is reduced to a scalar per component by taking the mean over any remaining batch/position dimensions, so this is most meaningful when a single position was selected.
- unit: str
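The reduction that top() describes can be sketched in plain Python. This is not the library's implementation: `top_k` is a hypothetical helper, nested lists stand in for the attribution tensor, and ranking by signed value (rather than by magnitude) is an assumption.

```python
from statistics import fmean


def top_k(labels, attribution, k=5):
    """Return the k (label, value) pairs with the highest mean attribution.

    attribution[i] holds the values for labels[i] across whatever
    batch/position entries remain (a stand-in for the tensor's trailing axes).
    """
    # Reduce each component to a single scalar by averaging.
    scores = [fmean(row) for row in attribution]
    # Assumption: rank by signed value, largest first.
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


labels = ["embed", "0_attn_out", "0_mlp_out"]
attribution = [[0.5, 1.5], [3.0, 1.0], [-1.0, -2.0]]  # made-up values
top = top_k(labels, attribution, k=2)
print(top)
```

Averaging over several positions can hide a component that matters at only one of them, which is why selecting a single position first tends to give a more interpretable ranking.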
- transformer_lens.tools.analysis.direct_logit_attribution.direct_logit_attribution(model, input: str | List[str] | Tensor | None = None, answer_tokens: str | int | Tensor | None = None, incorrect_tokens: str | int | Tensor | None = None, *, unit: str = 'component', pos: int | Tuple[int] | Tuple[int, int] | Tuple[int, int, int] | List[int] | Tensor | ndarray | None = -1, cache: ActivationCache | None = None) → DirectLogitAttribution
Compute Direct Logit Attribution for a prompt.
Decomposes the contribution of model components to the logit of answer_tokens (or, if incorrect_tokens is given, to the logit difference answer - incorrect along the W_U direction, which is usually what you want for circuit analysis).
The model is run once with caching unless a precomputed cache is passed. Works with both HookedTransformer and TransformerBridge.
Note that DLA attributes only the part of a logit that comes from the residual stream through the unembedding direction; the unembedding bias b_U is a per-token constant that no component produces. So a complete decomposition reconstructs logit[token] - b_U[token] rather than the raw logit.
On a TransformerBridge, compatibility mode must be enabled (so the final LayerNorm is folded into W_U); otherwise the projection direction would be wrong and the resulting numbers silently incorrect. Hybrid architectures (Mamba/SSM/Mixer/LinearAttention) are not yet supported because decompose_resid only understands the attn_out + mlp_out block layout. Both conditions raise an explicit error at call time.
- Parameters:
model – A HookedTransformer or TransformerBridge (the latter with enable_compatibility_mode() already called).
input – Prompt to run: a string, list of strings, or token tensor. Optional only when a precomputed cache is supplied.
answer_tokens – The correct token(s) to attribute, as a string, id, or tensor. A string is converted with model.to_single_token.
incorrect_tokens – Optional baseline token(s). When given, attribution is computed for the answer - incorrect residual direction. Must broadcast to the same shape as answer_tokens.
unit – Decomposition granularity:
- "component" (default): embedding + each layer’s attention and MLP output (via decompose_resid).
- "layer": cumulative residual stream after each sublayer, i.e. logit-lens trajectory (via accumulated_resid).
- "head": each attention head individually, plus a remainder term for everything else (via stack_head_results).
pos – Sequence position(s) to attribute. Defaults to -1 (the final token, the usual choice for next-token DLA). Pass None to keep every position (the result then has a trailing position axis).
cache – Optional precomputed ActivationCache to reuse instead of running the model again.
- Returns:
A DirectLogitAttribution with attribution (shape [component, *batch_and_pos]) and aligned labels.
- Raises:
ValueError – If unit is invalid, answer_tokens is None, neither input nor cache is provided, or a TransformerBridge is passed without compatibility mode enabled.
NotImplementedError – If a TransformerBridge reports a hybrid block layout (Mamba/SSM/Mixer/LinearAttention).
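Two points from the description above, the remainder term of the "head" unit and the exclusion of b_U, can be checked together on toy numbers. The sketch below is pure Python with invented values; the "remainder" label, the `direction` vector and the `b_u` scalar are hypothetical stand-ins, and the final LayerNorm is ignored.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Made-up outputs of two attention heads and the full final residual vector.
heads = {"L0H0": [1.0, 0.0], "L0H1": [0.0, 2.0]}
residual = [3.0, 1.0]

# Everything the heads did not write (embedding, MLPs, biases) is the remainder.
head_sum = [sum(vals) for vals in zip(*heads.values())]
remainder = [r - h for r, h in zip(residual, head_sum)]

direction = [0.5, 2.0]  # hypothetical unembedding column for the answer token
b_u = 0.25              # hypothetical unembedding bias for that token

attribution = {label: dot(vec, direction) for label, vec in heads.items()}
attribution["remainder"] = dot(remainder, direction)

# The raw logit includes the bias; the attributions do not.
raw_logit = dot(residual, direction) + b_u

print(attribution)
print(sum(attribution.values()), raw_logit - b_u)
```

With the remainder included, the attributions sum to the raw logit minus the bias, so a completeness check against a real model's logits would need to subtract b_U first.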