transformer_lens.tools.analysis.direct_logit_attribution module

Direct Logit Attribution (DLA).

Direct Logit Attribution decomposes a model’s output logit (or a logit difference between a correct and an incorrect token) into the additive contributions of upstream components — the embedding, each attention and MLP sublayer, or each individual attention head. Because the unembedding is linear and the residual stream is a sum of component outputs, the final logit is (up to the final LayerNorm) a sum of per-component dot products with the unembedding direction of the token of interest. DLA reads off those dot products. See the logit lens and Interpretability in the Wild for the canonical uses.
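
The additivity argument above can be checked on a toy example. The sketch below uses made-up three-dimensional vectors, ignores the final LayerNorm, and does not call TransformerLens at all; the component names only mirror the label style used by this module. It shows that the per-component dot products along the logit-difference direction sum to the logit difference of the full residual stream.

```python
# Toy illustration only: made-up numbers, no LayerNorm, no real model.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical per-component outputs written into a 3-dimensional residual stream.
components = {
    "embed":      [0.5, -1.0, 0.25],
    "0_attn_out": [1.0, 0.5, -0.5],
    "0_mlp_out":  [-0.25, 2.0, 1.0],
}

# Hypothetical unembedding columns for the correct and the incorrect token.
w_u_correct = [1.0, 0.0, 2.0]
w_u_incorrect = [0.0, 1.0, 1.0]
direction = [c - i for c, i in zip(w_u_correct, w_u_incorrect)]

# The residual stream is the sum of the component outputs.
residual = [sum(vals) for vals in zip(*components.values())]

# DLA: one dot product per component along the logit-difference direction.
attribution = {name: dot(vec, direction) for name, vec in components.items()}

# By linearity, the per-component values add up to the full logit difference.
logit_diff = dot(residual, direction)
assert abs(sum(attribution.values()) - logit_diff) < 1e-9

for name, value in attribution.items():
    print(f"{name:>12}: {value:+.3f}")
```

In a real model the final LayerNorm rescales the residual stream before unembedding, so the identity only holds once that scaling is accounted for.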

This module exposes a single entry point, direct_logit_attribution(), that wraps the lower-level ActivationCache primitives (decompose_resid(), accumulated_resid(), stack_head_results() and logit_attrs()) into one call. It works unchanged with both HookedTransformer and TransformerBridge because they share the cache API.

Example:

from transformer_lens import HookedTransformer
from transformer_lens.tools.analysis import direct_logit_attribution

model = HookedTransformer.from_pretrained("gpt2", device="cpu")
result = direct_logit_attribution(
    model,
    "The Eiffel Tower is in the city of",
    answer_tokens=" Paris",
    incorrect_tokens=" London",
    unit="component",
)
for label, value in zip(result.labels, result.attribution.squeeze()):
    print(f"{label:>12}: {value.item():+.3f}")

class transformer_lens.tools.analysis.direct_logit_attribution.DirectLogitAttribution(attribution: Float[Tensor, 'component *batch_and_pos'], labels: List[str], unit: str)

Bases: object

Result of a direct_logit_attribution() call.

attribution

Tensor of logit (or logit-difference) attributions with shape [component, *batch_and_pos]. The leading axis is aligned with labels. When pos selects a single position (the default) the position axis is dropped, leaving [component, batch] — or [component] if the cache had its batch dimension removed.

Type: jaxtyping.Float[Tensor, 'component *batch_and_pos']

labels

Human-readable name for each component, aligned with the leading axis of attribution (e.g. "embed", "0_attn_out", "L3H7").

Type: List[str]

unit

The decomposition unit used ("component", "layer", or "head").

Type: str

attribution: Float[Tensor, 'component *batch_and_pos']

labels: List[str]

top(k: int = 5) → List[tuple]

Return the k highest-attribution (label, value) pairs.

Attribution is reduced to a scalar per component by averaging over any remaining batch/position dimensions, so this is most meaningful when a single position was selected.
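
The reduce-then-rank behaviour described above can be sketched in plain Python. The function top_sketch below is a hypothetical stand-in, not the library's implementation: it assumes ranking by signed value (not absolute value) and represents the attribution tensor as nested lists with one row per component.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


def top_sketch(
    labels: Sequence[str],
    attribution: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for top(): average each component's row, then rank."""
    # One scalar per component: the mean over the remaining (batch) entries.
    scores = [fmean(row) for row in attribution]
    # Sort by signed value, largest first (an assumption about the ranking rule).
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


labels = ["embed", "0_attn_out", "0_mlp_out"]
attribution = [[0.5, 1.5], [2.0, 4.0], [-1.0, -3.0]]  # two batch items per component
print(top_sketch(labels, attribution, k=2))
```

Because the mean hides disagreement between batch items or positions, a component with large contributions of opposite sign can rank low under this kind of reduction.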

unit: str

direct_logit_attribution(): Compute Direct Logit Attribution for a prompt.

Decomposes the contribution of model components to the logit of answer_tokens or, if incorrect_tokens is given, to the logit difference between answer_tokens and incorrect_tokens, that is, the projection onto the difference of their W_U directions. The logit difference is usually what you want for circuit analysis.

The model is run once with caching unless a precomputed cache is passed. Works with both HookedTransformer and TransformerBridge.

Note that DLA attributes only the part of a logit that comes from the residual stream through the unembedding direction; the unembedding bias b_U is a per-token constant that no component produces. So a complete decomposition reconstructs logit[token] - b_U[token] rather than the raw logit.
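
A quick sanity check follows from the note above: the attributions of a complete decomposition should sum to the raw logit minus the bias entry for that token. The helper below is hypothetical (it is not part of this module) and uses made-up numbers; with a real model the inputs would come from the model's logits, its b_U, and the attribution tensor.

```python
from typing import Iterable


def completeness_gap(
    attributions: Iterable[float],
    raw_logit: float,
    b_u_token: float,
) -> float:
    """Hypothetical helper: distance between sum(attributions) and logit - b_U."""
    return (raw_logit - b_u_token) - sum(attributions)


# Made-up values: three component attributions, one raw logit, one bias entry.
toy_attributions = [1.75, 0.0, -1.25]

# Compared against logit - b_U, the decomposition is complete (gap of zero).
gap = completeness_gap(toy_attributions, raw_logit=0.75, b_u_token=0.25)
print(f"gap vs. logit - b_U: {gap:+.3f}")

# Compared against the raw logit, exactly the bias is left unexplained.
gap_raw = completeness_gap(toy_attributions, raw_logit=0.75, b_u_token=0.0)
print(f"gap vs. raw logit:   {gap_raw:+.3f}")
```

For a logit difference the same reasoning applies with the difference of the two bias entries in place of b_u_token.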

On a TransformerBridge, compatibility mode must be enabled so that the final LayerNorm is folded into W_U. Without it the projection direction would be wrong and DLA would produce silently incorrect numbers, so the call raises a ValueError instead. Hybrid architectures (Mamba/SSM/Mixer/LinearAttention) are not yet supported because decompose_resid only understands the attn_out + mlp_out block layout; passing one raises a NotImplementedError.

Parameters:

model – A HookedTransformer or TransformerBridge (the latter with enable_compatibility_mode() already called).
input – Prompt to run — a string, list of strings, or token tensor. Optional only when a precomputed cache is supplied.
answer_tokens – The correct token(s) to attribute, as a string, id, or tensor. A string is converted with model.to_single_token.
incorrect_tokens – Optional baseline token(s). When given, attribution is computed for the answer - incorrect residual direction. Must broadcast to the same shape as answer_tokens.
unit –
Decomposition granularity:
- "component" (default): embedding + each layer’s attention and MLP output (via decompose_resid).
- "layer": cumulative residual stream after each sublayer, i.e. logit-lens trajectory (via accumulated_resid).
- "head": each attention head individually, plus a remainder term for everything else (via stack_head_results).
pos – Sequence position(s) to attribute. Defaults to -1 (the final token, the usual choice for next-token DLA). Pass None to keep every position (the result then has a trailing position axis).
cache – Optional precomputed ActivationCache to reuse instead of running the model again.
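
The relationship between the "component" and "layer" units can be illustrated without a model. In the idealized linear picture (a single fixed LayerNorm scale applied to every term, which is an assumption here), the attribution of a cumulative residual stream is the running sum of the per-component attributions that feed into it. The numbers and labels below are made up, and the exact labels and checkpoints returned for unit="layer" are not reproduced.

```python
from itertools import accumulate

# Hypothetical per-component attributions, as unit="component" might report them.
component_labels = ["embed", "0_attn_out", "0_mlp_out", "1_attn_out", "1_mlp_out"]
component_attr = [1.75, 0.0, -1.25, 2.5, 0.5]

# Running sum: the logit-lens style trajectory after each successive component.
trajectory = list(accumulate(component_attr))

for label, value in zip(component_labels, trajectory):
    print(f"after {label:>10}: {value:+.3f}")

# The last point of the trajectory equals the total attributed logit.
assert trajectory[-1] == sum(component_attr)
```

Read this way, "component" shows what each sublayer adds, while "layer" shows where along the depth of the model the prediction becomes readable.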

Returns:

A DirectLogitAttribution with attribution (shape [component, *batch_and_pos]) and aligned labels.

Raises:

ValueError – If unit is invalid, answer_tokens is None, neither input nor cache is provided, or a TransformerBridge is passed without compatibility mode enabled.
NotImplementedError – If a TransformerBridge reports a hybrid block layout (Mamba/SSM/Mixer/LinearAttention).