transformer_lens.head_detector#

Head Detector.

Utilities for detecting specific types of heads (e.g. previous token heads).

transformer_lens.head_detector.compute_head_attention_similarity_score(attention_pattern: Tensor, detection_pattern: Tensor, *, exclude_bos: bool, exclude_current_token: bool, error_measure: Literal['abs', 'mul']) float#

Compute the similarity between attention_pattern and detection_pattern.

Parameters:
  • attention_pattern – Lower triangular matrix (Tensor) representing the attention pattern of a particular attention head.

  • detection_pattern – Lower triangular matrix (Tensor) representing the attention pattern we are looking for.

  • exclude_bos – True if the beginning-of-sentence (BOS) token should be omitted from comparison. False otherwise.

  • exclude_current_token – True if the current token at each position should be omitted from comparison. False otherwise.

  • error_measure – “abs” for using absolute values of element-wise differences as the error measure. “mul” for using element-wise multiplication (legacy code).
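
A minimal, hedged sketch of calling this utility directly on a toy pattern (the tensors below are made up for illustration; in normal use the attention pattern comes from a cached forward pass):

```python
import torch
from transformer_lens.head_detector import compute_head_attention_similarity_score

# Toy 4x4 lower-triangular attention pattern (each row sums to 1, like softmaxed attention).
attention_pattern = torch.tensor([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.1],
])
# Detection pattern for a previous-token head: ones on the sub-diagonal.
detection_pattern = torch.diag(torch.ones(3), diagonal=-1)

score = compute_head_attention_similarity_score(
    attention_pattern,
    detection_pattern,
    exclude_bos=False,
    exclude_current_token=False,
    error_measure="mul",
)
print(score)  # with "mul": fraction of attention mass on previous-token positions
```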

transformer_lens.head_detector.detect_head(model: HookedTransformer, seq: Union[str, List[str]], detection_pattern: Union[Tensor, Literal['previous_token_head', 'duplicate_token_head', 'induction_head']], heads: Optional[Union[List[Tuple[int, int]], Dict[int, List[int]]]] = None, cache: Optional[ActivationCache] = None, *, exclude_bos: bool = False, exclude_current_token: bool = False, error_measure: Literal['abs', 'mul'] = 'mul') Tensor#

Search for a Particular Type of Attention Head.

Searches the model (or a specific set of heads, for circuit analysis) for a particular type of attention head. The head type is specified by a detection pattern, a (sequence_length, sequence_length) tensor representing the attention pattern we expect that type of attention head to show. Instead of a tensor, the detection pattern can also be passed as the name of one of the pre-specified types of attention head (see HeadName for available patterns), in which case the tensor is computed within the function itself.

There are two error measures available for quantifying the match between the detection pattern and the actual attention pattern.

  1. “mul” (default) multiplies both tensors element-wise and divides the sum of the result by the sum of the attention pattern. In this case the detection pattern should typically contain only ones and zeros, which allows a straightforward interpretation of the score: what fraction of this head’s attention is allocated to these specific query-key pairs? Using values other than 0 or 1 is not prohibited, but it will raise a warning (which can, of course, be disabled).

  2. “abs” calculates the mean element-wise absolute difference between the detection pattern and the actual attention pattern. The raw result ranges from 0 to 2, where a lower score corresponds to greater accuracy. Subtracting it from 1 maps that range onto the (-1, 1) interval, with 1 being a perfect match and -1 a perfect mismatch.

Which one should you use?

“mul” is likely better for quick or exploratory investigations. For precise examinations where you’re trying to reproduce as much functionality as possible or really test your understanding of the attention head, you probably want to switch to “abs”.

The advantage of “abs” is that you can make more precise predictions and have that precision reflected in the score. You can predict, for instance, 0.2 attention to X and 0.8 attention to Y, and your score will be better the closer your prediction is. The “mul” metric does not allow this: you get the same score whether the attention is (0.2, 0.8), (0.5, 0.5), or (0.8, 0.2), as the sketch below illustrates.
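
To make this concrete, here is a small sketch using simplified stand-ins for the two error measures (not the library’s exact implementation, which also handles BOS/current-token exclusion and its own normalization):

```python
import torch

def mul_score(attn, det):
    # Simplified "mul": fraction of attention mass falling on the marked positions.
    return ((attn * det).sum() / attn.sum()).item()

def abs_score(attn, det):
    # Simplified "abs": 1 minus the per-query mean absolute difference.
    return (1 - (attn - det).abs().sum(dim=-1).mean()).item()

binary_det = torch.tensor([[1.0, 1.0]])  # "mul"-style prediction: attention lands on X or Y
graded_det = torch.tensor([[0.2, 0.8]])  # "abs"-style prediction: 0.2 to X, 0.8 to Y

for attn in (torch.tensor([[0.2, 0.8]]),
             torch.tensor([[0.5, 0.5]]),
             torch.tensor([[0.8, 0.2]])):
    print(f"mul={mul_score(attn, binary_det):.2f}  abs={abs_score(attn, graded_det):.2f}")
# "mul" prints 1.00 in every case; "abs" rewards the distribution closest to the prediction.
```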

Parameters:
  • model – Model being used.

  • seq – String or list of strings being fed to the model.

  • detection_pattern – Either a (sequence_length, sequence_length) Tensor representing the attention pattern of the head we are looking for, or the name of a pre-specified head in HEAD_NAMES. Currently available heads are: [“previous_token_head”, “duplicate_token_head”, “induction_head”].

  • heads – If specific attention heads are given here, all other heads’ scores are set to -1. Useful for IOI-style circuit analysis. Heads can be specified as a list of (layer, head) tuples or as a dictionary mapping a layer to the heads within that layer that we want to analyze.

  • cache – Optionally pass in a precomputed ActivationCache to save time.

  • exclude_bos – Exclude attention paid to the beginning of sequence token.

  • exclude_current_token – Exclude attention paid to the current token.

  • error_measure – “mul” for using element-wise multiplication. “abs” for using absolute values of element-wise differences as the error measure.

Returns:

Tensor representing the score for each attention head.
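
An end-to-end usage sketch (the model name, prompt, and head choices below are arbitrary examples, not recommendations from the library):

```python
from transformer_lens import HookedTransformer
from transformer_lens.head_detector import detect_head

model = HookedTransformer.from_pretrained("gpt2")  # any HookedTransformer

# Score every head for previous-token behaviour; the result has shape (n_layers, n_heads).
scores = detect_head(
    model,
    "The quick brown fox jumps over the quick brown dog",
    "previous_token_head",
    exclude_bos=True,
    exclude_current_token=True,
    error_measure="abs",
)
layer, head = divmod(scores.argmax().item(), scores.shape[1])
print(f"Best previous-token candidate: layer {layer}, head {head}, score {scores.max().item():.3f}")

# Restrict scoring to a hand-picked set of heads (e.g. for circuit analysis);
# every other entry in the returned tensor is set to -1.
subset_scores = detect_head(
    model,
    "The quick brown fox jumps over the quick brown dog",
    "induction_head",
    heads=[(5, 1), (5, 5), (6, 9)],  # hypothetical (layer, head) pairs
)
```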

transformer_lens.head_detector.get_duplicate_token_head_detection_pattern(tokens: Tensor) Tensor#

Outputs a detection pattern for [duplicate token heads](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=2UkvedzOnghL5UHUgVhROxeo).

Parameters:

tokens – Tokens being fed to the model.
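
For intuition, a hedged sketch of what such a detection pattern can look like, following the conceptual definition of a duplicate token head rather than the library’s exact construction:

```python
import torch

def duplicate_token_pattern_sketch(tokens: torch.Tensor) -> torch.Tensor:
    # Position i should attend to every earlier position j < i that holds the same token:
    # token-equality matrix, with the diagonal removed, masked to the lower triangle.
    token_ids = tokens.squeeze()  # assumes shape (1, seq_len) or (seq_len,)
    eq = (token_ids[:, None] == token_ids[None, :]).float()
    eq.fill_diagonal_(0.0)
    return torch.tril(eq)
```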

transformer_lens.head_detector.get_induction_head_detection_pattern(tokens: Tensor) Tensor#

Outputs a detection pattern for [induction heads](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=_tFVuP5csv5ORIthmqwj0gSY).

Parameters:

tokens – Tokens being fed to the model.
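
Again for intuition only, a hedged sketch: an induction head attends to the token that followed an earlier occurrence of the current token, i.e. the duplicate-token pattern shifted one position to the right (assumed construction, not necessarily the library’s exact code):

```python
import torch

def induction_pattern_sketch(tokens: torch.Tensor) -> torch.Tensor:
    # Start from the duplicate-token pattern...
    token_ids = tokens.squeeze()
    eq = (token_ids[:, None] == token_ids[None, :]).float()
    eq.fill_diagonal_(0.0)
    duplicate = torch.tril(eq)
    # ...then shift one column to the right: attend to the token *after* the earlier match.
    shifted = torch.roll(duplicate, shifts=1, dims=1)
    shifted[:, 0] = 0.0  # torch.roll wraps around, so clear the wrapped-in first column
    return torch.tril(shifted)
```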

transformer_lens.head_detector.get_previous_token_head_detection_pattern(tokens: Tensor) Tensor#

Outputs a detection pattern for [previous token heads](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=0O5VOHe9xeZn8Ertywkh7ioc).

Parameters:

tokens – Tokens being fed to the model.
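
A hedged sketch of the expected shape: each position attends to the token immediately before it, i.e. ones on the sub-diagonal (illustrative; not necessarily the library’s exact construction):

```python
import torch

def previous_token_pattern_sketch(tokens: torch.Tensor) -> torch.Tensor:
    # Ones on the sub-diagonal of a (seq_len, seq_len) matrix: query i attends to key i - 1.
    seq_len = tokens.squeeze().shape[-1]
    return torch.diag(torch.ones(seq_len - 1), diagonal=-1)
```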

transformer_lens.head_detector.get_supported_heads() None#

Prints a list of the supported head names.