transformer_lens.components.attention#

Hooked Transformer Attention Component.

This module contains the Attention component.

class transformer_lens.components.attention.Attention(cfg: Union[Dict, HookedTransformerConfig], attn_type: str = 'global', layer_id: Optional[int] = None)#

Bases: AbstractAttention

__init__(cfg: Union[Dict, HookedTransformerConfig], attn_type: str = 'global', layer_id: Optional[int] = None)#

Attention Block - params have shape [head_index, d_model, d_head] (or [head_index, d_head, d_model] for W_O) and multiply on the right. attn_scores refers to the query-key dot product immediately before the attention softmax.

Convention: All attention pattern-style matrices have shape [batch, head_index, query_pos, key_pos]
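For orientation, here is a minimal sketch that builds a standalone Attention block from a small HookedTransformerConfig and checks the shape conventions above. The concrete config sizes, the use of attn_only=True, and the manual weight initialisation (component weights are normally initialised by the parent HookedTransformer) are illustrative assumptions rather than the canonical usage inside a full model.

```python
import torch

from transformer_lens import HookedTransformerConfig
from transformer_lens.components import Attention

# Small attention-only config; the concrete sizes here are arbitrary.
cfg = HookedTransformerConfig(
    n_layers=1,
    d_model=128,
    d_head=32,
    n_heads=4,
    n_ctx=64,
    d_vocab=1000,
    attn_only=True,
)

attn = Attention(cfg, attn_type="global", layer_id=0)

# Parameter shapes follow the convention above:
# W_Q/W_K/W_V are [head_index, d_model, d_head]; W_O is [head_index, d_head, d_model].
print(attn.W_Q.shape)  # torch.Size([4, 128, 32])
print(attn.W_O.shape)  # torch.Size([4, 32, 128])

# A standalone component's parameters are allocated but not initialised (that
# normally happens in HookedTransformer), so give them values before a forward pass.
with torch.no_grad():
    for p in attn.parameters():
        torch.nn.init.normal_(p, std=0.02)

# Record the attention pattern to check the [batch, head_index, query_pos, key_pos] shape.
pattern_shapes = []
attn.hook_pattern.add_hook(lambda pattern, hook: pattern_shapes.append(pattern.shape))

x = torch.randn(2, 10, cfg.d_model)  # [batch, pos, d_model]
out = attn(query_input=x, key_input=x, value_input=x)
print(out.shape)          # torch.Size([2, 10, 128])
print(pattern_shapes[0])  # torch.Size([2, 4, 10, 10])
```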

Parameters:
  • cfg (Union[Dict, HookedTransformerConfig]) – Config

  • attn_type (str, optional) – “global” or “local”; used by GPT-Neo. Local attention means the model can only attend back cfg.window_size tokens (256 for GPT-Neo). Not used by any other model at the moment. Defaults to “global”.

  • layer_id (int, optional) – The index of the current layer. Used by the Mistral models (labelled here as stanford-gpt2) to scale down the pre-softmax attention scores by 1/(layer_id + 1) for numerical stability. Defaults to None. See the sketch below for configs that exercise both attn_type and layer_id.
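Both optional arguments only change behaviour when the config enables the matching feature. The sketch below is a hedged example, assuming HookedTransformerConfig exposes window_size, attn_types, and scale_attn_by_inverse_layer_idx fields; the concrete values are made up for illustration.

```python
from transformer_lens import HookedTransformerConfig
from transformer_lens.components import Attention

# GPT-Neo-style local attention: a 256-token sliding window.
neo_style_cfg = HookedTransformerConfig(
    n_layers=2,
    d_model=128,
    d_head=32,
    n_heads=4,
    n_ctx=512,
    d_vocab=1000,
    attn_only=True,
    window_size=256,
    attn_types=["global", "local"],
)
local_attn = Attention(neo_style_cfg, attn_type="local", layer_id=1)

# Stanford Mistral-style (stanford-gpt2) score scaling: attn_scores are divided
# by (layer_id + 1) before the softmax, so layer_id must be supplied.
mistral_style_cfg = HookedTransformerConfig(
    n_layers=2,
    d_model=128,
    d_head=32,
    n_heads=4,
    n_ctx=512,
    d_vocab=1000,
    attn_only=True,
    scale_attn_by_inverse_layer_idx=True,
)
scaled_attn = Attention(mistral_style_cfg, attn_type="global", layer_id=1)
```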