transformer_lens.components.attention#
Hooked Transformer Attention Component.
This module contains the Attention component.
- class transformer_lens.components.attention.Attention(cfg: Union[Dict, HookedTransformerConfig], attn_type: str = 'global', layer_id: Optional[int] = None)#
Bases: AbstractAttention
- __init__(cfg: Union[Dict, HookedTransformerConfig], attn_type: str = 'global', layer_id: Optional[int] = None)#
Attention Block - params have shape [head_index, d_model, d_head] (or [head_index, d_head, d_model] for W_O) and multiply on the right. attn_scores refers to the query-key dot product immediately before the attention softmax.
Convention: All attention pattern-style matrices have shape [batch, head_index, query_pos, key_pos]
- Parameters:
cfg (Union[Dict, HookedTransformerConfig]) – Config
attn_type (str, optional) – “global” or “local”, used by GPT-Neo. Local attention means the model can only attend back cfg.window_size tokens (here, 256). Not used by any other model at the moment. Defaults to “global”.
layer_id (int, optional) – The index of the current layer. Used by the Mistral models (labelled here as stanford-gpt2) to scale down attention scores pre-softmax for numerical stability reasons by 1/(layer_id+1). Defaults to None.
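The following is a minimal sketch of constructing this component on its own; the config values are illustrative assumptions, and the exact forward signature and available keyword arguments may differ between TransformerLens versions.

    import torch
    from transformer_lens import HookedTransformerConfig
    from transformer_lens.components import Attention

    # Illustrative config values, not taken from any real model.
    cfg = HookedTransformerConfig(
        n_layers=1, d_model=128, n_ctx=64, d_head=16, n_heads=8, act_fn="gelu"
    )

    attn = Attention(cfg, attn_type="global", layer_id=0)

    # Weights follow the [head_index, d_model, d_head] convention,
    # with W_O shaped [head_index, d_head, d_model].
    print(attn.W_Q.shape)  # torch.Size([8, 128, 16])
    print(attn.W_O.shape)  # torch.Size([8, 16, 128])

    # Residual-stream inputs of shape [batch, pos, d_model].
    x = torch.randn(2, 64, cfg.d_model)
    out = attn(query_input=x, key_input=x, value_input=x)
    print(out.shape)  # torch.Size([2, 64, 128])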