transformer_lens.model_bridge.generalized_components.codegen_attention module¶
CodeGen-specific attention bridge component.
CodeGen attention uses a fused QKV projection (qkv_proj) with a GPT-J-style
rotate_every_two rotary positional encoding applied to Q and K before the
attention matmul. The rotary embeddings are stored as a sinusoidal buffer
(embed_positions) on the original CodeGenAttention module and are
indexed by position_ids.
- Optional parameters (may be absent in some CodeGen checkpoints):
rotary_dim: if None, RoPE is applied to the full head dimension.
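The rotate_every_two scheme described above can be sketched in plain PyTorch. This is an illustrative reconstruction from the description, not the bridge's own code: the helper names are hypothetical, and the embed_positions buffer is rebuilt here from the standard sinusoidal formula.

```python
import torch

def rotate_every_two(x: torch.Tensor) -> torch.Tensor:
    """GPT-J-style interleaved rotation: (x0, x1, x2, x3, ...) -> (-x1, x0, -x3, x2, ...)."""
    x1 = x[..., ::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def sinusoidal_positions(num_pos: int, dim: int) -> torch.Tensor:
    """Sinusoidal buffer analogous to embed_positions: [num_pos, dim] laid out as (sin | cos)."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.einsum("i,j->ij", torch.arange(num_pos).float(), inv_freq)
    return torch.cat((torch.sin(angles), torch.cos(angles)), dim=-1)

def apply_rotary(x: torch.Tensor, position_ids: torch.Tensor,
                 embed_positions: torch.Tensor) -> torch.Tensor:
    """Apply rotate_every_two RoPE to x of shape [batch, seq, heads, rotary_dim],
    indexing the sinusoidal buffer by position_ids as described above."""
    sincos = embed_positions[position_ids]                       # [batch, seq, rotary_dim]
    sin, cos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1)
    # Each frequency rotates one (even, odd) pair, so repeat every angle twice.
    sin = torch.repeat_interleave(sin[:, :, None, :], 2, dim=3)  # broadcast over heads
    cos = torch.repeat_interleave(cos[:, :, None, :], 2, dim=3)
    return x * cos + rotate_every_two(x) * sin
```

Two sanity properties follow from the construction: position 0 has all angles zero (identity map), and each pairwise rotation preserves the vector norm.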
- class transformer_lens.model_bridge.generalized_components.codegen_attention.CodeGenAttentionBridge(name: str, config: Any, split_qkv_matrix: Callable | None = None, submodules: Dict[str, GeneralizedComponent] | None = None, qkv_conversion_rule: BaseTensorConversion | None = None, attn_conversion_rule: BaseTensorConversion | None = None, pattern_conversion_rule: BaseTensorConversion | None = None)¶
Bases: JointQKVAttentionBridge
Attention bridge for CodeGen models.
CodeGen uses:
- A fused qkv_proj linear (no bias).
- GPT-J-style rotate_every_two RoPE applied to Q and K before the attention matmul. Rotary embeddings are stored in the embed_positions buffer of the original CodeGenAttention module and indexed by position_ids. Only the first rotary_dim dimensions of each head are rotated; when rotary_dim is None, the full head dimension is rotated.
- An out_proj linear output projection (no bias).
All TransformerLens hooks fire in the forward pass:
hook_q, hook_k, hook_v, hook_attn_scores, hook_pattern, hook_z (via o.hook_in), and hook_result (via hook_out).
- __init__(name: str, config: Any, split_qkv_matrix: Callable | None = None, submodules: Dict[str, GeneralizedComponent] | None = None, qkv_conversion_rule: BaseTensorConversion | None = None, attn_conversion_rule: BaseTensorConversion | None = None, pattern_conversion_rule: BaseTensorConversion | None = None) None¶
Initialise the CodeGen attention bridge.
- Parameters:
name – The name of this component.
config – Model configuration (must have n_heads, d_head, and optionally rotary_dim).
split_qkv_matrix – Callable that splits the fused QKV weight into three nn.Linear modules for Q, K, and V. Required: there is no sensible default for CodeGen's mp_num=4 split logic.
submodules – Optional extra submodules to register.
qkv_conversion_rule – Optional conversion rule for Q/K/V outputs.
attn_conversion_rule – Optional conversion rule for the attention output.
pattern_conversion_rule – Optional conversion rule for attention patterns.
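To illustrate the kind of callable split_qkv_matrix must be, here is a hedged sketch that splits a fused QKV weight using the mp_num=4 block layout from Hugging Face's CodeGenAttention, which reshapes the fused output into mp_num blocks and splits each block into query, value, key slices in that order. The function name echoes the parameter, but the exact signature and return convention are assumptions, not the bridge's required interface.

```python
import torch
from torch import nn

def split_qkv_matrix(qkv_proj: nn.Linear, d_model: int, mp_num: int = 4):
    """Split a fused CodeGen qkv_proj (out_features == 3 * d_model, no bias)
    into separate Q, K, V nn.Linear modules.

    Sketch only: assumes the fused output dimension is laid out as mp_num
    blocks, each holding (query, value, key) slices of d_model // mp_num rows,
    mirroring the reshape/split in HF's CodeGenAttention.forward.
    """
    local_dim = d_model // mp_num
    # [3*d_model, d_model] -> [mp_num, 3*local_dim, d_model]
    w = qkv_proj.weight.reshape(mp_num, 3 * local_dim, d_model)
    w_q, w_v, w_k = torch.split(w, local_dim, dim=1)  # note HF order: query, value, key

    def make_linear(block: torch.Tensor) -> nn.Linear:
        lin = nn.Linear(d_model, d_model, bias=False)
        lin.weight.data = block.reshape(d_model, d_model)
        return lin

    return make_linear(w_q), make_linear(w_k), make_linear(w_v)
```

A quick way to check such a split is to compare the three linears against reshaping and splitting the fused projection's output directly.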
- forward(*args: Any, **kwargs: Any) Any¶
Forward pass through CodeGen attention with all hooks firing.
Manually reconstructs attention so that all TransformerLens hooks (hook_q, hook_k, hook_v, hook_attn_scores, hook_pattern, hook_z, hook_result) fire correctly.
CodeGen passes position_ids as a keyword argument; these are used to index into the embed_positions sinusoidal buffer stored on the original CodeGenAttention module.
- Parameters:
*args – Positional arguments; the first must be hidden_states.
**kwargs – Keyword arguments including position_ids (required for RoPE), attention_mask (optional), layer_past (optional KV cache), and cache_position (optional).
- Returns:
Tuple of (attn_output, attn_weights).
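The manual reconstruction amounts to standard scaled dot-product attention with hook points interleaved. A minimal sketch follows; hook firing is indicated only by comments, and the real forward additionally applies RoPE, KV caching, and the conversion rules:

```python
import math
import torch

def manual_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     causal_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of the reconstructed attention; q, k, v are [batch, heads, seq, d_head]
    and causal_mask is a boolean [seq, seq] lower-triangular mask."""
    # hook_q / hook_k / hook_v would observe q, k, v here
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])    # hook_attn_scores
    scores = scores.masked_fill(~causal_mask, torch.finfo(scores.dtype).min)
    pattern = scores.softmax(dim=-1)                             # hook_pattern
    z = pattern @ v                                              # hook_z
    return z  # hook_result would fire after the out_proj projection
```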
- get_random_inputs(batch_size: int = 2, seq_len: int = 8, device=None, dtype=None)¶
Return random inputs for isolated component testing.
CodeGen attention requires position_ids (to index into embed_positions) and a HuggingFace-style 4D causal attention mask. The mask is provided so that both the bridge and the HF component apply identical causal masking during the all_components benchmark.
- Parameters:
batch_size – Batch size.
seq_len – Sequence length.
device – Target device (defaults to CPU).
dtype – Tensor dtype (defaults to float32).
- Returns:
Dict with hidden_states, position_ids, and attention_mask suitable for both bridge and HF forward calls.
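A hedged sketch of inputs with the shapes this method describes; d_model and the helper name are assumptions (the real method reads sizes from the model config), but the HuggingFace-style 4D additive causal mask (0 where allowed, a large negative value where masked) is standard:

```python
import torch

def random_attention_inputs(batch_size: int = 2, seq_len: int = 8,
                            d_model: int = 64, device=None,
                            dtype=torch.float32) -> dict:
    """Hypothetical sketch of get_random_inputs-style output: hidden_states,
    position_ids, and a [batch, 1, seq, seq] additive causal attention mask."""
    hidden_states = torch.randn(batch_size, seq_len, d_model, device=device, dtype=dtype)
    position_ids = torch.arange(seq_len, device=device).unsqueeze(0).expand(batch_size, -1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
    mask = torch.zeros(seq_len, seq_len, device=device, dtype=dtype)
    mask = mask.masked_fill(~causal, torch.finfo(dtype).min)     # block future positions
    attention_mask = mask[None, None, :, :].expand(batch_size, 1, -1, -1)
    return {"hidden_states": hidden_states,
            "position_ids": position_ids,
            "attention_mask": attention_mask}
```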
- set_original_component(original_component: Module) None¶
Wire the original CodeGenAttention and set up the output projection.
The base JointQKVAttentionBridge.set_original_component hardcodes c_proj for the output projection wiring. CodeGen uses out_proj instead, so we override here to wire it correctly after calling super().
- Parameters:
original_component – The original CodeGenAttention layer.