transformer_lens.model_bridge.generalized_components.codegen_attention module

CodeGen-specific attention bridge component.

CodeGen attention uses a fused QKV projection (qkv_proj) with a GPT-J-style rotate_every_two rotary positional encoding applied to Q and K before the attention matmul. The rotary embeddings are stored as a sinusoidal buffer (embed_positions) on the original CodeGenAttention module and are indexed by position_ids.

Optional parameters (may be absent in some CodeGen checkpoints):
  • rotary_dim: if None, RoPE is applied to the full head dimension.
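The rotate_every_two rotation and the partial-rotary behaviour described above can be sketched as follows. This is a minimal illustration, not the bridge's actual implementation; the helper names and the assumption that sin/cos are already laid out per-dimension are mine:

```python
import torch

def rotate_every_two(x: torch.Tensor) -> torch.Tensor:
    # GPT-J-style interleaved rotation: (x0, x1, x2, x3, ...) -> (-x1, x0, -x3, x2, ...)
    x1 = x[..., ::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rotary(x: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor,
                 rotary_dim=None) -> torch.Tensor:
    # Rotate only the first rotary_dim dims of each head; with rotary_dim=None
    # the full head dimension is rotated, matching the docstring above.
    # sin/cos are assumed to already be repeated per pair of dimensions.
    if rotary_dim is None:
        rotary_dim = x.shape[-1]
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    rotated = (x_rot * cos) + (rotate_every_two(x_rot) * sin)
    return torch.cat([rotated, x_pass], dim=-1)
```

Because each interleaved pair is rotated by a single angle, the transformation is norm-preserving on the rotated dimensions, while dimensions past rotary_dim pass through unchanged.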

class transformer_lens.model_bridge.generalized_components.codegen_attention.CodeGenAttentionBridge(name: str, config: Any, split_qkv_matrix: Callable | None = None, submodules: Dict[str, GeneralizedComponent] | None = None, qkv_conversion_rule: BaseTensorConversion | None = None, attn_conversion_rule: BaseTensorConversion | None = None, pattern_conversion_rule: BaseTensorConversion | None = None)

Bases: JointQKVAttentionBridge

Attention bridge for CodeGen models.

CodeGen uses:

  • A fused qkv_proj linear (no bias).

  • GPT-J-style rotate_every_two RoPE applied to Q and K before the attention matmul. Rotary embeddings are stored in the embed_positions buffer of the original CodeGenAttention module and indexed by position_ids.

  • Only the first rotary_dim dimensions of each head are rotated. When rotary_dim is None, the full head dimension is rotated.

  • An out_proj linear output projection (no bias).

All TransformerLens hooks fire in the forward pass: hook_q, hook_k, hook_v, hook_attn_scores, hook_pattern, hook_z (via o.hook_in), hook_result (via hook_out).

__init__(name: str, config: Any, split_qkv_matrix: Callable | None = None, submodules: Dict[str, GeneralizedComponent] | None = None, qkv_conversion_rule: BaseTensorConversion | None = None, attn_conversion_rule: BaseTensorConversion | None = None, pattern_conversion_rule: BaseTensorConversion | None = None) None

Initialise the CodeGen attention bridge.

Parameters:
  • name – The name of this component.

  • config – Model configuration (must have n_heads, d_head, and optionally rotary_dim).

  • split_qkv_matrix – Callable that splits the fused QKV weight into three nn.Linear modules for Q, K, and V. Required: there is no sensible default for CodeGen’s mp_num=4 split logic.

  • submodules – Optional extra submodules to register.

  • qkv_conversion_rule – Optional conversion rule for Q/K/V outputs.

  • attn_conversion_rule – Optional conversion rule for the attention output.

  • pattern_conversion_rule – Optional conversion rule for attention patterns.
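To illustrate the mp_num=4 split logic that a split_qkv_matrix callable must handle, here is a hedged sketch. It assumes the HuggingFace CodeGen layout, where the fused output dimension is arranged as mp_num blocks, each block holding query, value, key chunks in that order; the function name and signature are hypothetical:

```python
import torch
import torch.nn as nn

def split_qkv_matrix(qkv_proj: nn.Linear, n_heads: int, d_head: int, mp_num: int = 4):
    # Split CodeGen's fused qkv_proj weight (shape [3*d_model, d_in], no bias)
    # into separate Q, K, V Linear modules, respecting the mp_num block layout.
    d_model = n_heads * d_head
    local_dim = d_model // mp_num
    w = qkv_proj.weight.reshape(mp_num, 3 * local_dim, -1)
    q_w = w[:, :local_dim].reshape(d_model, -1)              # query chunk of each block
    v_w = w[:, local_dim:2 * local_dim].reshape(d_model, -1)  # value chunk
    k_w = w[:, 2 * local_dim:].reshape(d_model, -1)           # key chunk

    def as_linear(weight: torch.Tensor) -> nn.Linear:
        lin = nn.Linear(weight.shape[1], weight.shape[0], bias=False)
        lin.weight = nn.Parameter(weight.clone())
        return lin

    return as_linear(q_w), as_linear(k_w), as_linear(v_w)
```

A quick way to validate such a splitter is to compare its outputs against splitting the fused activation the way CodeGen's own forward pass does.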

forward(*args: Any, **kwargs: Any) Any

Forward pass through CodeGen attention with all hooks firing.

Manually reconstructs attention so that all TransformerLens hooks (hook_q, hook_k, hook_v, hook_attn_scores, hook_pattern, hook_z, hook_result) fire correctly.

CodeGen passes position_ids as a keyword argument; these are used to index into the embed_positions sinusoidal buffer stored on the original CodeGenAttention module.

Parameters:
  • *args – Positional arguments; the first must be hidden_states.

  • **kwargs – Keyword arguments including position_ids (required for RoPE), attention_mask (optional), layer_past (optional KV cache), and cache_position (optional).

Returns:

Tuple of (attn_output, attn_weights).
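The shape of such a manual reconstruction, with the intermediate tensors that correspond to hook_attn_scores, hook_pattern, and hook_z, can be sketched as below. This is a generic scaled-dot-product illustration under an assumed [batch, n_heads, seq, d_head] layout, not the bridge's exact code:

```python
import torch

def manual_attention(q, k, v, attention_mask=None):
    # q, k, v: [batch, n_heads, seq, d_head]; attention_mask: additive HF-style 4D mask.
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)  # ~ hook_attn_scores
    if attention_mask is not None:
        scores = scores + attention_mask
    pattern = scores.softmax(dim=-1)  # ~ hook_pattern
    z = pattern @ v                   # ~ hook_z (before the out_proj projection)
    return z, pattern
```

Applying out_proj to z would then yield the attn_output seen at hook_result.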

get_random_inputs(batch_size: int = 2, seq_len: int = 8, device=None, dtype=None)

Return random inputs for isolated component testing.

CodeGen attention requires position_ids (to index into embed_positions) and a HuggingFace-style 4D causal attention mask. The mask is provided so that both the bridge and the HF component apply identical causal masking during the all_components benchmark.

Parameters:
  • batch_size – Batch size.

  • seq_len – Sequence length.

  • device – Target device (defaults to CPU).

  • dtype – Tensor dtype (defaults to float32).

Returns:

Dict with hidden_states, position_ids, and attention_mask suitable for both bridge and HF forward calls.
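A self-contained sketch of inputs in this shape follows. The helper name and the fixed d_model are assumptions for illustration; the real method derives sizes from the model config:

```python
import torch

def make_random_attention_inputs(batch_size=2, seq_len=8, d_model=16,
                                 device=None, dtype=torch.float32):
    # Random hidden states for an isolated attention forward pass.
    hidden_states = torch.randn(batch_size, seq_len, d_model, device=device, dtype=dtype)
    # Sequential position_ids, used to index the embed_positions buffer.
    position_ids = torch.arange(seq_len, device=device).unsqueeze(0).expand(batch_size, -1)
    # HF-style 4D additive causal mask: 0 where attending is allowed,
    # a large negative value on strictly-future positions.
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min,
                      device=device, dtype=dtype).triu(diagonal=1)
    attention_mask = mask[None, None].expand(batch_size, 1, seq_len, seq_len)
    return {"hidden_states": hidden_states, "position_ids": position_ids,
            "attention_mask": attention_mask}
```

Because the mask is additive, both the bridge and the HF component can apply it by simply adding it to the attention scores before the softmax.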

set_original_component(original_component: Module) None

Wire the original CodeGenAttention and set up the output projection.

The base JointQKVAttentionBridge.set_original_component hardcodes c_proj for the output projection wiring. CodeGen uses out_proj instead, so this override wires it correctly after calling the base implementation.

Parameters:
  • original_component – The original CodeGenAttention layer.