transformer_lens.model_bridge.supported_architectures.codegen module

CodeGen architecture adapter.

class transformer_lens.model_bridge.supported_architectures.codegen.CodeGenArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for CodeGen models.

CodeGen uses a parallel attention+MLP block: the attention and MLP sublayers read the same LayerNorm output, and their outputs are summed into the residual stream rather than applied sequentially. The attention layer uses a fused qkv_proj weight whose row layout follows GPT-J-style mp_num=4 tensor-parallel partitioning: within each of the 4 MP partitions the rows are ordered [Q_part, V_part, K_part].
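As a minimal sketch of the parallel dataflow (the sublayers below are bias-free stand-ins, not the real CodeGen attention and MLP modules):

```python
import torch
import torch.nn as nn


class ParallelBlockSketch(nn.Module):
    """Illustrative parallel block: attention and MLP share one LayerNorm
    input and their outputs are summed with the residual stream."""

    def __init__(self, n_embd: int) -> None:
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        # Stand-ins for the real attention and MLP sublayers.
        self.attn = nn.Linear(n_embd, n_embd, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd, bias=False),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)  # single shared LayerNorm
        # Parallel: both sublayers see h; outputs are summed, not chained.
        return x + self.attn(h) + self.mlp(h)
```

The key contrast with a sequential GPT-2-style block is that the MLP sees the LayerNorm output directly, not the post-attention residual.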

Optional parameters (may be absent in some CodeGen checkpoints):

  • No bias on qkv_proj (the fused QKV projection)

  • No bias on out_proj

  • No bias on mlp.fc_in or mlp.fc_out

__init__(cfg: Any) None

Initialize the CodeGen architecture adapter.

split_qkv_matrix(attn_component: Any) tuple[Linear, Linear, Linear]

Split the fused QKV weight into separate Q, K, V linear modules.

CodeGen uses GPT-J-style tensor-parallel partitioning with mp_num=4 partitions. Within each partition the rows are ordered [Q_part, V_part, K_part] — note that V precedes K, unlike the conventional Q/K/V order.

The fused weight has shape [3 * n_embd, n_embd]. It is reshaped to [mp_num, 3, local_dim, n_embd] (where local_dim = n_embd / mp_num), the three slices along the second axis are taken (index 0 → Q, 1 → V, 2 → K), and each is flattened back to [n_embd, n_embd].

Parameters:

attn_component – The original CodeGenAttention module.

Returns:

Tuple of (q_linear, k_linear, v_linear) — three nn.Linear modules with no bias and weight shape [n_embd, n_embd].
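The reshape-and-slice procedure can be verified on a synthetic fused weight. This sketch uses toy sizes and illustrative variable names, not the adapter's actual code:

```python
import torch

mp_num = 4
n_embd = 16  # toy size; real CodeGen models are much larger
local_dim = n_embd // mp_num

# Build a fused weight whose rows follow the [Q, V, K] order within
# each of the mp_num partitions, so we can check the round trip.
q = torch.randn(n_embd, n_embd)
v = torch.randn(n_embd, n_embd)
k = torch.randn(n_embd, n_embd)
parts = []
for i in range(mp_num):
    sl = slice(i * local_dim, (i + 1) * local_dim)
    parts.extend([q[sl], v[sl], k[sl]])
fused = torch.cat(parts, dim=0)  # shape [3 * n_embd, n_embd]

# The split described above: reshape, then pick the Q/V/K slices.
w = fused.reshape(mp_num, 3, local_dim, n_embd)
q_out = w[:, 0].reshape(n_embd, n_embd)
v_out = w[:, 1].reshape(n_embd, n_embd)
k_out = w[:, 2].reshape(n_embd, n_embd)

assert torch.equal(q_out, q)
assert torch.equal(v_out, v)
assert torch.equal(k_out, k)
```

Each recovered matrix could then be loaded into an `nn.Linear(n_embd, n_embd, bias=False)` via its `weight` parameter.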