transformer_lens.model_bridge.supported_architectures.codegen module¶
CodeGen architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.codegen.CodeGenArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for CodeGen models.
CodeGen uses a parallel attention+MLP block: the attention and MLP sublayers share the same LayerNorm input, and their outputs are summed into the residual stream. The attention layer uses a fused qkv_proj weight whose layout follows GPT-J's mp_num=4 tensor-parallel partitioning: the rows are interleaved as [Q_part, V_part, K_part] within each of the 4 MP partitions.
Optional Parameters (may be absent in some CodeGen checkpoints):¶
No bias on qkv_proj (fused QKV has no bias)
No bias on out_proj
No bias on mlp.fc_in or mlp.fc_out
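The parallel block structure described above can be sketched as follows. This is a minimal illustration in NumPy (the actual adapter operates on PyTorch modules); the plain `layer_norm`, `attn`, and `mlp` callables here are stand-ins, not the library's real components:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Minimal LayerNorm without learnable scale/shift, for illustration only.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, attn, mlp):
    # CodeGen-style parallel residual: attention and MLP both read the
    # SAME LayerNorm output, and their outputs are summed into the residual,
    # rather than the sequential x -> attn -> ln -> mlp arrangement.
    h = layer_norm(x)
    return x + attn(h) + mlp(h)
```

Because both sublayers consume the same normalized input, their contributions to the residual stream are order-independent and can be computed in parallel.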
- __init__(cfg: Any) None¶
Initialize the CodeGen architecture adapter.
- split_qkv_matrix(attn_component: Any) tuple[Linear, Linear, Linear]¶
Split the fused QKV weight into separate Q, K, V linear modules.
CodeGen uses GPT-J-style tensor-parallel partitioning with mp_num=4 partitions. Within each partition the row order is [Q_part, V_part, K_part], i.e. not the conventional Q/K/V order. The fused weight has shape [3 * n_embd, n_embd]. We reshape to [mp_num, 3, local_dim, n_embd], extract the three slices, then flatten back to [n_embd, n_embd] for each of Q, K, V.
- Parameters:
attn_component – The original CodeGenAttention module.
- Returns:
Tuple of (q_linear, k_linear, v_linear): three nn.Linear modules with no bias and weight shape [n_embd, n_embd].
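The reshape-and-slice procedure described above can be sketched in NumPy on a toy weight (the real adapter works on torch tensors and wraps the results in nn.Linear modules; the toy sizes here are illustrative):

```python
import numpy as np

n_embd = 8                     # toy hidden size; real CodeGen models are much larger
mp_num = 4                     # GPT-J-style tensor-parallel partition count
local_dim = n_embd // mp_num   # rows each of Q/V/K contributes per partition

# Fused weight of shape [3 * n_embd, n_embd]: within each of the mp_num
# partitions, rows are interleaved as [Q_part, V_part, K_part] (V before K).
fused = np.arange(3 * n_embd * n_embd, dtype=np.float32).reshape(3 * n_embd, n_embd)

# Reshape so axis 1 indexes the Q/V/K slot inside each partition.
blocks = fused.reshape(mp_num, 3, local_dim, n_embd)

# Extract each slice across all partitions and flatten back to [n_embd, n_embd].
# Note the non-conventional order: slot 0 is Q, slot 1 is V, slot 2 is K.
w_q = blocks[:, 0].reshape(n_embd, n_embd)
w_v = blocks[:, 1].reshape(n_embd, n_embd)
w_k = blocks[:, 2].reshape(n_embd, n_embd)
```

Reading the slices back in the naive [Q, K, V] order would silently swap the K and V projections, which is why the adapter must account for this layout.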