transformer_lens.model_bridge.supported_architectures.gpt_bigcode module

GPTBigCode architecture adapter.

class transformer_lens.model_bridge.supported_architectures.gpt_bigcode.GPTBigCodeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPTBigCode models.

GPTBigCode is a GPT-2 variant using Multi-Query Attention (MQA): a single fused c_attn projection whose output splits asymmetrically into [embed_dim, head_dim, head_dim] for Q/K/V (rather than three equal thirds). All other structure (module paths, LayerNorm, learned pos embeddings, standard MLP) is identical to GPT-2.

All public models use multi_query=True (1 KV head). The adapter assumes MQA throughout.

All linear layers have biases (attn.c_attn, attn.c_proj, mlp.c_fc, mlp.c_proj). lm_head has no bias and its weight is tied to transformer.wte.weight.

Weight layout difference from GPT-2: GPTBigCode uses nn.Linear (weights stored [out, in]) rather than GPT-2’s Conv1D ([in, out]), so no unembed weight transpose is needed.
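The asymmetric fused split described above can be sketched as follows. This is a minimal illustration with toy dimensions, not the adapter's actual code; the names embed_dim, head_dim, and c_attn mirror the conventions described in this page.

```python
import torch

# Toy dimensions for illustration only (not from any real checkpoint).
embed_dim, n_heads = 64, 8
head_dim = embed_dim // n_heads  # 8

# Fused c_attn projection: output width is embed_dim + 2 * head_dim
# (Q for all heads, K and V for a single shared KV head), not the
# 3 * embed_dim of standard GPT-2 multi-head attention.
c_attn = torch.nn.Linear(embed_dim, embed_dim + 2 * head_dim)

x = torch.randn(2, 5, embed_dim)  # [batch, seq, embed_dim]
fused = c_attn(x)                 # [batch, seq, embed_dim + 2 * head_dim]

# Asymmetric split: [embed_dim, head_dim, head_dim] rather than three
# equal thirds.
q, k, v = fused.split([embed_dim, head_dim, head_dim], dim=-1)
assert q.shape == (2, 5, embed_dim)
assert k.shape == v.shape == (2, 5, head_dim)
```

Because nn.Linear stores weights as [out, in], the fused weight can be split along dim 0 with the same [embed_dim, head_dim, head_dim] sizes.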

class transformer_lens.model_bridge.supported_architectures.gpt_bigcode.MQAQKVConversionRule(n_heads: int, d_head: int)

Bases: BaseTensorConversion

Rearranges Q/K/V activations for MQA.

The Q output has embed_dim features and is rearranged with n=n_heads; the K/V outputs have head_dim features each (a single KV head) and are rearranged with n=1.

handle_conversion(input_value: Tensor, *_: Any) → Tensor
revert(input_value: Tensor, *_: Any) → Tensor

Revert the conversion. For now, just return the input unchanged.
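The rearrangement this rule performs can be sketched in plain torch. This is an illustrative reshape equivalent to an einops pattern like "batch seq (n head) -> batch seq n head", with toy dimensions; it is not the class's actual implementation.

```python
import torch

# Toy dimensions for illustration only.
n_heads, d_head = 8, 8
batch, seq = 2, 5

q = torch.randn(batch, seq, n_heads * d_head)  # Q: embed_dim features
k = torch.randn(batch, seq, d_head)            # K: one KV head (MQA)

# Q is split across all heads: n=n_heads.
q_split = q.reshape(batch, seq, n_heads, d_head)
# K/V have a single KV head: n=1, so the head axis has size 1.
k_split = k.reshape(batch, seq, 1, d_head)

assert q_split.shape == (batch, seq, n_heads, d_head)
assert k_split.shape == (batch, seq, 1, d_head)
```

Downstream code that expects a per-head axis can then broadcast the size-1 KV head against all n_heads query heads.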