transformer_lens.model_bridge.supported_architectures.gpt_bigcode module¶
GPTBigCode architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.gpt_bigcode.GPTBigCodeArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for GPTBigCode models.
GPTBigCode is a GPT-2 variant using Multi-Query Attention (MQA): a single fused c_attn projection whose output splits asymmetrically into [embed_dim, head_dim, head_dim] for Q/K/V (rather than three equal thirds). All other structure (module paths, LayerNorm, learned pos embeddings, standard MLP) is identical to GPT-2.
All public models use multi_query=True (1 KV head). The adapter assumes MQA throughout.
All linear layers have biases (c_attn, c_proj, c_fc, mlp.c_proj). lm_head has no bias and its weight is tied to transformer.wte.weight.
Weight layout difference from GPT-2: GPTBigCode uses nn.Linear (weights stored [out, in]) rather than GPT-2’s Conv1D ([in, out]), so no unembed weight transpose is needed.
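The asymmetric fused-projection split described above can be sketched as follows. This is a minimal illustration of the MQA split, not the adapter's actual implementation; the dimensions (`n_heads=12`, `d_head=64`) are assumed for the example.

```python
import torch

# Assumed illustrative dimensions (not from a specific checkpoint).
n_heads, d_head = 12, 64
embed_dim = n_heads * d_head  # 768
batch, seq = 2, 5

# Fused c_attn output has embed_dim + 2 * d_head features per position,
# not 3 * embed_dim as in standard multi-head GPT-2.
c_attn_out = torch.randn(batch, seq, embed_dim + 2 * d_head)

# Asymmetric MQA split: Q gets embed_dim features, K and V get d_head each.
q, k, v = c_attn_out.split([embed_dim, d_head, d_head], dim=-1)

print(q.shape)  # torch.Size([2, 5, 768])
print(k.shape)  # torch.Size([2, 5, 64])
print(v.shape)  # torch.Size([2, 5, 64])
```

Because K and V are shared across all query heads, the KV cache is a factor of `n_heads` smaller than in standard multi-head attention.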
- class transformer_lens.model_bridge.supported_architectures.gpt_bigcode.MQAQKVConversionRule(n_heads: int, d_head: int)¶
Bases: BaseTensorConversion
Rearranges Q/K/V activations for MQA.
Q output has embed_dim features -> rearrange with n=n_heads. K/V output has head_dim features (1 KV head) -> rearrange with n=1.
- handle_conversion(input_value: Tensor, *_: Any) → Tensor¶
- revert(input_value: Tensor, *_: Any) → Tensor¶
Revert the conversion. For now, just return the input unchanged.
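The two rearrange patterns the rule applies can be sketched with plain reshapes. This is an illustrative sketch of the head-splitting logic, not the rule's actual code; the dimensions are assumed for the example.

```python
import torch

# Assumed illustrative dimensions (not from a specific checkpoint).
n_heads, d_head = 12, 64
batch, seq = 2, 5

# Q activations carry embed_dim = n_heads * d_head features:
# split them into n_heads separate heads (the n=n_heads case).
q = torch.randn(batch, seq, n_heads * d_head)
q_heads = q.reshape(batch, seq, n_heads, d_head)

# K/V activations carry only d_head features: a single shared
# KV head, so the head axis has size 1 (the n=1 case).
k = torch.randn(batch, seq, d_head)
k_heads = k.reshape(batch, seq, 1, d_head)

print(q_heads.shape)  # torch.Size([2, 5, 12, 64])
print(k_heads.shape)  # torch.Size([2, 5, 1, 64])
```

Both patterns are the same `(n h) -> n h` factoring of the feature axis; only the head count `n` differs between Q and K/V.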