transformer_lens.model_bridge.supported_architectures package¶
Submodules¶
- transformer_lens.model_bridge.supported_architectures.apertus module
- transformer_lens.model_bridge.supported_architectures.bert module
- transformer_lens.model_bridge.supported_architectures.bloom module
- transformer_lens.model_bridge.supported_architectures.codegen module
- transformer_lens.model_bridge.supported_architectures.cohere module
- transformer_lens.model_bridge.supported_architectures.deepseek_v3 module
- transformer_lens.model_bridge.supported_architectures.falcon module
- transformer_lens.model_bridge.supported_architectures.gemma1 module
- transformer_lens.model_bridge.supported_architectures.gemma2 module
- transformer_lens.model_bridge.supported_architectures.gemma3 module
- transformer_lens.model_bridge.supported_architectures.gemma3_multimodal module
- transformer_lens.model_bridge.supported_architectures.gpt2 module
- transformer_lens.model_bridge.supported_architectures.gpt2_lm_head_custom module
- transformer_lens.model_bridge.supported_architectures.gpt_bigcode module
- transformer_lens.model_bridge.supported_architectures.gpt_oss module
- transformer_lens.model_bridge.supported_architectures.gptj module
- transformer_lens.model_bridge.supported_architectures.granite module
- transformer_lens.model_bridge.supported_architectures.granite_moe module
- transformer_lens.model_bridge.supported_architectures.granite_moe_hybrid module
- transformer_lens.model_bridge.supported_architectures.hubert module
- transformer_lens.model_bridge.supported_architectures.internlm2 module
- transformer_lens.model_bridge.supported_architectures.llama module
- transformer_lens.model_bridge.supported_architectures.llava module
- transformer_lens.model_bridge.supported_architectures.llava_next module
- transformer_lens.model_bridge.supported_architectures.llava_onevision module
- transformer_lens.model_bridge.supported_architectures.mamba module
- transformer_lens.model_bridge.supported_architectures.mamba2 module
- transformer_lens.model_bridge.supported_architectures.mingpt module
- transformer_lens.model_bridge.supported_architectures.mistral module
- transformer_lens.model_bridge.supported_architectures.mixtral module
- transformer_lens.model_bridge.supported_architectures.mpt module
- transformer_lens.model_bridge.supported_architectures.nanogpt module
- transformer_lens.model_bridge.supported_architectures.neel_solu_old module
- transformer_lens.model_bridge.supported_architectures.neo module
- transformer_lens.model_bridge.supported_architectures.neox module
- transformer_lens.model_bridge.supported_architectures.olmo module
- transformer_lens.model_bridge.supported_architectures.olmo2 module
- transformer_lens.model_bridge.supported_architectures.olmo3 module
- transformer_lens.model_bridge.supported_architectures.olmoe module
- transformer_lens.model_bridge.supported_architectures.openelm module
- transformer_lens.model_bridge.supported_architectures.opt module
- transformer_lens.model_bridge.supported_architectures.phi module
- transformer_lens.model_bridge.supported_architectures.phi3 module
- transformer_lens.model_bridge.supported_architectures.pythia module
- transformer_lens.model_bridge.supported_architectures.qwen module
- transformer_lens.model_bridge.supported_architectures.qwen2 module
- transformer_lens.model_bridge.supported_architectures.qwen3 module
- transformer_lens.model_bridge.supported_architectures.qwen3_5 module
- transformer_lens.model_bridge.supported_architectures.qwen3_moe module
- transformer_lens.model_bridge.supported_architectures.qwen3_next module
- transformer_lens.model_bridge.supported_architectures.stablelm module
- transformer_lens.model_bridge.supported_architectures.t5 module
- transformer_lens.model_bridge.supported_architectures.xglm module
Module contents¶
Supported architecture adapters.
This module contains all the supported architecture adapters for different model architectures.
- class transformer_lens.model_bridge.supported_architectures.ApertusArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Apertus models.
Apertus uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE with LLaMA-3 scaling), grouped query attention (GQA), non-gated MLP (XiELU activation), and no biases on any projections.
Similar to Qwen3 (pre-norm RMSNorm, QK-norm, GQA, RoPE) but uses a non-gated MLP (up_proj -> XiELU -> down_proj) instead of gated MLP.
Note: Apertus uses different layer norm names than most Llama-family models:
- attention_layernorm (instead of input_layernorm)
- feedforward_layernorm (instead of post_attention_layernorm)
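The non-gated MLP shape described above can be sketched with a toy numpy example. This is a hedged illustration with assumed toy sizes; `act` is a stand-in placeholder for XiELU, whose exact form is defined by HF's XIELUActivation:

```python
import numpy as np

d_model, d_mlp = 8, 32
rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_mlp, d_model))    # up_proj, no bias
W_down = rng.standard_normal((d_model, d_mlp))  # down_proj, no bias

def act(x):
    # placeholder nonlinearity (ELU-like); NOT the real XiELU formula
    return np.where(x > 0, x, np.exp(x) - 1)

def non_gated_mlp(x):
    # up_proj -> activation -> down_proj, no gate branch (unlike Qwen3)
    return act(x @ W_up.T) @ W_down.T

out = non_gated_mlp(rng.standard_normal((2, d_model)))
```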
- __init__(cfg: Any) None¶
Initialize the Apertus architecture adapter.
- prepare_loading(model_name: str, model_kwargs: dict) None¶
Patch XIELUActivation to defer eager .item() calls for meta tensor compat.
Transformers v5 uses meta tensors during from_pretrained, but XIELUActivation.__init__ eagerly calls .item() on beta/eps buffers to precompute _beta_scalar/_eps_scalar for the CUDA kernel path. This fails on meta device. Once upstream fixes this (transformers PR #43473), this patch can be removed.
Instead of reimplementing __init__, we wrap it to catch the meta tensor failure and defer scalar computation to forward() time.
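The wrap-don't-reimplement pattern can be sketched as follows. All names here are hypothetical; the real patch targets XIELUActivation.__init__ and catches the meta-tensor failure specifically:

```python
def defer_init_failures(cls):
    """Wrap cls.__init__ so an eager-computation failure is recorded and
    deferred rather than raised; forward() can finish the work lazily.
    (Hypothetical sketch of the pattern, not the actual patch.)"""
    original_init = cls.__init__

    def wrapped_init(self, *args, **kwargs):
        try:
            original_init(self, *args, **kwargs)
            self._scalars_ready = True
        except RuntimeError:
            # e.g. .item() called on a meta tensor; defer to forward() time
            self._scalars_ready = False

    cls.__init__ = wrapped_init
    return cls

class Activation:
    def __init__(self):
        raise RuntimeError(".item() on meta tensor")  # simulated failure

defer_init_failures(Activation)
a = Activation()  # no longer raises; scalars marked as deferred
```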
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Apertus component testing.
Apertus uses RoPE (Rotary Position Embeddings). We set the rotary_emb on all attention bridge instances for component testing.
We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks.
- Parameters:
hf_model – The HuggingFace Apertus model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- class transformer_lens.model_bridge.supported_architectures.BertArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for BERT models.
- __init__(cfg: Any) None¶
Initialize the BERT architecture adapter.
- Parameters:
cfg – The configuration object.
- prepare_model(hf_model: Any) None¶
Adjust component mapping based on the actual HF model variant.
BertForMaskedLM has cls.predictions (MLM head). BertForNextSentencePrediction has cls.seq_relationship (NSP head) and no MLM-specific LayerNorm.
- class transformer_lens.model_bridge.supported_architectures.BloomArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Bloom models.
- __init__(cfg: Any) None¶
Initialize the Bloom architecture adapter.
- split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]¶
Split the QKV matrix into separate linear transformations.
- Parameters:
attention_component – The original attention layer component
- Returns:
Tuple of nn.Linear modules for Q, K, and V transformations
- class transformer_lens.model_bridge.supported_architectures.CodeGenArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for CodeGen models.
CodeGen uses a parallel attention+MLP block (attn and MLP share the same LayerNorm input and their outputs are summed). The attention layer uses a fused qkv_proj weight whose layout follows GPT-J's mp_num=4 tensor-parallel partitioning: the rows are interleaved as [Q_part, V_part, K_part] within each of the 4 MP partitions.
Optional Parameters (may be absent in some CodeGen checkpoints):¶
No bias on qkv_proj (fused QKV has no bias)
No bias on out_proj
No bias on mlp.fc_in or mlp.fc_out
- __init__(cfg: Any) None¶
Initialize the CodeGen architecture adapter.
- split_qkv_matrix(attn_component: Any) tuple[Linear, Linear, Linear]¶
Split the fused QKV weight into separate Q, K, V linear modules.
CodeGen uses GPT-J-style tensor-parallel partitioning with mp_num=4 partitions. Within each partition the row order is [Q_part, V_part, K_part], i.e. not the conventional Q/K/V order.
The fused weight has shape [3 * n_embd, n_embd]. We reshape to [mp_num, 3, local_dim, n_embd], extract the three slices, then flatten back to [n_embd, n_embd] for each of Q, K, V.
- Parameters:
attn_component – The original CodeGenAttention module.
- Returns:
Tuple of (q_linear, k_linear, v_linear) — three nn.Linear modules with no bias and weight shape [n_embd, n_embd].
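The reshape described in the docstring can be checked with a toy numpy sketch (assumed toy sizes; local_dim = n_embd // mp_num, since each partition holds one Q/V/K slice):

```python
import numpy as np

n_embd, mp_num = 16, 4
local_dim = n_embd // mp_num
rng = np.random.default_rng(0)
fused = rng.standard_normal((3 * n_embd, n_embd))  # qkv_proj.weight

# Within each MP partition the row blocks are ordered [Q, V, K]
w = fused.reshape(mp_num, 3, local_dim, n_embd)
w_q = w[:, 0].reshape(n_embd, n_embd)
w_v = w[:, 1].reshape(n_embd, n_embd)
w_k = w[:, 2].reshape(n_embd, n_embd)

# First partition: rows 0..local_dim are Q, the next local_dim rows are V
assert np.array_equal(w_q[:local_dim], fused[:local_dim])
assert np.array_equal(w_v[:local_dim], fused[local_dim:2 * local_dim])
```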
- class transformer_lens.model_bridge.supported_architectures.CohereArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Cohere models (CohereForCausalLM).
Architectural quirks vs. standard decoder-only models:
- Single input_layernorm per block; NO post_attention_layernorm. Attention and MLP both read the SAME normed hidden states (parallel).
- CohereLayerNorm is true LayerNorm (mean-subtracting), NOT RMSNorm. It has a weight parameter but NO bias parameter.
- Logit scale: CohereForCausalLM.forward multiplies logits by logit_scale (default 0.0625 = 1/16). Folded into unembed.weight via preprocess_weights.
- Rotary embeddings use repeat_interleave instead of cat-split (delegated to HF).
Optional parameters (absent from state_dict by default):
- blocks.{i}.attn.b_Q/b_K/b_V/b_O — no bias on projections (attention_bias=False)
- blocks.{i}.mlp.b_gate/b_in/b_out — no bias on MLP projections
- blocks.{i}.ln1.b — CohereLayerNorm has no bias
- ln_final.b — CohereLayerNorm has no bias
- __init__(cfg: Any) None¶
Initialize the Cohere architecture adapter.
- preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]¶
Fold logit_scale into unembed weights before ProcessWeights runs.
bridge.py lines 726-732 clone unembed.weight before calling this, so scaling does not affect the tied embed.weight. logit_scale=1.0 is a no-op (skipped for efficiency).
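The fold is a simple associativity identity, shown here with a toy numpy sketch (assumed toy shapes):

```python
import numpy as np

logit_scale = 0.0625  # Cohere default, 1 / 16
rng = np.random.default_rng(0)
W_U = rng.standard_normal((8, 20))   # toy [d_model, d_vocab] unembed
resid = rng.standard_normal((3, 8))

# Folding identity: (resid @ W_U) * s  ==  resid @ (W_U * s)
W_U_folded = W_U * logit_scale
assert np.allclose((resid @ W_U) * logit_scale, resid @ W_U_folded)
```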
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set rotary embedding reference on attention bridges for component testing.
CohereRotaryEmbedding lives at hf_model.model.rotary_emb. The bridge delegates to it directly, preserving the repeat_interleave RoPE convention without re-implementing it in TL.
Pattern matches llama.py and qwen2.py.
- class transformer_lens.model_bridge.supported_architectures.DeepSeekV3ArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for DeepSeek V3 / R1 models.
Uses RMSNorm, MLA with compressed Q/KV projections, partial RoPE, MoE on most layers (dense MLP on first few), and no biases.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for component testing.
- class transformer_lens.model_bridge.supported_architectures.FalconArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Falcon models (FalconForCausalLM).
- prepare_model(hf_model: Any) None¶
Patch Falcon modules to avoid backward hook conflicts.
Two issues:
1. FalconLinear does input @ self.weight.T where .T is a view — clone the transpose to break the view chain.
2. FalconDecoderLayer does mlp_output += attention_output (inplace) — this modifies a tensor captured by mlp.hook_out's backward hook. Patch to use non-inplace addition.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for component testing.
- class transformer_lens.model_bridge.supported_architectures.GPT2ArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for GPT2 models.
Optional Parameters (may not exist in state_dict):¶
GPT-2 models HAVE biases on ALL linear layers:
✓ blocks.{i}.attn.b_Q - Has bias (from combined c_attn.bias)
✓ blocks.{i}.attn.b_K - Has bias (from combined c_attn.bias)
✓ blocks.{i}.attn.b_V - Has bias (from combined c_attn.bias)
✓ blocks.{i}.attn.b_O - Has bias (c_proj.bias)
✓ blocks.{i}.mlp.b_in - Has bias (c_fc.bias)
✓ blocks.{i}.mlp.b_out - Has bias (c_proj.bias)
✓ blocks.{i}.ln1.b - LayerNorm has bias
✓ blocks.{i}.ln2.b - LayerNorm has bias
✓ ln_final.b - LayerNorm has bias
No optional parameters - all biases exist in GPT-2.
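The even three-way split of the fused c_attn parameters can be sketched with numpy (toy sizes; GPT-2's Conv1D stores weights as [in, out], so the weight splits along the last axis):

```python
import numpy as np

d_model = 12
rng = np.random.default_rng(0)
c_attn_w = rng.standard_normal((d_model, 3 * d_model))  # Conv1D: [in, out]
c_attn_b = rng.standard_normal(3 * d_model)

# Q, K, V each take one equal third of the fused output dimension
w_q, w_k, w_v = np.split(c_attn_w, 3, axis=1)
b_q, b_k, b_v = np.split(c_attn_b, 3)
```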
- __init__(cfg: Any) None¶
Initialize the GPT2 architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.GPTBigCodeArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for GPTBigCode models.
GPTBigCode is a GPT-2 variant using Multi-Query Attention (MQA): a single fused c_attn projection whose output splits asymmetrically into [embed_dim, head_dim, head_dim] for Q/K/V (rather than three equal thirds). All other structure (module paths, LayerNorm, learned pos embeddings, standard MLP) is identical to GPT-2.
All public models use multi_query=True (1 KV head). The adapter assumes MQA throughout.
All linear layers have biases (c_attn, c_proj, c_fc, mlp.c_proj). lm_head has no bias and its weight is tied to transformer.wte.weight.
Weight layout difference from GPT-2: GPTBigCode uses nn.Linear (weights stored [out, in]) rather than GPT-2’s Conv1D ([in, out]), so no unembed weight transpose is needed.
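The asymmetric MQA split can be sketched with numpy (assumed toy sizes; rows split along the output axis since weights are stored [out, in]):

```python
import numpy as np

d_model, d_head = 16, 4  # toy sizes; one shared KV head (MQA)
rng = np.random.default_rng(0)
c_attn_w = rng.standard_normal((d_model + 2 * d_head, d_model))  # nn.Linear: [out, in]

# Asymmetric split: full-width Q, then one head each for K and V
w_q, w_k, w_v = np.split(c_attn_w, [d_model, d_model + d_head], axis=0)
```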
- class transformer_lens.model_bridge.supported_architectures.GPTOSSArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for GPT-OSS models.
- __init__(cfg: Any) None¶
Initialize the GPT-OSS architecture adapter.
- setup_hook_compatibility(bridge_model: Any) None¶
Setup hook compatibility transformations for GPT-OSS models.
This configures rotary embedding references for attention layers, which is needed for models using RoPE (Rotary Position Embeddings).
This is called during Bridge.__init__ and should always be run.
- Parameters:
bridge_model – The TransformerBridge instance
- setup_no_processing_hooks(bridge_model: Any) None¶
Backward compatibility alias for setup_hook_compatibility.
- class transformer_lens.model_bridge.supported_architectures.Gemma1ArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Gemma1 models.
- __init__(cfg: Any) None¶
Initialize the Gemma1 architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Gemma1 component testing.
Gemma1 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Gemma1 model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- setup_hook_compatibility(bridge: Any) None¶
Setup hook compatibility for Gemma1 models.
Gemma1 scales embeddings by sqrt(d_model) in its forward pass, but the HuggingFace embed_tokens layer doesn’t include this scaling. We need to apply it to hook_embed to match HookedTransformer behavior.
- Parameters:
bridge – The TransformerBridge instance
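A toy numpy sketch of the scaling the hook must apply (assumed shapes; the real conversion lives in the bridge's hook machinery):

```python
import math
import numpy as np

d_model = 64
rng = np.random.default_rng(0)
embed_out = rng.standard_normal((2, 5, d_model))  # raw embed_tokens output

# Gemma1 multiplies embeddings by sqrt(d_model) before the first block,
# so hook_embed must expose the scaled values to match HookedTransformer
hook_embed_value = embed_out * math.sqrt(d_model)
```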
- class transformer_lens.model_bridge.supported_architectures.Gemma2ArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Gemma2 models.
- __init__(cfg: Any) None¶
Initialize the Gemma2 architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references and attention implementation for Gemma-2 component testing.
Gemma-2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks, while HF defaults to “sdpa”. These produce mathematically equivalent results but with small numerical differences due to different implementations.
- Parameters:
hf_model – The HuggingFace Gemma-2 model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- setup_hook_compatibility(bridge: Any) None¶
Setup hook compatibility for Gemma2 models.
Gemma2 scales embeddings by sqrt(d_model). The weights are pre-scaled via preprocess_weights(), but we still need to apply the scaling conversion to the hook output for proper hook functionality (so user modifications are correctly scaled/unscaled).
- Parameters:
bridge – The TransformerBridge instance
- class transformer_lens.model_bridge.supported_architectures.Gemma3ArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Gemma3 models.
- __init__(cfg: Any) None¶
Initialize the Gemma3 architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references and native autograd for Gemma-3 component testing.
Gemma-3 uses dual RoPE (global + local). We set local RoPE (used by 85% of layers) on all attention bridge instances for component testing.
We also enable use_native_layernorm_autograd on all normalization bridges to ensure they delegate to HuggingFace’s exact implementation instead of using manual computation.
Additionally, we force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks, while HF defaults to “sdpa”. These produce mathematically equivalent results but with small numerical differences due to different implementations.
Note: Layers 5, 11, 17, 23 use global RoPE but will use local in component tests. This is an acceptable tradeoff given the shared-instance constraint.
- Parameters:
hf_model – The HuggingFace Gemma-3 model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- setup_hook_compatibility(bridge: Any) None¶
Setup hook compatibility for Gemma3 models.
Unlike Gemma1/Gemma2, Gemma3 uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) INSIDE the embedding layer’s forward(). Therefore we do NOT need a hook_conversion — the embed.hook_out already captures the scaled output. Adding a conversion would double-scale.
(Gemma1/Gemma2 scale in GemmaModel.forward() AFTER the embedding layer, so their adapters correctly use EmbeddingScaleConversion to match HT.)
- Parameters:
bridge – The TransformerBridge instance
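The difference can be illustrated with a toy stand-in class (hypothetical names; a sketch of the scale-inside-forward pattern, not the HF implementation):

```python
import math
import numpy as np

class ScaledWordEmbedding:
    """Toy stand-in for Gemma3TextScaledWordEmbedding: the sqrt(d_model)
    scaling happens INSIDE forward(), so a hook on the output already
    sees scaled values and needs no extra conversion."""

    def __init__(self, vocab, d_model, seed=0):
        self.weight = np.random.default_rng(seed).standard_normal((vocab, d_model))
        self.scale = math.sqrt(d_model)

    def __call__(self, ids):
        return self.weight[ids] * self.scale  # scaling applied internally

embed = ScaledWordEmbedding(vocab=10, d_model=16)
out = embed(np.array([1, 2, 3]))  # already scaled; no hook conversion needed
```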
- class transformer_lens.model_bridge.supported_architectures.Gemma3MultimodalArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Gemma3 multimodal models (Gemma3ForConditionalGeneration).
This adapter handles vision-language models like Gemma 3 4B/12B/27B and MedGemma. The model structure is:
- model.vision_tower: SigLIP vision encoder
- model.multi_modal_projector: Projects vision embeddings to language space
- model.language_model: Gemma3TextModel (same as text-only Gemma 3)
- lm_head: Output projection
The language model component follows the same patterns as Gemma3ArchitectureAdapter.
- __init__(cfg: Any) None¶
Initialize the Gemma3 multimodal architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Gemma-3 multimodal component testing.
The language model uses dual RoPE (global + local) like text-only Gemma 3.
- Parameters:
hf_model – The HuggingFace Gemma-3 multimodal model instance
bridge_model – The TransformerBridge model (if available)
- setup_hook_compatibility(bridge: Any) None¶
Setup hook compatibility for Gemma3 multimodal models.
Like text-only Gemma 3, the multimodal model uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) internally in its forward() method. No additional hook conversion is needed — adding one would double-scale the embeddings.
- Parameters:
bridge – The TransformerBridge instance
- class transformer_lens.model_bridge.supported_architectures.Gpt2LmHeadCustomArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for GPT-2 LM Head Custom models.
- __init__(cfg: Any) None¶
Initialize the GPT-2 LM Head Custom architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.GptjArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for GPTJ models.
- __init__(cfg: Any) None¶
Initialize the GPTJ architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.GraniteArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for IBM Granite models (dense).
Granite is a Llama-like architecture with RMSNorm, rotary position embeddings (RoPE), GQA, and a gated MLP (SiLU activation). Granite-specific scaling multipliers are handled by the HF model’s native forward pass.
Optional Parameters (may not exist in state_dict):¶
Granite models do NOT have biases on attention and MLP projections:
blocks.{i}.attn.b_Q/b_K/b_V/b_O - No bias on attention projections
blocks.{i}.mlp.b_in/b_gate/b_out - No bias on MLP projections
blocks.{i}.ln1.b, blocks.{i}.ln2.b, ln_final.b - RMSNorm has no bias
- __init__(cfg: Any) None¶
Initialize the Granite architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Granite component testing.
- Parameters:
hf_model – The HuggingFace Granite model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.GraniteMoeArchitectureAdapter(cfg: Any)¶
Bases:
GraniteArchitectureAdapter
Architecture adapter for IBM Granite MoE models.
Identical to dense Granite but replaces the gated MLP with a Sparse Mixture of Experts block (block_sparse_moe) using batched expert parameters and top-k routing.
- class transformer_lens.model_bridge.supported_architectures.GraniteMoeHybridArchitectureAdapter(cfg: Any)¶
Bases:
GraniteArchitectureAdapter
Hybrid Mamba2 + Attention with Sparse MoE.
Attention is optional (absent on Mamba layers). shared_mlp and MoE are universal. Inherits Granite config and attention bridge construction.
- class transformer_lens.model_bridge.supported_architectures.HubertArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for HuBERT audio models.
HubertForCTC nests HubertModel under a ‘hubert.’ prefix; prepare_model() detects this and adjusts component paths.
- prepare_loading(model_name: str, model_kwargs: dict) None¶
Propagate HuBERT-specific HF config attributes to bridge config.
Prevents silent-default bugs where adapter reads from bridge config but the attribute was never propagated from HF config.
- prepare_model(hf_model: Any) None¶
Detect HubertForCTC (has ‘hubert.’ prefix) and add CTC head.
- class transformer_lens.model_bridge.supported_architectures.InternLM2ArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for InternLM2 models.
InternLM2 uses remote code (trust_remote_code=True) and differs from Llama in:
- Fused interleaved GQA wqkv weight (not standard [Q|K|V] split)
- Non-standard module names: tok_embeddings, output, attention, feed_forward, wqkv/wo, w1(gate)/w3(up)/w2(down), attention_norm, ffn_norm
- Per-layer rotary_emb (no model-level shared instance)
- supports_fold_ln=False: fold_ln is done manually in preprocess_weights because the bridge state dict has the fused qkv key, not split q/k/v keys, so fold_layer_norm's extract_attention_tensors_for_folding would silently skip attn.
Optional parameters (may not exist in state_dict):
- blocks.{i}.attn.b_Q / b_K / b_V / b_O — config.bias=False on shipped models
- blocks.{i}.mlp.b_gate / b_in / b_out — MLP always bias=False
- blocks.{i}.ln1.b / ln2.b / ln_final.b — RMSNorm has no bias
- prepare_loading(model_name: str, model_kwargs: dict) None¶
Patch transformers v5 incompatibilities before from_pretrained runs.
- preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]¶
Fold layer norms into QKV and MLP weights.
Standard fold_ln can’t reach split Q/K/V when wqkv is fused in the bridge state dict. We extract and fold here, then write split keys so RearrangeTensorConversion can follow. MLP projections (w1/w2/w3) are separate linears so they fold normally. Mirrors phi3.py.preprocess_weights, adapted for InternLM2’s layout.
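The fold itself rests on a standard identity, sketched here with numpy (toy shapes; the RMSNorm scale g folds into the input columns of the following linear):

```python
import numpy as np

d_model, d_out = 8, 8
rng = np.random.default_rng(0)
g = rng.standard_normal(d_model)           # RMSNorm scale weight
W = rng.standard_normal((d_out, d_model))  # following linear, [out, in]
x = rng.standard_normal((3, d_model))
x_norm = x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-6)

# Folding identity: (x_norm * g) @ W.T == x_norm @ (W * g).T
W_folded = W * g  # scales each input column of W by g
assert np.allclose((x_norm * g) @ W.T, x_norm @ W_folded.T)
```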
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Inject per-layer rotary embedding for component testing.
- class transformer_lens.model_bridge.supported_architectures.LlamaArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Llama models.
Optional Parameters (may not exist in state_dict):¶
LLaMA models do NOT have biases on attention and MLP projections:
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.mlp.b_in - No bias on MLP input (up_proj)
blocks.{i}.mlp.b_gate - No bias on MLP gate projection
blocks.{i}.mlp.b_out - No bias on MLP output (down_proj)
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
Weight processing must handle these missing biases gracefully using ProcessWeights._safe_get_tensor() or by checking for None values.
- __init__(cfg: Any) None¶
Initialize the Llama architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Llama component testing.
Llama uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Llama model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- class transformer_lens.model_bridge.supported_architectures.LlavaArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for LLava multimodal models (LlavaForConditionalGeneration).
This adapter handles vision-language models like LLava 1.5. The model structure is:
- model.vision_tower: CLIP vision encoder
- model.multi_modal_projector: 2-layer MLP (Linear -> GELU -> Linear)
- model.language_model: LlamaForCausalLM
  - model.language_model.model.embed_tokens
  - model.language_model.model.layers[]: LLaMA transformer blocks
  - model.language_model.model.norm
  - model.language_model.lm_head
The language model component follows the same patterns as LlamaArchitectureAdapter.
- __init__(cfg: Any) None¶
Initialize the LLava architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for LLava component testing.
LLava uses a LLaMA language backbone with RoPE. We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace LLava model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.LlavaNextArchitectureAdapter(cfg: Any)¶
Bases:
LlavaArchitectureAdapter
Architecture adapter for LLaVA-NeXT (1.6) models.
- class transformer_lens.model_bridge.supported_architectures.LlavaOnevisionArchitectureAdapter(cfg: Any)¶
Bases:
LlavaArchitectureAdapter
Architecture adapter for LLaVA-OneVision models.
- prepare_model(hf_model: Any) None¶
Fix weight tying when text_config and top-level config disagree.
Some checkpoints have tie_word_embeddings=True in text_config but False at the top level, leaving lm_head randomly initialized.
- class transformer_lens.model_bridge.supported_architectures.MPTArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
MPT adapter: ALiBi bias; all layers bias-free (no b_Q/b_K/b_V/b_O/b_in/b_out/ln bias).
- class transformer_lens.model_bridge.supported_architectures.Mamba2ArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Wraps HF’s Mamba2ForCausalLM.
Differs from Mamba-1 at the mixer level: fused in_proj (no x_proj/dt_proj), two-input inner norm, multi-head structure with num_heads/head_dim/n_groups, and a [num_heads]-shaped dt_bias. Shares SSMBlockBridge, DepthwiseConv1DBridge, and the stateful generation loop with Mamba-1.
- applicable_phases: list[int] = []¶
- create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: dtype) Any¶
Build a Mamba2Cache for the stateful generation loop.
- class transformer_lens.model_bridge.supported_architectures.MambaArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Wraps HF’s MambaForCausalLM. No attention, no positional embeddings.
SSM config fields (state_size, conv_kernel, expand, time_step_rank, intermediate_size) are propagated from the HF config via _HF_PASSTHROUGH_ATTRS in sources/transformers.py.
- applicable_phases: list[int] = []¶
- create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: dtype) Any¶
Build a MambaCache for the stateful generation loop.
- class transformer_lens.model_bridge.supported_architectures.MingptArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for MinGPT models.
- __init__(cfg: Any) None¶
Initialize the MinGPT architecture adapter.
- Parameters:
cfg – The configuration object.
- class transformer_lens.model_bridge.supported_architectures.MistralArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Mistral models.
- __init__(cfg: Any) None¶
Initialize the Mistral architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.MixtralArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Mixtral models.
Mixtral uses a pre-norm architecture with RMSNorm, rotary position embeddings (RoPE), and a Sparse Mixture of Experts MLP. Key features:
Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.
Rotary embeddings: stored at model.rotary_emb and passed per-forward-call.
Sparse MoE: batched expert parameters (gate_up_proj, down_proj as 3D tensors).
MixtralAttention.forward() requires position_embeddings and attention_mask args.
Optional GQA (n_key_value_heads may differ from n_heads).
- __init__(cfg: Any) None¶
Initialize the Mixtral architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Mixtral component testing.
Mixtral uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Mixtral model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.NanogptArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for NanoGPT models.
- __init__(cfg: Any) None¶
Initialize the NanoGPT architecture adapter.
- Parameters:
cfg – The configuration object.
- convert_weights(remote_module: Any) dict[str, Tensor]¶
- class transformer_lens.model_bridge.supported_architectures.NeelSoluOldArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Neel’s SOLU models (old style).
- __init__(cfg: Any) None¶
Initialize the Neel SOLU old-style architecture adapter.
- Parameters:
cfg – The configuration object.
- class transformer_lens.model_bridge.supported_architectures.NeoArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for Neo models.
- __init__(cfg: Any) None¶
Initialize the Neo architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.NeoxArchitectureAdapter(cfg: Any)¶
Bases:
ArchitectureAdapter
Architecture adapter for NeoX models.
- __init__(cfg: Any) None¶
Initialize the NeoX architecture adapter.
- Parameters:
cfg – The configuration object.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for GPT-NeoX/StableLM component testing.
GPT-NeoX models use RoPE (Rotary Position Embeddings) which need to be set on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace GPT-NeoX model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]¶
Split the QKV matrix into separate linear transformations.
GPT-NeoX/StableLM uses an interleaved QKV format where the weights are stored as [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, …] - i.e., Q, K, V are interleaved per head.
The weight shape is [n_heads * 3 * d_head, d_model] and the output is reshaped by HuggingFace as [batch, seq, n_heads, 3*d_head] then split on the last dim.
- Parameters:
original_attention_component – The original attention layer component
- Returns:
Tuple of nn.Linear modules for Q, K, and V transformations
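As a sketch of the interleaved layout described above, the split can be done by viewing the weight as [n_heads, 3, d_head, d_model] and selecting along the second axis. The helper below is illustrative, not the adapter’s exact implementation:

```python
import torch
from torch import nn

def split_interleaved_qkv(qkv: nn.Linear, n_heads: int, d_head: int):
    """Split a GPT-NeoX-style interleaved QKV projection into Q, K, V Linears.

    Hypothetical helper mirroring split_qkv_matrix. Weight layout:
    [n_heads * 3 * d_head, d_model], with Q, K, V interleaved per head.
    """
    d_model = qkv.weight.shape[1]
    # View as [n_heads, 3, d_head, d_model]; index 1 selects Q/K/V per head.
    w = qkv.weight.view(n_heads, 3, d_head, d_model)
    has_bias = qkv.bias is not None
    b = qkv.bias.view(n_heads, 3, d_head) if has_bias else None
    outs = []
    for s in range(3):  # 0=Q, 1=K, 2=V
        lin = nn.Linear(d_model, n_heads * d_head, bias=has_bias)
        lin.weight.data = w[:, s].reshape(n_heads * d_head, d_model).clone()
        if has_bias:
            lin.bias.data = b[:, s].reshape(n_heads * d_head).clone()
        outs.append(lin)
    return tuple(outs)  # (q_proj, k_proj, v_proj)
```
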
- class transformer_lens.model_bridge.supported_architectures.Olmo2ArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for OLMo 2 models.
OLMo 2 uses a post-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and gated MLP (SwiGLU). Key differences from pre-norm models like Llama:
Post-norm: RMSNorm is applied AFTER attention and AFTER MLP, not before. ln1 maps to post_attention_layernorm, ln2 maps to post_feedforward_layernorm.
Q/K normalization: Per-head RMSNorm applied to queries and keys after projection.
No biases on any projections.
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.mlp.b_in - No bias on MLP up_proj
blocks.{i}.mlp.b_gate - No bias on MLP gate_proj
blocks.{i}.mlp.b_out - No bias on MLP down_proj
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
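The post-norm vs. pre-norm distinction above can be sketched as residual-flow pseudocode; the functions here are illustrative placeholders, not the bridge’s actual components:

```python
# Post-norm (OLMo 2 style): normalization is applied AFTER each sublayer,
# inside the residual add. Illustrative sketch only.
def olmo2_block(x, attn, mlp, ln1, ln2):
    x = x + ln1(attn(x))  # ln1 == post_attention_layernorm
    x = x + ln2(mlp(x))   # ln2 == post_feedforward_layernorm
    return x

# Pre-norm (Llama style), for contrast: normalization BEFORE each sublayer.
def llama_block(x, attn, mlp, ln1, ln2):
    x = x + attn(ln1(x))
    x = x + mlp(ln2(x))
    return x
```
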
- __init__(cfg: Any) None¶
Initialize the OLMo 2 architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for OLMo 2 component testing.
OLMo 2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks.
- Parameters:
hf_model – The HuggingFace OLMo 2 model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.Olmo3ArchitectureAdapter(cfg: Any)¶
Bases: Olmo2ArchitectureAdapter
Architecture adapter for OLMo 3 / OLMo 3.1 models.
OLMo 3 is architecturally identical to OLMo 2 at the weight and component level. The only difference is sliding window attention on some layers (configurable via layer_types), which is handled by the HF model’s forward pass (mask creation) and does not affect weight structure or component mapping.
- class transformer_lens.model_bridge.supported_architectures.OlmoArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for OLMo (v1) models.
OLMo v1 uses a pre-norm architecture with a custom non-learnable LayerNorm (fixed weight=1, bias=0), rotary position embeddings (RoPE), and gated MLP (SwiGLU). Key differences from later OLMo variants:
Pre-norm: LayerNorm is applied BEFORE attention and BEFORE MLP.
Non-learnable LayerNorm: Weight and bias are not trainable parameters. Delegating to HF’s native forward via NormalizationBridge handles this correctly.
No Q/K normalization in attention.
Optional QKV clipping (handled by HF’s native attention forward).
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.mlp.b_in - No bias on MLP up_proj
blocks.{i}.mlp.b_gate - No bias on MLP gate_proj
blocks.{i}.mlp.b_out - No bias on MLP down_proj
- __init__(cfg: Any) None¶
Initialize the OLMo architecture adapter.
- prepare_model(hf_model: Any) None¶
Patch OLMo’s in-place clamp_ to avoid backward hook conflicts.
OLMo v1 uses query_states.clamp_() when config.clip_qkv is set. In-place ops on tensors that pass through register_full_backward_hook trigger PyTorch’s “view modified inplace” error. This patch disables the in-place clamp branch during attention forward passes.
Note: clip_qkv clamping is skipped in the patched forward. In practice, clip_qkv thresholds (typically 100 or higher) are rarely exceeded. If exact clamping is needed, add out-of-place clamp hooks on hook_q/hook_k/hook_v.
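An out-of-place clamp hook might look like the following sketch; the run_with_hooks usage and hook names assume TransformerLens conventions, and the clip value would come from the HF config:

```python
import torch

def make_clamp_hook(clip: float):
    """Build a hook that clamps activations to [-clip, clip].

    Illustrative sketch: torch.clamp is out-of-place, so it is safe with
    registered backward hooks, unlike the in-place clamp_ that
    prepare_model patches out.
    """
    def clamp_hook(tensor, hook=None):
        return torch.clamp(tensor, -clip, clip)
    return clamp_hook

# Hypothetical usage on a loaded bridge:
# bridge.run_with_hooks(
#     tokens,
#     fwd_hooks=[(f"blocks.{i}.attn.hook_q", make_clamp_hook(cfg.clip_qkv))
#                for i in range(cfg.n_layers)],
# )
```
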
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for OLMo component testing.
OLMo uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace OLMo model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.OlmoeArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for OLMoE (Mixture of Experts) models.
OLMoE uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and sparse Mixture of Experts MLP. Key features:
Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.
Q/K normalization: RMSNorm applied to queries and keys after projection.
Sparse MoE: 64 experts with top-8 routing (configurable).
Batched expert parameters: gate_up_proj [num_experts, 2*d_mlp, d_model] and down_proj [num_experts, d_model, d_mlp] as single tensors, not a ModuleList.
Optional QKV clipping (handled by HF’s native attention forward).
No biases on any projections.
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
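The batched-expert layout above can be sketched as a minimal sparse-MoE forward. Shapes follow the description; the gate/up split order and routing renormalization are assumptions, and the dense Python loop over experts stands in for the gather-based kernels real implementations use:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, gate_up, down, top_k=2):
    """Minimal top-k sparse-MoE forward with batched expert tensors.

    x:        [tokens, d_model]
    router_w: [num_experts, d_model]
    gate_up:  [num_experts, 2*d_mlp, d_model]  (gate-first split assumed)
    down:     [num_experts, d_model, d_mlp]
    """
    logits = x @ router_w.T                              # [tokens, num_experts]
    weights, experts = torch.topk(F.softmax(logits, dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)    # renormalize top-k
    out = torch.zeros_like(x)
    d_mlp = gate_up.shape[1] // 2
    for e in range(router_w.shape[0]):
        mask = experts == e                              # [tokens, top_k]
        if not mask.any():
            continue
        tok = mask.any(-1)                               # tokens routed to e
        h = x[tok] @ gate_up[e].T                        # [t, 2*d_mlp]
        gate, up = h[:, :d_mlp], h[:, d_mlp:]
        h = (F.silu(gate) * up) @ down[e].T              # SwiGLU, then down-proj
        w = (weights * mask).sum(-1)[tok].unsqueeze(-1)  # routing weight
        out[tok] += w * h
    return out
```
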
- __init__(cfg: Any) None¶
Initialize the OLMoE architecture adapter.
- prepare_model(hf_model: Any) None¶
Patch OLMoE’s in-place clamp_ to avoid backward hook conflicts.
Same issue as OLMo v1 — see OlmoArchitectureAdapter.prepare_model.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for OLMoE component testing.
OLMoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace OLMoE model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.OpenElmArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Apple OpenELM models.
OpenELM uses a unique architecture with per-layer varying head counts and FFN dimensions. Key characteristics:
Combined QKV projection (qkv_proj) with per-layer varying Q/KV head counts
Gated MLP with combined gate+up projection (proj_1) and per-layer FFN sizes
RMSNorm normalization
Full rotary embeddings (per-layer, not shared)
Optional Q/K RMSNorm (normalize_qk_projections=True)
Weight tying (share_input_output_layers=True typically)
Model root is ‘transformer’ (not ‘model’)
Requires trust_remote_code=True (custom HF code)
The native HF attention handles all per-layer dimension variations, RoPE, GQA group repeat, and Q/K normalization internally. The bridge delegates to the native forward for correct computation.
Note: Individual Q/K/V hooks are not available since the model uses a combined QKV projection. Attention-level hooks (hook_attn_in, hook_attn_out) are provided.
- __init__(cfg: Any) None¶
Initialize the OpenELM architecture adapter.
- prepare_loading(model_name: str, model_kwargs: dict) None¶
Patch OpenELM for compatibility with transformers v5.
Two patches are needed:
1. RotaryEmbedding: Custom _compute_sin_cos_embeddings fails on meta device because it calls .cos() on meta tensors. We wrap it to catch NotImplementedError.
2. Weight re-initialization: OpenELM’s _init_weights re-randomizes ALL weights after they’ve been loaded from safetensors, because transformers v5’s _finalize_load_state_dict calls initialize_weights() on modules lacking the _is_hf_initialized flag. We patch _init_weights to skip real (non-meta) tensors.
- Parameters:
model_name – The HuggingFace model name/path
model_kwargs – The kwargs dict for from_pretrained()
- prepare_model(hf_model: Any) None¶
Post-load fixes for non-persistent buffers zeroed during meta materialization.
Transformers v5 creates models on meta device then materializes weights from checkpoint. Non-persistent buffers (registered with persistent=False) are NOT in the checkpoint, so they materialize as zeros. OpenELM has two critical non-persistent buffers that must be recomputed:
RoPE inv_freq — zeroed inv_freq produces cos=1, sin=0 for all positions, destroying positional information entirely.
causal_mask — zeroed mask means no causal masking, allowing all positions to attend to future tokens. Single forward passes appear correct (no future tokens to leak) but autoregressive generation degenerates immediately.
We also create a synthetic lm_head for weight-tied models.
Note: We intentionally do NOT restore the original _compute_sin_cos_embeddings. The safe_compute wrapper is functionally equivalent for real (non-meta) tensors, and keeping it avoids issues when multiple models are loaded in the same process (e.g., benchmark suite loading both HF reference and bridge models).
- Parameters:
hf_model – The loaded HuggingFace OpenELM model
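To illustrate why a zeroed inv_freq buffer destroys positional information, here is the standard RoPE inverse-frequency computation; the base and head dimension are illustrative, as OpenELM’s actual values come from its config:

```python
import torch

def rope_inv_freq(d_head: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies: base^(-2i/d_head) for i = 0..d_head/2.

    Illustrative sketch of what must be recomputed when the non-persistent
    buffer materializes as zeros during meta-device loading.
    """
    return 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))

# With a zeroed inv_freq, every rotation angle is 0, so cos=1 and sin=0 at
# every position: RoPE becomes the identity and positions are indistinguishable.
zeroed_angles = torch.outer(torch.arange(8).float(), torch.zeros(4))
```
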
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up references for OpenELM component testing.
- Parameters:
hf_model – The HuggingFace OpenELM model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.OptArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for OPT models.
- __init__(cfg: Any) None¶
Initialize the OPT architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.Phi3ArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Phi-3 models.
- __init__(cfg: Any) None¶
Initialize the Phi-3 architecture adapter.
- Parameters:
cfg – The configuration object.
- prepare_loading(model_name: str, model_kwargs: dict) None¶
Patch cached Phi-3 remote code for transformers v5 compatibility.
- preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]¶
Fold layer norms into joint QKV/gate_up projections.
Standard fold_ln can’t handle joint projections (shape mismatch on round-trip), so we scale the full joint weights directly.
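Scaling the full joint weights directly can be sketched as follows. The helper is hypothetical, not the adapter’s actual function; it relies on the identity (W * g) x = W (g * x), i.e. scaling each input column of the joint weight by the norm’s gain is equivalent to applying the norm gain before the projection:

```python
import torch

def fold_norm_into_joint(W_joint: torch.Tensor, norm_w: torch.Tensor):
    """Fold a norm gain into a joint projection (illustrative sketch).

    W_joint: e.g. a joint QKV weight of shape [3 * n_heads * d_head, d_model]
    norm_w:  the preceding norm's gain, shape [d_model]
    Returns the folded weight and the replacement (all-ones) norm gain.
    """
    return W_joint * norm_w[None, :], torch.ones_like(norm_w)
```
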
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Phi-3 component testing.
- Parameters:
hf_model – The HuggingFace Phi-3 model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.PhiArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Phi models.
- __init__(cfg: Any) None¶
Initialize the Phi architecture adapter.
- Parameters:
cfg – The configuration object.
- default_cfg: dict[str, Any] = {'use_fast': False}¶
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Phi component testing.
Phi uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Phi model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- class transformer_lens.model_bridge.supported_architectures.PythiaArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Pythia models.
- __init__(cfg: Any) None¶
Initialize the Pythia architecture adapter.
- Parameters:
cfg – The configuration object.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Pythia component testing.
Pythia uses RoPE (Rotary Position Embeddings) in the GPT-NeoX architecture. We need to set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Pythia model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]¶
Split the QKV matrix into separate linear transformations.
GPT-NeoX/Pythia uses an interleaved QKV format where the weights are stored as [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, …] - i.e., Q, K, V are interleaved per head.
The weight shape is [n_heads * 3 * d_head, d_model] and the output is reshaped by HuggingFace as [batch, seq, n_heads, 3*d_head] then split on the last dim.
- Parameters:
original_attention_component – The original attention layer component
- Returns:
Tuple of nn.Linear modules for Q, K, and V transformations
- class transformer_lens.model_bridge.supported_architectures.Qwen2ArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Qwen2 models.
Optional Parameters (may not exist in state_dict):¶
Qwen2 models do NOT have biases on any linear layers:
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.mlp.b_in - No bias on MLP input (up_proj)
blocks.{i}.mlp.b_gate - No bias on MLP gate projection
blocks.{i}.mlp.b_out - No bias on MLP output (down_proj)
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
Weight processing must handle these missing biases gracefully using ProcessWeights._safe_get_tensor() or by checking for None values.
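A hedged sketch of such graceful handling; safe_get_tensor here is a hypothetical stand-in for ProcessWeights._safe_get_tensor, and the key names follow the listing above:

```python
import torch

def safe_get_tensor(state_dict, key, default_shape=None):
    """Return state_dict[key], or a zero tensor / None when absent.

    Illustrative sketch: checkpoints like Qwen2 omit every bias, so
    weight-processing code must tolerate missing keys.
    """
    if key in state_dict:
        return state_dict[key]
    if default_shape is not None:
        return torch.zeros(default_shape)
    return None

# Example: the query bias is absent, so a zero bias of the right shape
# (or None) is returned instead of raising KeyError.
sd = {"blocks.0.attn.W_Q": torch.randn(4, 8)}
b_q = safe_get_tensor(sd, "blocks.0.attn.b_Q", default_shape=(8,))
```
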
- __init__(cfg: Any) None¶
Initialize the Qwen2 architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Qwen2 component testing.
Qwen2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Qwen2 model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
- class transformer_lens.model_bridge.supported_architectures.Qwen3ArchitectureAdapter(cfg: Any, *, hybrid: bool = False)¶
Bases: ArchitectureAdapter
Architecture adapter for Qwen3 dense models.
RMSNorm, RoPE, GQA, Q/K head norms, gated MLP. No biases. Serves as base class for Qwen3.5 and Qwen3Next hybrid variants.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set eager attn on HF model and rotary_emb on attention bridges.
- class transformer_lens.model_bridge.supported_architectures.Qwen3MoeArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Qwen3MoE (Mixture of Experts) models.
Qwen3MoE is a sparse MoE decoder-only Transformer, structurally close to OLMoE. Key features:
Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.
Q/K normalization: RMSNorm applied to queries and keys after projection.
Sparse MoE: 128 experts with top-8 routing (public 30B-A3B checkpoints).
Batched expert parameters: gate_up_proj and down_proj as single 3D tensors, not a ModuleList.
final_rms=True (Qwen3-style; OLMoE uses False).
No biases on any projections.
GQA: n_key_value_heads < n_heads in all public checkpoints.
Only the all-MoE configuration is supported (decoder_sparse_step=1, mlp_only_layers=[]). Models with dense fallback layers cannot be wrapped because MoEBridge does not handle the dense Qwen3MoeMLP path.
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
- __init__(cfg: Any) None¶
Initialize the Qwen3MoE architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for Qwen3MoE component testing.
Qwen3MoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Qwen3MoE model instance
bridge_model – The TransformerBridge model (if available)
- class transformer_lens.model_bridge.supported_architectures.Qwen3NextArchitectureAdapter(cfg: Any)¶
Bases: Qwen3ArchitectureAdapter
Hybrid linear-attention + full-attention with sparse MoE MLP.
Same hybrid design as Qwen3.5 but with MoE instead of dense MLP.
- preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]¶
Slice query half from gated q_proj.weight for weight-space analysis.
- class transformer_lens.model_bridge.supported_architectures.Qwen3_5ArchitectureAdapter(cfg: Any)¶
Bases: Qwen3ArchitectureAdapter
Hybrid linear-attention + full-attention with dense gated MLP.
Inherits Qwen3 config/attention/MLP structure. Differences:
Attention + linear_attn are optional (per-layer type)
Gated q_proj (2x wide) sliced by preprocess_weights for weight analysis
- prepare_loading(model_name: str, model_kwargs: dict) None¶
Swap multimodal Qwen3_5Config for text-only Qwen3_5TextConfig.
Published checkpoints carry architectures=[‘Qwen3_5ForConditionalGeneration’]. We replace config with text_config so AutoModelForCausalLM loads the text-only Qwen3_5ForCausalLM.
- preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]¶
Slice query half from gated q_proj.weight for weight-space analysis.
In processed mode, W_Q is the pure query projection (for composition scores, logit lens). Gate signal available in unprocessed mode on full-attention layers via blocks.N.attn.hook_q_gate.
- class transformer_lens.model_bridge.supported_architectures.QwenArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Qwen models.
- __init__(cfg: Any) None¶
Initialize the Qwen architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.StableLmArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for StableLM models.
StableLM uses a Llama-like architecture with separate Q/K/V projections and gated MLP, but differs in using standard LayerNorm (not RMSNorm) and partial rotary embeddings (25% of head dimensions by default).
Supports optional features:
Grouped Query Attention (num_key_value_heads != num_attention_heads)
QKV bias (use_qkv_bias=True on some models like stable-code-3b)
Parallel residual connections (use_parallel_residual=True)
Per-head QK LayerNorm (qk_layernorm=True)
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - Only present when use_qkv_bias=True
blocks.{i}.attn.b_K - Only present when use_qkv_bias=True
blocks.{i}.attn.b_V - Only present when use_qkv_bias=True
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.mlp.b_in - No bias on MLP up_proj
blocks.{i}.mlp.b_gate - No bias on MLP gate_proj
blocks.{i}.mlp.b_out - No bias on MLP down_proj
- __init__(cfg: Any) None¶
Initialize the StableLM architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) None¶
Set up rotary embedding references for StableLM component testing.
StableLM uses RoPE (Rotary Position Embeddings) with partial rotation. We set the rotary_emb reference on all attention bridge instances and force eager attention for numerical consistency.
- Parameters:
hf_model – The HuggingFace StableLM model instance
bridge_model – The TransformerBridge model (if available)
- setup_hook_compatibility(bridge: Any) None¶
Inject hook points for QK LayerNorm on models with qk_layernorm=True.
StableLM v2 models (e.g., stablelm-2-12b) apply per-head LayerNorm to Q and K after projection but before rotary embedding. The native HF attention handles this internally, but we inject hooks so researchers can observe/intervene on the post-norm Q/K values.
- Adds to each attention bridge:
hook_q_layernorm: fires after q_layernorm(query_states)
hook_k_layernorm: fires after k_layernorm(key_states)
This runs during bridge __init__ via _setup_hook_compatibility(), after component setup but before hook registry finalization. The hook registry scanner skips _original_component subtrees, so we register hooks directly in bridge._hook_registry with canonical TL-style names.
- Parameters:
bridge – The TransformerBridge instance (fully initialized)
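Conceptually, the injected hook points behave like the following sketch. The HookPoint class and the way hooks are attached here are simplified illustrations, not the bridge’s actual registry mechanics:

```python
import torch

class HookPoint(torch.nn.Module):
    """Minimal observe/intervene point, sketched after TransformerLens-style
    hooks: each registered function sees the tensor and may replace it."""
    def __init__(self):
        super().__init__()
        self.fns = []

    def forward(self, x):
        for fn in self.fns:
            out = fn(x)
            if out is not None:
                x = out
        return x

# Illustrative wiring: fire a hook on the post-LayerNorm query states.
hook_q_layernorm = HookPoint()
captured = {}
hook_q_layernorm.fns.append(lambda t: captured.setdefault("q", t))

q_layernorm = torch.nn.LayerNorm(4)
query_states = torch.randn(2, 4)
q = hook_q_layernorm(q_layernorm(query_states))  # fires after q_layernorm
```
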
- class transformer_lens.model_bridge.supported_architectures.T5ArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for T5 models.
T5 is an encoder-decoder model with:
Shared embeddings
Encoder stack (self-attention + FFN)
Decoder stack (self-attention + cross-attention + FFN)
Language modeling head
Supports both standard T5 (DenseReluDense with wi/wo) and gated variants like Flan-T5 (T5DenseGatedActDense with wi_0/wi_1/wo).
- __init__(cfg: Any) None¶
Initialize the T5 architecture adapter.
- Parameters:
cfg – The configuration object.
- class transformer_lens.model_bridge.supported_architectures.XGLMArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for XGLM models.
XGLM uses pre-norm LayerNorm, sinusoidal positional embeddings (no learnable weights), standard MHA with separate q/k/v/out_proj, and a 2-layer MLP (fc1/fc2) that lives directly on the decoder block rather than inside an mlp sub-module.
All attention projections and fc1/fc2 carry biases. lm_head has no bias. Embeddings are scaled by sqrt(d_model) at runtime in XGLMScaledWordEmbedding.
Optional Parameters (may not exist in state_dict):¶
None — all published XGLM checkpoints include all parameters listed above.
- __init__(cfg: Any) None¶
Initialize the XGLM architecture adapter.
- setup_hook_compatibility(bridge: Any) None¶
Scale hook_embed by sqrt(d_model) to match XGLMScaledWordEmbedding.forward().
XGLMScaledWordEmbedding multiplies the embedding lookup by embed_scale = sqrt(d_model) at runtime. Without this override, hook_embed would capture the raw (unscaled) table output, diverging from actual model activations.
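A sketch of the scaling the override applies. The hook signature is TransformerLens-style and the d_model value is illustrative; only the multiply-by-sqrt(d_model) step is taken from the description above:

```python
import math
import torch

d_model = 1024  # illustrative; the real value comes from the model config
embed_scale = math.sqrt(d_model)

def scale_embed_hook(embeddings: torch.Tensor, hook=None) -> torch.Tensor:
    """Scale the raw embedding-table output by sqrt(d_model), matching
    XGLMScaledWordEmbedding.forward(), so hook_embed reflects the
    activations the model actually computes with."""
    return embeddings * embed_scale

raw = torch.randn(1, 3, d_model)   # unscaled table lookup
scaled = scale_embed_hook(raw)     # what hook_embed should capture
```
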