transformer_lens.model_bridge.supported_architectures package

Submodules

Module contents

Supported architecture adapters.

This module contains the architecture adapters for all supported model architectures.

class transformer_lens.model_bridge.supported_architectures.ApertusArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Apertus models.

Apertus uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE with LLaMA-3 scaling), grouped query attention (GQA), non-gated MLP (XiELU activation), and no biases on any projections.

Similar to Qwen3 (pre-norm RMSNorm, QK-norm, GQA, RoPE) but uses a non-gated MLP (up_proj -> XiELU -> down_proj) instead of gated MLP.

Note: Apertus uses different layer norm names than most Llama-family models:

  • attention_layernorm (instead of input_layernorm)

  • feedforward_layernorm (instead of post_attention_layernorm)

__init__(cfg: Any) None

Initialize the Apertus architecture adapter.

prepare_loading(model_name: str, model_kwargs: dict) None

Patch XIELUActivation to defer eager .item() calls for meta tensor compat.

Transformers v5 uses meta tensors during from_pretrained, but XIELUActivation.__init__ eagerly calls .item() on beta/eps buffers to precompute _beta_scalar/_eps_scalar for the CUDA kernel path. This fails on meta device. Once upstream fixes this (transformers PR #43473), this patch can be removed.

Instead of reimplementing __init__, we wrap it to catch the meta tensor failure and defer scalar computation to forward() time.
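The wrap-and-defer pattern described above can be sketched on a toy class. Everything here is an illustrative stand-in, not the real transformers XIELUActivation or the adapter's actual patch:

```python
class EagerActivation:
    """Toy stand-in for an activation that eagerly computes a scalar in
    __init__, like the .item() call described above."""

    def __init__(self, beta):
        self.beta = beta
        if beta is None:  # stand-in for a meta tensor with no real data
            raise RuntimeError("cannot materialize scalar from meta tensor")
        self._beta_scalar = float(beta)

    def forward(self, x):
        return x * self._beta_scalar


def patch_defer_init(cls):
    """Wrap __init__ to catch the eager failure and defer the scalar
    computation to forward() time, instead of reimplementing __init__."""
    original_init, original_forward = cls.__init__, cls.forward

    def wrapped_init(self, beta):
        try:
            original_init(self, beta)
        except RuntimeError:
            self.beta = beta          # keep the (meta) value around
            self._beta_scalar = None  # defer until real data exists

    def wrapped_forward(self, x):
        if self._beta_scalar is None:
            self._beta_scalar = float(self.beta)  # now safe to materialize
        return original_forward(self, x)

    cls.__init__, cls.forward = wrapped_init, wrapped_forward


patch_defer_init(EagerActivation)
act = EagerActivation(None)  # construction no longer raises
act.beta = 3.0               # real value arrives later, e.g. after weight loading
result = act.forward(2.0)
```

Wrapping rather than reimplementing keeps the patch robust to upstream changes in the wrapped __init__.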

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Apertus component testing.

Apertus uses RoPE (Rotary Position Embeddings). We set the rotary_emb on all attention bridge instances for component testing.

We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks.

Parameters:
  • hf_model – The HuggingFace Apertus model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.BertArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for BERT models.

__init__(cfg: Any) None

Initialize the BERT architecture adapter.

Parameters:

cfg – The configuration object.

prepare_model(hf_model: Any) None

Adjust component mapping based on the actual HF model variant.

BertForMaskedLM has cls.predictions (MLM head). BertForNextSentencePrediction has cls.seq_relationship (NSP head) and no MLM-specific LayerNorm.

class transformer_lens.model_bridge.supported_architectures.BloomArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Bloom models.

__init__(cfg: Any) None

Initialize the Bloom architecture adapter.

split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]

Split the QKV matrix into separate linear transformations.

Parameters:

original_attention_component – The original attention layer component

Returns:

Tuple of nn.Linear modules for Q, K, and V transformations

class transformer_lens.model_bridge.supported_architectures.CodeGenArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for CodeGen models.

CodeGen uses a parallel attention+MLP block (attn and MLP share the same LayerNorm input and their outputs are summed). The attention layer uses a fused qkv_proj weight whose layout follows GPT-J’s mp_num=4 tensor-parallel partitioning: the rows are interleaved as [Q_part, V_part, K_part] within each of the 4 MP partitions.

Optional Parameters (may be absent in some CodeGen checkpoints):

  • No bias on qkv_proj (fused QKV has no bias)

  • No bias on out_proj

  • No bias on mlp.fc_in or mlp.fc_out

__init__(cfg: Any) None

Initialize the CodeGen architecture adapter.

split_qkv_matrix(attn_component: Any) tuple[Linear, Linear, Linear]

Split the fused QKV weight into separate Q, K, V linear modules.

CodeGen uses GPT-J-style tensor-parallel partitioning with mp_num=4 partitions. Within each partition the row order is [Q_part, V_part, K_part], i.e. not the conventional Q/K/V order.

The fused weight has shape [3 * n_embd, n_embd]. We reshape to [mp_num, 3, local_dim, n_embd], extract the three slices, then flatten back to [n_embd, n_embd] for each of Q, K, V.

Parameters:

attn_component – The original CodeGenAttention module.

Returns:

Tuple of (q_linear, k_linear, v_linear) — three nn.Linear modules with no bias and weight shape [n_embd, n_embd].
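The reshape described above can be illustrated on a toy fused weight (toy sizes; the real adapter operates on the checkpoint's qkv_proj):

```python
import torch

# Toy dimensions; real CodeGen uses a much larger n_embd but mp_num=4.
n_embd, mp_num = 8, 4
local_dim = n_embd // mp_num  # rows per Q/V/K slice within one partition

fused = torch.arange(3 * n_embd * n_embd, dtype=torch.float32)
fused = fused.reshape(3 * n_embd, n_embd)  # fused qkv_proj weight

# [mp_num, 3, local_dim, n_embd]: axis 1 indexes the per-partition
# [Q_part, V_part, K_part] order (note: V before K).
parts = fused.reshape(mp_num, 3, local_dim, n_embd)
w_q = parts[:, 0].reshape(n_embd, n_embd)
w_v = parts[:, 1].reshape(n_embd, n_embd)
w_k = parts[:, 2].reshape(n_embd, n_embd)
```

Within the first partition, rows 0..local_dim-1 of the fused weight land in w_q, the next local_dim rows in w_v, and the next in w_k.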

class transformer_lens.model_bridge.supported_architectures.CohereArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Cohere models (CohereForCausalLM).

Architectural quirks vs. standard decoder-only models:

  • Single input_layernorm per block; NO post_attention_layernorm. Attention and MLP both read the SAME normed hidden states (parallel).

  • CohereLayerNorm is true LayerNorm (mean-subtracting), NOT RMSNorm. It has a weight parameter but NO bias parameter.

  • Logit scale: CohereForCausalLM.forward multiplies logits by logit_scale (default 0.0625 = 1/16). Folded into unembed.weight via preprocess_weights.

  • Rotary embeddings use repeat_interleave instead of cat-split (delegated to HF).

Optional parameters (absent from state_dict by default):

  • blocks.{i}.attn.b_Q/b_K/b_V/b_O — no bias on projections (attention_bias=False)

  • blocks.{i}.mlp.b_gate/b_in/b_out — no bias on MLP projections

  • blocks.{i}.ln1.b — CohereLayerNorm has no bias

  • ln_final.b — CohereLayerNorm has no bias

__init__(cfg: Any) None

Initialize the Cohere architecture adapter.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Fold logit_scale into unembed weights before ProcessWeights runs.

bridge.py lines 726-732 clone unembed.weight before calling this, so scaling does not affect the tied embed.weight. logit_scale=1.0 is a no-op (skipped for efficiency).
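The fold is equivalent to scaling the logits after the matmul, which a quick check confirms (toy shapes and random weights, not the adapter's actual code):

```python
import torch

# Default logit_scale mentioned above: 0.0625 = 1/16.
logit_scale = 0.0625
d_model, d_vocab = 4, 6

torch.manual_seed(0)
resid = torch.randn(2, d_model)       # final residual stream
W_U = torch.randn(d_model, d_vocab)   # unembed weight

# HF applies the scale to the logits; the adapter folds it into W_U.
logits_hf_style = (resid @ W_U) * logit_scale
logits_folded = resid @ (W_U * logit_scale)
same = torch.allclose(logits_hf_style, logits_folded)
```

Because the fold touches only unembed.weight, cloning it first (as noted above) keeps the tied embed.weight unscaled.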

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set rotary embedding reference on attention bridges for component testing.

CohereRotaryEmbedding lives at hf_model.model.rotary_emb. The bridge delegates to it directly, preserving the repeat_interleave RoPE convention without re-implementing it in TL.

Pattern matches llama.py and qwen2.py.

class transformer_lens.model_bridge.supported_architectures.DeepSeekV3ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for DeepSeek V3 / R1 models.

Uses RMSNorm, MLA with compressed Q/KV projections, partial RoPE, MoE on most layers (dense MLP on first few), and no biases.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for component testing.

class transformer_lens.model_bridge.supported_architectures.FalconArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Falcon models (FalconForCausalLM).

prepare_model(hf_model: Any) None

Patch Falcon modules to avoid backward hook conflicts.

Two issues:

  1. FalconLinear does input @ self.weight.T where .T is a view — clone the transpose to break the view chain.

  2. FalconDecoderLayer does mlp_output += attention_output (inplace) — this modifies a tensor captured by mlp.hook_out’s backward hook. Patch to use non-inplace addition.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for component testing.

class transformer_lens.model_bridge.supported_architectures.GPT2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPT2 models.

Optional Parameters (may not exist in state_dict):

GPT-2 models HAVE biases on ALL linear layers:

✓ blocks.{i}.attn.b_Q - Has bias (from combined c_attn.bias)

✓ blocks.{i}.attn.b_K - Has bias (from combined c_attn.bias)

✓ blocks.{i}.attn.b_V - Has bias (from combined c_attn.bias)

✓ blocks.{i}.attn.b_O - Has bias (c_proj.bias)

✓ blocks.{i}.mlp.b_in - Has bias (c_fc.bias)

✓ blocks.{i}.mlp.b_out - Has bias (c_proj.bias)

✓ blocks.{i}.ln1.b - LayerNorm has bias

✓ blocks.{i}.ln2.b - LayerNorm has bias

✓ ln_final.b - LayerNorm has bias

No optional parameters - all biases exist in GPT-2.
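The "combined c_attn.bias" noted above is a single [3*d_model] vector holding the Q, K, and V biases concatenated in that order; splitting it is a plain chunking (toy d_model for illustration):

```python
import torch

d_model = 6
# GPT-2 stores one fused bias for Q, K, V, concatenated in that order.
c_attn_bias = torch.arange(3 * d_model, dtype=torch.float32)
b_Q, b_K, b_V = c_attn_bias.split(d_model)
```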

__init__(cfg: Any) None

Initialize the GPT2 architecture adapter.

class transformer_lens.model_bridge.supported_architectures.GPTBigCodeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPTBigCode models.

GPTBigCode is a GPT-2 variant using Multi-Query Attention (MQA): a single fused c_attn projection whose output splits asymmetrically into [embed_dim, head_dim, head_dim] for Q/K/V (rather than three equal thirds). All other structure (module paths, LayerNorm, learned pos embeddings, standard MLP) is identical to GPT-2.

All public models use multi_query=True (1 KV head). The adapter assumes MQA throughout.

All linear layers have biases (c_attn, c_proj, c_fc, mlp.c_proj). lm_head has no bias and its weight is tied to transformer.wte.weight.

Weight layout difference from GPT-2: GPTBigCode uses nn.Linear (weights stored [out, in]) rather than GPT-2’s Conv1D ([in, out]), so no unembed weight transpose is needed.
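The asymmetric MQA split described above can be sketched on a toy fused activation (toy sizes; variable names are illustrative):

```python
import torch

embed_dim, n_heads = 8, 4
head_dim = embed_dim // n_heads  # 2

# Fused c_attn output: full-width Q, then one K head, then one V head
# (rather than three equal thirds as in GPT-2).
fused = torch.randn(1, 3, embed_dim + 2 * head_dim)  # [batch, seq, ...]
q, k, v = fused.split([embed_dim, head_dim, head_dim], dim=-1)
```

All query heads share the single K/V head, which is what makes the split asymmetric.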

class transformer_lens.model_bridge.supported_architectures.GPTOSSArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPT-OSS model.

__init__(cfg: Any) None

Initialize the GPT-OSS architecture adapter.

setup_hook_compatibility(bridge_model: Any) None

Setup hook compatibility transformations for GPT-OSS models.

This configures rotary embedding references for attention layers, which is needed for models using RoPE (Rotary Position Embeddings).

This is called during Bridge.__init__ and should always be run.

Parameters:

bridge_model – The TransformerBridge instance

setup_no_processing_hooks(bridge_model: Any) None

Backward compatibility alias for setup_hook_compatibility.

class transformer_lens.model_bridge.supported_architectures.Gemma1ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma1 models.

__init__(cfg: Any) None

Initialize the Gemma1 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Gemma1 component testing.

Gemma1 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Gemma1 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma1 models.

Gemma1 scales embeddings by sqrt(d_model) in its forward pass, but the HuggingFace embed_tokens layer doesn’t include this scaling. We need to apply it to hook_embed to match HookedTransformer behavior.

Parameters:

bridge – The TransformerBridge instance
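A minimal sketch of applying the sqrt(d_model) scaling via a forward hook, in plain PyTorch (the bridge's actual hook machinery differs; this only shows the scaling itself):

```python
import math

import torch

d_model = 16
embed = torch.nn.Embedding(10, d_model)

def scale_embeddings(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module output,
    # reproducing the sqrt(d_model) scaling Gemma1 applies after embedding.
    return output * math.sqrt(d_model)

embed.register_forward_hook(scale_embeddings)

tokens = torch.tensor([[1, 2, 3]])
scaled = embed(tokens)
```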

class transformer_lens.model_bridge.supported_architectures.Gemma2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma2 models.

__init__(cfg: Any) None

Initialize the Gemma2 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references and attention implementation for Gemma-2 component testing.

Gemma-2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks, while HF defaults to “sdpa”. These produce mathematically equivalent results but with small numerical differences due to different implementations.

Parameters:
  • hf_model – The HuggingFace Gemma-2 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma2 models.

Gemma2 scales embeddings by sqrt(d_model). The weights are pre-scaled via preprocess_weights(), but we still need to apply the scaling conversion to the hook output for proper hook functionality (so user modifications are correctly scaled/unscaled).

Parameters:

bridge – The TransformerBridge instance

class transformer_lens.model_bridge.supported_architectures.Gemma3ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma3 models.

__init__(cfg: Any) None

Initialize the Gemma3 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references and native autograd for Gemma-3 component testing.

Gemma-3 uses dual RoPE (global + local). We set local RoPE (used by 85% of layers) on all attention bridge instances for component testing.

We also enable use_native_layernorm_autograd on all normalization bridges to ensure they delegate to HuggingFace’s exact implementation instead of using manual computation.

Additionally, we force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks, while HF defaults to “sdpa”. These produce mathematically equivalent results but with small numerical differences due to different implementations.

Note: Layers 5, 11, 17, 23 use global RoPE but will use local in component tests. This is an acceptable tradeoff given the shared-instance constraint.

Parameters:
  • hf_model – The HuggingFace Gemma-3 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma3 models.

Unlike Gemma1/Gemma2, Gemma3 uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) INSIDE the embedding layer’s forward(). Therefore we do NOT need a hook_conversion — the embed.hook_out already captures the scaled output. Adding a conversion would double-scale.

(Gemma1/Gemma2 scale in GemmaModel.forward() AFTER the embedding layer, so their adapters correctly use EmbeddingScaleConversion to match HT.)

Parameters:

bridge – The TransformerBridge instance

class transformer_lens.model_bridge.supported_architectures.Gemma3MultimodalArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma3 multimodal models (Gemma3ForConditionalGeneration).

This adapter handles vision-language models like Gemma 3 4B/12B/27B and MedGemma. The model structure is:

  • model.vision_tower: SigLIP vision encoder

  • model.multi_modal_projector: Projects vision embeddings to language space

  • model.language_model: Gemma3TextModel (same as text-only Gemma 3)

  • lm_head: Output projection

The language model component follows the same patterns as Gemma3ArchitectureAdapter.

__init__(cfg: Any) None

Initialize the Gemma3 multimodal architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Gemma-3 multimodal component testing.

The language model uses dual RoPE (global + local) like text-only Gemma 3.

Parameters:
  • hf_model – The HuggingFace Gemma-3 multimodal model instance

  • bridge_model – The TransformerBridge model (if available)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma3 multimodal models.

Like text-only Gemma 3, the multimodal model uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) internally in its forward() method. No additional hook conversion is needed — adding one would double-scale the embeddings.

Parameters:

bridge – The TransformerBridge instance

class transformer_lens.model_bridge.supported_architectures.Gpt2LmHeadCustomArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPT-2 LM Head Custom models.

__init__(cfg: Any) None

Initialize the GPT-2 LM Head Custom architecture adapter.

class transformer_lens.model_bridge.supported_architectures.GptjArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPTJ models.

__init__(cfg: Any) None

Initialize the GPTJ architecture adapter.

class transformer_lens.model_bridge.supported_architectures.GraniteArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for IBM Granite models (dense).

Granite is a Llama-like architecture with RMSNorm, rotary position embeddings (RoPE), GQA, and a gated MLP (SiLU activation). Granite-specific scaling multipliers are handled by the HF model’s native forward pass.

Optional Parameters (may not exist in state_dict):

Granite models do NOT have biases on attention and MLP projections:

  • blocks.{i}.attn.b_Q/b_K/b_V/b_O - No bias on attention projections

  • blocks.{i}.mlp.b_in/b_gate/b_out - No bias on MLP projections

  • blocks.{i}.ln1.b, blocks.{i}.ln2.b, ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the Granite architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Granite component testing.

Parameters:
  • hf_model – The HuggingFace Granite model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.GraniteMoeArchitectureAdapter(cfg: Any)

Bases: GraniteArchitectureAdapter

Architecture adapter for IBM Granite MoE models.

Identical to dense Granite but replaces the gated MLP with a Sparse Mixture of Experts block (block_sparse_moe) using batched expert parameters and top-k routing.

class transformer_lens.model_bridge.supported_architectures.GraniteMoeHybridArchitectureAdapter(cfg: Any)

Bases: GraniteArchitectureAdapter

Hybrid Mamba2 + Attention with Sparse MoE.

Attention is optional (absent on Mamba layers). shared_mlp and MoE are universal. Inherits Granite config and attention bridge construction.

class transformer_lens.model_bridge.supported_architectures.HubertArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for HuBERT audio models.

HubertForCTC nests HubertModel under a ‘hubert.’ prefix; prepare_model() detects this and adjusts component paths.

prepare_loading(model_name: str, model_kwargs: dict) None

Propagate HuBERT-specific HF config attributes to bridge config.

Prevents silent-default bugs where adapter reads from bridge config but the attribute was never propagated from HF config.

prepare_model(hf_model: Any) None

Detect HubertForCTC (has ‘hubert.’ prefix) and add CTC head.

class transformer_lens.model_bridge.supported_architectures.InternLM2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for InternLM2 models.

InternLM2 uses remote code (trust_remote_code=True) and differs from Llama in:

  • Fused interleaved GQA wqkv weight (not standard [Q|K|V] split)

  • Non-standard module names: tok_embeddings, output, attention, feed_forward, wqkv/wo, w1(gate)/w3(up)/w2(down), attention_norm, ffn_norm

  • Per-layer rotary_emb (no model-level shared instance)

  • supports_fold_ln=False: fold_ln is done manually in preprocess_weights because the bridge state dict has the fused qkv key, not split q/k/v keys, so fold_layer_norm’s extract_attention_tensors_for_folding would silently skip attn.

Optional parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q / b_K / b_V / b_O — config.bias=False on shipped models

  • blocks.{i}.mlp.b_gate / b_in / b_out — MLP always bias=False

  • blocks.{i}.ln1.b / ln2.b / ln_final.b — RMSNorm has no bias

prepare_loading(model_name: str, model_kwargs: dict) None

Patch transformers v5 incompatibilities before from_pretrained runs.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Fold layer norms into QKV and MLP weights.

Standard fold_ln can’t reach split Q/K/V when wqkv is fused in the bridge state dict. We extract and fold here, then write split keys so RearrangeTensorConversion can follow. MLP projections (w1/w2/w3) are separate linears so they fold normally. Mirrors phi3.py.preprocess_weights, adapted for InternLM2’s layout.
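The underlying identity, folding an RMSNorm scale g into a downstream weight W so the norm can be treated as unit-weighted, can be checked directly (generic sketch, not the adapter's actual code):

```python
import torch

torch.manual_seed(0)
d_model, d_out = 4, 6
g = torch.randn(d_model)         # RMSNorm scale
W = torch.randn(d_out, d_model)  # downstream linear weight, [out, in]
x = torch.randn(3, d_model)

rms = x / torch.sqrt((x ** 2).mean(dim=-1, keepdim=True) + 1e-6)

out_unfolded = (rms * g) @ W.T  # scale applied by the norm
out_folded = rms @ (W * g).T    # scale folded into the weight
same = torch.allclose(out_unfolded, out_folded, atol=1e-5)
```

The adapter applies this per projection after splitting the fused wqkv, since standard fold_ln only sees split q/k/v keys.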

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Inject per-layer rotary embedding for component testing.

class transformer_lens.model_bridge.supported_architectures.LlamaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Llama models.

Optional Parameters (may not exist in state_dict):

LLaMA models do NOT have biases on attention and MLP projections:

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP input (up_proj)

  • blocks.{i}.mlp.b_gate - No bias on MLP gate projection

  • blocks.{i}.mlp.b_out - No bias on MLP output (down_proj)

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

Weight processing must handle these missing biases gracefully using ProcessWeights._safe_get_tensor() or by checking for None values.

__init__(cfg: Any) None

Initialize the Llama architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Llama component testing.

Llama uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Llama model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.LlavaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for LLaVA multimodal models (LlavaForConditionalGeneration).

This adapter handles vision-language models like LLaVA 1.5. The model structure is:

  • model.vision_tower: CLIP vision encoder

  • model.multi_modal_projector: 2-layer MLP (Linear -> GELU -> Linear)

  • model.language_model: LlamaForCausalLM

  • model.language_model.model.embed_tokens

  • model.language_model.model.layers[]: LLaMA transformer blocks

  • model.language_model.model.norm

  • model.language_model.lm_head

The language model component follows the same patterns as LlamaArchitectureAdapter.

__init__(cfg: Any) None

Initialize the LLaVA architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for LLaVA component testing.

LLaVA uses a LLaMA language backbone with RoPE. We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace LLaVA model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.LlavaNextArchitectureAdapter(cfg: Any)

Bases: LlavaArchitectureAdapter

Architecture adapter for LLaVA-NeXT (1.6) models.

class transformer_lens.model_bridge.supported_architectures.LlavaOnevisionArchitectureAdapter(cfg: Any)

Bases: LlavaArchitectureAdapter

Architecture adapter for LLaVA-OneVision models.

prepare_model(hf_model: Any) None

Fix weight tying when text_config and top-level config disagree.

Some checkpoints have tie_word_embeddings=True in text_config but False at the top level, leaving lm_head randomly initialized.

class transformer_lens.model_bridge.supported_architectures.MPTArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

MPT adapter: ALiBi bias; all layers bias-free (no b_Q/b_K/b_V/b_O/b_in/b_out/ln bias).

class transformer_lens.model_bridge.supported_architectures.Mamba2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Wraps HF’s Mamba2ForCausalLM.

Differs from Mamba-1 at the mixer level: fused in_proj (no x_proj/dt_proj), two-input inner norm, multi-head structure with num_heads/head_dim/n_groups, and a [num_heads]-shaped dt_bias. Shares SSMBlockBridge, DepthwiseConv1DBridge, and the stateful generation loop with Mamba-1.

applicable_phases: list[int] = []

create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: dtype) Any

Build a Mamba2Cache for the stateful generation loop.

class transformer_lens.model_bridge.supported_architectures.MambaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Wraps HF’s MambaForCausalLM. No attention, no positional embeddings.

SSM config fields (state_size, conv_kernel, expand, time_step_rank, intermediate_size) are propagated from the HF config via _HF_PASSTHROUGH_ATTRS in sources/transformers.py.

applicable_phases: list[int] = []

create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: dtype) Any

Build a MambaCache for the stateful generation loop.

class transformer_lens.model_bridge.supported_architectures.MingptArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for MinGPT models.

__init__(cfg: Any) None

Initialize the MinGPT architecture adapter.

Parameters:

cfg – The configuration object.

class transformer_lens.model_bridge.supported_architectures.MistralArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Mistral models.

__init__(cfg: Any) None

Initialize the Mistral architecture adapter.

class transformer_lens.model_bridge.supported_architectures.MixtralArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Mixtral models.

Mixtral uses a pre-norm architecture with RMSNorm, rotary position embeddings (RoPE), and a Sparse Mixture of Experts MLP. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Rotary embeddings: stored at model.rotary_emb and passed per-forward-call.

  • Sparse MoE: batched expert parameters (gate_up_proj, down_proj as 3D tensors).

  • MixtralAttention.forward() requires position_embeddings and attention_mask args.

  • Optional GQA (n_key_value_heads may differ from n_heads).

__init__(cfg: Any) None

Initialize the Mixtral architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Mixtral component testing.

Mixtral uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Mixtral model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.NanogptArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for NanoGPT models.

__init__(cfg: Any) None

Initialize the NanoGPT architecture adapter.

Parameters:

cfg – The configuration object.

convert_weights(remote_module: Any) dict[str, Tensor]

class transformer_lens.model_bridge.supported_architectures.NeelSoluOldArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Neel’s SOLU models (old style).

__init__(cfg: Any) None

Initialize the Neel SOLU old-style architecture adapter.

Parameters:

cfg – The configuration object.

class transformer_lens.model_bridge.supported_architectures.NeoArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Neo models.

__init__(cfg: Any) None

Initialize the Neo architecture adapter.

class transformer_lens.model_bridge.supported_architectures.NeoxArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for NeoX models.

__init__(cfg: Any) None

Initialize the NeoX architecture adapter.

Parameters:

cfg – The configuration object.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for GPT-NeoX/StableLM component testing.

GPT-NeoX models use RoPE (Rotary Position Embeddings) which need to be set on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace GPT-NeoX model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]

Split the QKV matrix into separate linear transformations.

GPT-NeoX/StableLM uses an interleaved QKV format where the weights are stored as [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, …] - i.e., Q, K, V are interleaved per head.

The weight shape is [n_heads * 3 * d_head, d_model] and the output is reshaped by HuggingFace as [batch, seq, n_heads, 3*d_head] then split on the last dim.

Parameters:

original_attention_component – The original attention layer component

Returns:

Tuple of nn.Linear modules for Q, K, and V transformations
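The per-head interleaving can be illustrated with a toy fused weight (toy sizes; the real method builds nn.Linear modules from the checkpoint weight):

```python
import torch

n_heads, d_head, d_model = 4, 2, 8

# Fused weight rows: [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, ...]
W_qkv = torch.arange(n_heads * 3 * d_head * d_model, dtype=torch.float32)
W_qkv = W_qkv.reshape(n_heads * 3 * d_head, d_model)

# [n_heads, 3, d_head, d_model]: axis 1 indexes Q/K/V within each head.
per_head = W_qkv.reshape(n_heads, 3, d_head, d_model)
W_Q = per_head[:, 0].reshape(n_heads * d_head, d_model)
W_K = per_head[:, 1].reshape(n_heads * d_head, d_model)
W_V = per_head[:, 2].reshape(n_heads * d_head, d_model)
```

Head h's query rows sit at offset 3*h*d_head in the fused weight, its key rows d_head further on, and its value rows d_head after that.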

class transformer_lens.model_bridge.supported_architectures.Olmo2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMo 2 models.

OLMo 2 uses a post-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and gated MLP (SwiGLU). Key differences from pre-norm models like Llama:

  • Post-norm: RMSNorm is applied AFTER attention and AFTER MLP, not before. ln1 maps to post_attention_layernorm, ln2 maps to post_feedforward_layernorm.

  • Q/K normalization: Per-head RMSNorm applied to queries and keys after projection.

  • No biases on any projections.

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP up_proj

  • blocks.{i}.mlp.b_gate - No bias on MLP gate_proj

  • blocks.{i}.mlp.b_out - No bias on MLP down_proj

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the OLMo 2 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for OLMo 2 component testing.

OLMo 2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks.

Parameters:
  • hf_model – The HuggingFace OLMo 2 model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.Olmo3ArchitectureAdapter(cfg: Any)

Bases: Olmo2ArchitectureAdapter

Architecture adapter for OLMo 3 / OLMo 3.1 models.

OLMo 3 is architecturally identical to OLMo 2 at the weight and component level. The only difference is sliding window attention on some layers (configurable via layer_types), which is handled by the HF model’s forward pass (mask creation) and does not affect weight structure or component mapping.

class transformer_lens.model_bridge.supported_architectures.OlmoArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMo (v1) models.

OLMo v1 uses a pre-norm architecture with a custom non-learnable LayerNorm (fixed weight=1, bias=0), rotary position embeddings (RoPE), and gated MLP (SwiGLU). Key differences from later OLMo variants:

  • Pre-norm: LayerNorm is applied BEFORE attention and BEFORE MLP.

  • Non-learnable LayerNorm: Weight and bias are not trainable parameters. Delegating to HF’s native forward via NormalizationBridge handles this correctly.

  • No Q/K normalization in attention.

  • Optional QKV clipping (handled by HF’s native attention forward).

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP up_proj

  • blocks.{i}.mlp.b_gate - No bias on MLP gate_proj

  • blocks.{i}.mlp.b_out - No bias on MLP down_proj

__init__(cfg: Any) None

Initialize the OLMo architecture adapter.

prepare_model(hf_model: Any) None

Patch OLMo’s in-place clamp_ to avoid backward hook conflicts.

OLMo v1 uses query_states.clamp_() when config.clip_qkv is set. In-place ops on tensors that pass through register_full_backward_hook trigger PyTorch’s “view modified inplace” error. This patch disables the in-place clamp branch during attention forward passes.

Note: clip_qkv clamping is skipped in the patched forward. In practice clip_qkv values (typically 100+) rarely activate. If exact clamping is needed, add out-of-place clamp hooks on hook_q/hook_k/hook_v.
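The out-of-place alternative mentioned in the note can be sketched as a hook function. The hook names and the add_hook call follow TransformerLens conventions and are illustrative, not taken from this adapter:

```python
import torch

def make_clip_hook(clip_qkv: float):
    # torch.clamp returns a new tensor, unlike tensor.clamp_(), so it is
    # safe to combine with register_full_backward_hook.
    def clip_hook(value, hook=None):
        return torch.clamp(value, min=-clip_qkv, max=clip_qkv)
    return clip_hook

# Illustrative usage with the hook names from the note above:
# for name in ("hook_q", "hook_k", "hook_v"):
#     model.add_hook(f"blocks.0.attn.{name}", make_clip_hook(100.0))
clipped = make_clip_hook(8.0)(torch.randn(2, 3, 4) * 100)
```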

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for OLMo component testing.

OLMo uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace OLMo model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.OlmoeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMoE (Mixture of Experts) models.

OLMoE uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and sparse Mixture of Experts MLP. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Q/K normalization: RMSNorm applied to queries and keys after projection.

  • Sparse MoE: 64 experts with top-8 routing (configurable).

  • Batched expert parameters: gate_up_proj [num_experts, 2*d_mlp, d_model] and down_proj [num_experts, d_model, d_mlp] as single tensors, not a ModuleList.

  • Optional QKV clipping (handled by HF’s native attention forward).

  • No biases on any projections.
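The batched-expert layout can be exercised with a small top-k routing sketch. The gate/up ordering, SiLU activation, and renormalization of the top-k probabilities are assumptions here; check the actual OLMoE configuration:

```python
import torch
import torch.nn.functional as F

def batched_moe_forward(x, router_w, gate_up_proj, down_proj, top_k=2):
    # x: [tokens, d_model]            router_w: [num_experts, d_model]
    # gate_up_proj: [num_experts, 2*d_mlp, d_model]
    # down_proj:    [num_experts, d_model, d_mlp]
    probs = F.softmax(x @ router_w.T, dim=-1)          # [tokens, num_experts]
    weights, idx = torch.topk(probs, top_k, dim=-1)    # top-k routing
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize (assumed)
    out = torch.zeros_like(x)
    for k in range(top_k):
        e = idx[:, k]                                  # expert id per token
        gu = torch.einsum("tmd,td->tm", gate_up_proj[e], x)
        gate, up = gu.chunk(2, dim=-1)                 # gate/up order assumed
        h = F.silu(gate) * up                          # SwiGLU-style activation
        out += weights[:, k : k + 1] * torch.einsum("tdm,tm->td", down_proj[e], h)
    return out

torch.manual_seed(0)
y = batched_moe_forward(
    torch.randn(5, 8), torch.randn(4, 8),
    torch.randn(4, 12, 8), torch.randn(4, 8, 6), top_k=2,
)
```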

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the OLMoE architecture adapter.

prepare_model(hf_model: Any) None

Patch OLMoE’s in-place clamp_ to avoid backward hook conflicts.

Same issue as OLMo v1 — see OlmoArchitectureAdapter.prepare_model.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for OLMoE component testing.

OLMoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace OLMoE model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.OpenElmArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Apple OpenELM models.

OpenELM uses a unique architecture with per-layer varying head counts and FFN dimensions. Key characteristics:

  • Combined QKV projection (qkv_proj) with per-layer varying Q/KV head counts

  • Gated MLP with combined gate+up projection (proj_1) and per-layer FFN sizes

  • RMSNorm normalization

  • Full rotary embeddings (per-layer, not shared)

  • Optional Q/K RMSNorm (normalize_qk_projections=True)

  • Weight tying (share_input_output_layers=True typically)

  • Model root is ‘transformer’ (not ‘model’)

  • Requires trust_remote_code=True (custom HF code)

The native HF attention handles all per-layer dimension variations, RoPE, GQA group repeat, and Q/K normalization internally. The bridge delegates to the native forward for correct computation.

Note: Individual Q/K/V hooks are not available since the model uses a combined QKV projection. Attention-level hooks (hook_attn_in, hook_attn_out) are provided.

__init__(cfg: Any) None

Initialize the OpenELM architecture adapter.

prepare_loading(model_name: str, model_kwargs: dict) None

Patch OpenELM for compatibility with transformers v5.

Two patches are needed:

  1. RotaryEmbedding: Custom _compute_sin_cos_embeddings fails on meta device because it calls .cos() on meta tensors. We wrap it to catch NotImplementedError.

  2. Weight re-initialization: OpenELM’s _init_weights re-randomizes ALL weights after they’ve been loaded from safetensors, because transformers v5’s _finalize_load_state_dict calls initialize_weights() on modules lacking the _is_hf_initialized flag. We patch _init_weights to skip real (non-meta) tensors.

Parameters:
  • model_name – The HuggingFace model name/path

  • model_kwargs – The kwargs dict for from_pretrained()

prepare_model(hf_model: Any) None

Post-load fixes for non-persistent buffers zeroed during meta materialization.

Transformers v5 creates models on meta device then materializes weights from checkpoint. Non-persistent buffers (registered with persistent=False) are NOT in the checkpoint, so they materialize as zeros. OpenELM has two critical non-persistent buffers that must be recomputed:

  1. RoPE inv_freq — zeroed inv_freq produces cos=1, sin=0 for all positions, destroying positional information entirely.

  2. causal_mask — zeroed mask means no causal masking, allowing all positions to attend to future tokens. Single forward passes appear correct (no future tokens to leak) but autoregressive generation degenerates immediately.

We also create a synthetic lm_head for weight-tied models.

Note: We intentionally do NOT restore the original _compute_sin_cos_embeddings. The safe_compute wrapper is functionally equivalent for real (non-meta) tensors, and keeping it avoids issues when multiple models are loaded in the same process (e.g., benchmark suite loading both HF reference and bridge models).

Parameters:

hf_model – The loaded HuggingFace OpenELM model
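The two recomputations can be sketched generically (exact shapes, dtypes, and base values in OpenELM may differ; this only illustrates why zeroed buffers are destructive):

```python
import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies. A zeroed buffer would instead give
    # cos=1 / sin=0 at every position, erasing positional information.
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where position j > i, i.e. entries that must be masked out.
    # A zeroed (all-False) buffer masks nothing, so every position could
    # attend to future tokens.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

freqs = rope_inv_freq(8)
mask = causal_mask(4)
```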

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up references for OpenELM component testing.

Parameters:
  • hf_model – The HuggingFace OpenELM model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.OptArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OPT models.

__init__(cfg: Any) None

Initialize the OPT architecture adapter.

class transformer_lens.model_bridge.supported_architectures.Phi3ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Phi-3 models.

__init__(cfg: Any) None

Initialize the Phi-3 architecture adapter.

Parameters:

cfg – The configuration object.

prepare_loading(model_name: str, model_kwargs: dict) None

Patch cached Phi-3 remote code for transformers v5 compatibility.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Fold layer norms into joint QKV/gate_up projections.

Standard fold_ln can’t handle joint projections (shape mismatch on round-trip), so we scale the full joint weights directly.
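The direct scaling can be sketched as follows: a generic illustration of folding an RMSNorm scale into a fused projection, not the adapter’s actual code:

```python
import torch

def fold_rmsnorm_into_joint(w_joint: torch.Tensor, ln_w: torch.Tensor) -> torch.Tensor:
    # For a fused projection W applied to ln_w * rmsnorm(x), we have
    # W @ (ln_w * h) == (W * ln_w[None, :]) @ h, so the norm scale folds in
    # by scaling each input column of the joint weight.
    return w_joint * ln_w.unsqueeze(0)  # [out_dim, d_model] * [1, d_model]

torch.manual_seed(0)
w = torch.randn(10, 4)   # e.g. a joint QKV weight (out_dim covers Q, K, V)
ln_w = torch.randn(4)
h = torch.randn(4)       # stands in for rmsnorm(x)
folded = fold_rmsnorm_into_joint(w, ln_w)
```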

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Phi-3 component testing.

Parameters:
  • hf_model – The HuggingFace Phi-3 model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.PhiArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Phi models.

__init__(cfg: Any) None

Initialize the Phi architecture adapter.

Parameters:

cfg – The configuration object.

default_cfg: dict[str, Any] = {'use_fast': False}

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Phi component testing.

Phi uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Phi model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.PythiaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Pythia models.

__init__(cfg: Any) None

Initialize the Pythia architecture adapter.

Parameters:

cfg – The configuration object.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Pythia component testing.

Pythia uses RoPE (Rotary Position Embeddings) in the GPT-NeoX architecture. We need to set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Pythia model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]

Split the QKV matrix into separate linear transformations.

GPT-NeoX/Pythia uses an interleaved QKV format where the weights are stored as [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, …] - i.e., Q, K, V are interleaved per head.

The weight shape is [n_heads * 3 * d_head, d_model] and the output is reshaped by HuggingFace as [batch, seq, n_heads, 3*d_head] then split on the last dim.

Parameters:

original_attention_component – The original attention layer component

Returns:

Tuple of nn.Linear modules for Q, K, and V transformations
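The interleaved split can be sketched self-containedly (this mirrors the layout described above, not the adapter’s implementation):

```python
import torch
from torch import nn

def split_interleaved_qkv(qkv: nn.Linear, n_heads: int):
    # GPT-NeoX stores the fused weight as [Q_h0, K_h0, V_h0, Q_h1, ...]:
    # reshape to [n_heads, 3, d_head, d_model] and index the middle axis.
    d_model = qkv.weight.shape[1]
    d_head = qkv.weight.shape[0] // (3 * n_heads)
    w = qkv.weight.detach().reshape(n_heads, 3, d_head, d_model)
    b = qkv.bias.detach().reshape(n_heads, 3, d_head) if qkv.bias is not None else None
    layers = []
    for i in range(3):  # 0 = Q, 1 = K, 2 = V
        lin = nn.Linear(d_model, n_heads * d_head, bias=b is not None)
        with torch.no_grad():
            lin.weight.copy_(w[:, i].reshape(n_heads * d_head, d_model))
            if b is not None:
                lin.bias.copy_(b[:, i].reshape(-1))
        layers.append(lin)
    return tuple(layers)

# Check against the HF reshape-then-split convention described above:
torch.manual_seed(0)
n_heads, d_head, d_model = 2, 3, 6
qkv = nn.Linear(d_model, n_heads * 3 * d_head)
q, k, v = split_interleaved_qkv(qkv, n_heads)
x = torch.randn(1, 4, d_model)
qf, kf, vf = qkv(x).reshape(1, 4, n_heads, 3 * d_head).split(d_head, dim=-1)
```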

class transformer_lens.model_bridge.supported_architectures.Qwen2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Qwen2 models.

Optional Parameters (may not exist in state_dict):

Qwen2 models do NOT have biases on any linear layers:

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP input (up_proj)

  • blocks.{i}.mlp.b_gate - No bias on MLP gate projection

  • blocks.{i}.mlp.b_out - No bias on MLP output (down_proj)

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

Weight processing must handle these missing biases gracefully using ProcessWeights._safe_get_tensor() or by checking for None values.
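The graceful handling might look like the following stand-in. The real helper is ProcessWeights._safe_get_tensor, whose exact signature may differ; this is only an illustration of the pattern:

```python
import torch

def safe_get_tensor(state_dict, key, default_shape=None):
    # Return the tensor if present; otherwise a zero tensor of the given
    # shape (so folding code can treat missing biases as zeros), or None.
    if key in state_dict:
        return state_dict[key]
    if default_shape is not None:
        return torch.zeros(default_shape)
    return None

sd = {"blocks.0.attn.b_Q": torch.ones(4)}
present = safe_get_tensor(sd, "blocks.0.attn.b_Q")
missing = safe_get_tensor(sd, "blocks.0.attn.b_K", default_shape=(4,))
```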

__init__(cfg: Any) None

Initialize the Qwen2 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Qwen2 component testing.

Qwen2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Qwen2 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.Qwen3ArchitectureAdapter(cfg: Any, *, hybrid: bool = False)

Bases: ArchitectureAdapter

Architecture adapter for Qwen3 dense models.

RMSNorm, RoPE, GQA, Q/K head norms, gated MLP. No biases. Serves as base class for Qwen3.5 and Qwen3Next hybrid variants.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set eager attn on HF model and rotary_emb on attention bridges.

class transformer_lens.model_bridge.supported_architectures.Qwen3MoeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Qwen3MoE (Mixture of Experts) models.

Qwen3MoE is a sparse MoE decoder-only Transformer, structurally close to OLMoE. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Q/K normalization: RMSNorm applied to queries and keys after projection.

  • Sparse MoE: 128 experts with top-8 routing (public 30B-A3B checkpoints).

  • Batched expert parameters: gate_up_proj and down_proj as single 3D tensors, not a ModuleList.

  • final_rms=True (Qwen3-style; OLMoE uses False).

  • No biases on any projections.

  • GQA: n_key_value_heads < n_heads in all public checkpoints.

Only the all-MoE configuration is supported (decoder_sparse_step=1, mlp_only_layers=[]). Models with dense fallback layers cannot be wrapped because MoEBridge does not handle the dense Qwen3MoeMLP path.

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the Qwen3MoE architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Qwen3MoE component testing.

Qwen3MoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Qwen3MoE model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.Qwen3NextArchitectureAdapter(cfg: Any)

Bases: Qwen3ArchitectureAdapter

Hybrid linear-attention + full-attention with sparse MoE MLP.

Same hybrid design as Qwen3.5 but with MoE instead of dense MLP.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Slice query half from gated q_proj.weight for weight-space analysis.

class transformer_lens.model_bridge.supported_architectures.Qwen3_5ArchitectureAdapter(cfg: Any)

Bases: Qwen3ArchitectureAdapter

Hybrid linear-attention + full-attention with dense gated MLP.

Inherits Qwen3 config/attention/MLP structure. Differences:

  • Attention + linear_attn are optional (per-layer type)

  • Gated q_proj (2x wide) sliced by preprocess_weights for weight analysis

prepare_loading(model_name: str, model_kwargs: dict) None

Swap multimodal Qwen3_5Config for text-only Qwen3_5TextConfig.

Published checkpoints carry architectures=[‘Qwen3_5ForConditionalGeneration’]. We replace config with text_config so AutoModelForCausalLM loads the text-only Qwen3_5ForCausalLM.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Slice query half from gated q_proj.weight for weight-space analysis.

In processed mode, W_Q is the pure query projection (for composition scores, logit lens). Gate signal available in unprocessed mode on full-attention layers via blocks.N.attn.hook_q_gate.
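A minimal sketch of the slicing. The query-first row ordering is an assumption here; verify it against the actual checkpoint layout:

```python
import torch

def slice_query_half(q_proj_weight: torch.Tensor) -> torch.Tensor:
    # The gated q_proj is 2x wide; take the query half for weight-space
    # analysis. Assumes query rows come first and gate rows second.
    out_dim = q_proj_weight.shape[0]
    return q_proj_weight[: out_dim // 2]

w = torch.arange(24, dtype=torch.float32).reshape(8, 3)  # [2 * q_out, d_model]
w_q = slice_query_half(w)
```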

class transformer_lens.model_bridge.supported_architectures.QwenArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Qwen models.

__init__(cfg: Any) None

Initialize the Qwen architecture adapter.

class transformer_lens.model_bridge.supported_architectures.StableLmArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for StableLM models.

StableLM uses a Llama-like architecture with separate Q/K/V projections and gated MLP, but differs in using standard LayerNorm (not RMSNorm) and partial rotary embeddings (25% of head dimensions by default).

Supports optional features:

  • Grouped Query Attention (num_key_value_heads != num_attention_heads)

  • QKV bias (use_qkv_bias=True on some models like stable-code-3b)

  • Parallel residual connections (use_parallel_residual=True)

  • Per-head QK LayerNorm (qk_layernorm=True)

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - Only present when use_qkv_bias=True

  • blocks.{i}.attn.b_K - Only present when use_qkv_bias=True

  • blocks.{i}.attn.b_V - Only present when use_qkv_bias=True

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP up_proj

  • blocks.{i}.mlp.b_gate - No bias on MLP gate_proj

  • blocks.{i}.mlp.b_out - No bias on MLP down_proj

__init__(cfg: Any) None

Initialize the StableLM architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for StableLM component testing.

StableLM uses RoPE (Rotary Position Embeddings) with partial rotation. We set the rotary_emb reference on all attention bridge instances and force eager attention for numerical consistency.

Parameters:
  • hf_model – The HuggingFace StableLM model instance

  • bridge_model – The TransformerBridge model (if available)

setup_hook_compatibility(bridge: Any) None

Inject hook points for QK LayerNorm on models with qk_layernorm=True.

StableLM v2 models (e.g., stablelm-2-12b) apply per-head LayerNorm to Q and K after projection but before rotary embedding. The native HF attention handles this internally, but we inject hooks so researchers can observe/intervene on the post-norm Q/K values.

Adds to each attention bridge:
  • hook_q_layernorm: fires after q_layernorm(query_states)

  • hook_k_layernorm: fires after k_layernorm(key_states)

This runs during bridge __init__ via _setup_hook_compatibility(), after component setup but before hook registry finalization. The hook registry scanner skips _original_component subtrees, so we register hooks directly in bridge._hook_registry with canonical TL-style names.

Parameters:

bridge – The TransformerBridge instance (fully initialized)
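What the injected hooks observe can be illustrated with per-head LayerNorm applied to projected queries (shapes and tensor layout are illustrative, not taken from the bridge):

```python
import torch
from torch import nn

# StableLM v2 applies LayerNorm over each head's d_head dimensions, after
# the Q/K projection and before rotary embedding. hook_q_layernorm would
# observe the normalized queries computed below.
batch, seq, n_heads, d_head = 2, 5, 4, 8
q_layernorm = nn.LayerNorm(d_head)
q = torch.randn(batch, seq, n_heads, d_head)
q_normed = q_layernorm(q)  # the value exposed at hook_q_layernorm
```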

class transformer_lens.model_bridge.supported_architectures.T5ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for T5 models.

T5 is an encoder-decoder model with:

  • Shared embeddings

  • Encoder stack (self-attention + FFN)

  • Decoder stack (self-attention + cross-attention + FFN)

  • Language modeling head

Supports both standard T5 (DenseReluDense with wi/wo) and gated variants like Flan-T5 (T5DenseGatedActDense with wi_0/wi_1/wo).

__init__(cfg: Any) None

Initialize the T5 architecture adapter.

Parameters:

cfg – The configuration object.

class transformer_lens.model_bridge.supported_architectures.XGLMArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for XGLM models.

XGLM uses pre-norm LayerNorm, sinusoidal positional embeddings (no learnable weights), standard MHA with separate q/k/v/out_proj, and a 2-layer MLP (fc1/fc2) that lives directly on the decoder block rather than inside an mlp sub-module.

All attention projections and fc1/fc2 carry biases. lm_head has no bias. Embeddings are scaled by sqrt(d_model) at runtime in XGLMScaledWordEmbedding.

Optional Parameters (may not exist in state_dict):

None — all published XGLM checkpoints include all parameters listed above.

__init__(cfg: Any) None

Initialize the XGLM architecture adapter.

setup_hook_compatibility(bridge: Any) None

Scale hook_embed by sqrt(d_model) to match XGLMScaledWordEmbedding.forward().

XGLMScaledWordEmbedding multiplies the embedding lookup by embed_scale = sqrt(d_model) at runtime. Without this override, hook_embed would capture the raw (unscaled) table output, diverging from actual model activations.
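The override can be sketched as a simple scaling hook (the hook signature follows TransformerLens conventions; this is illustrative, not the bridge’s actual code):

```python
import math
import torch

def make_embed_scale_hook(d_model: int):
    # Multiply the raw embedding lookup by sqrt(d_model), mirroring
    # XGLMScaledWordEmbedding's runtime embed_scale, so hook_embed matches
    # the activations the model actually computes.
    scale = math.sqrt(d_model)
    def scale_hook(embed, hook=None):
        return embed * scale
    return scale_hook

scaled = make_embed_scale_hook(1024)(torch.ones(1, 3, 1024))
```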