transformer_lens.model_bridge.supported_architectures package

Submodules

Module contents

Supported architecture adapters.

This module contains the architecture adapters for all supported model architectures.

class transformer_lens.model_bridge.supported_architectures.ApertusArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Apertus models.

Apertus uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE with LLaMA-3 scaling), grouped query attention (GQA), non-gated MLP (XiELU activation), and no biases on any projections.

Similar to Qwen3 (pre-norm RMSNorm, QK-norm, GQA, RoPE) but uses a non-gated MLP (up_proj -> XiELU -> down_proj) instead of gated MLP.

Note: Apertus uses different layer norm names than most Llama-family models:

  • attention_layernorm (instead of input_layernorm)

  • feedforward_layernorm (instead of post_attention_layernorm)

__init__(cfg: Any) None

Initialize the Apertus architecture adapter.

prepare_loading(model_name: str, model_kwargs: dict) None

Patch XIELUActivation to defer eager .item() calls for meta tensor compat.

Transformers v5 uses meta tensors during from_pretrained, but XIELUActivation.__init__ eagerly calls .item() on beta/eps buffers to precompute _beta_scalar/_eps_scalar for the CUDA kernel path. This fails on meta device. Once upstream fixes this (transformers PR #43473), this patch can be removed.

Instead of reimplementing __init__, we wrap it to catch the meta tensor failure and defer scalar computation to forward() time.
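The wrap-and-defer pattern described above can be sketched on a toy class. Everything here is an illustrative stand-in, not the real transformers XIELUActivation or the adapter's actual patch:

```python
class EagerActivation:
    """Toy stand-in for an activation that eagerly computes a scalar in
    __init__, like the .item() call described above."""

    def __init__(self, beta):
        self.beta = beta
        if beta is None:  # stand-in for a meta tensor with no real data
            raise RuntimeError("cannot materialize scalar from meta tensor")
        self._beta_scalar = float(beta)

    def forward(self, x):
        return x * self._beta_scalar


def patch_defer_init(cls):
    """Wrap __init__ to catch the eager failure and defer the scalar
    computation to forward() time, instead of reimplementing __init__."""
    original_init, original_forward = cls.__init__, cls.forward

    def wrapped_init(self, beta):
        try:
            original_init(self, beta)
        except RuntimeError:
            self.beta = beta          # keep the (meta) value around
            self._beta_scalar = None  # defer until real data exists

    def wrapped_forward(self, x):
        if self._beta_scalar is None:
            self._beta_scalar = float(self.beta)  # now safe to materialize
        return original_forward(self, x)

    cls.__init__, cls.forward = wrapped_init, wrapped_forward


patch_defer_init(EagerActivation)
act = EagerActivation(None)  # construction no longer raises
act.beta = 3.0               # real value arrives later, e.g. after weight loading
result = act.forward(2.0)
```

Wrapping rather than reimplementing keeps the patch robust to upstream changes in the wrapped __init__.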

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Apertus component testing.

Apertus uses RoPE (Rotary Position Embeddings). We set the rotary_emb on all attention bridge instances for component testing.

We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks.

Parameters:
  • hf_model – The HuggingFace Apertus model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.BertArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for BERT models.

__init__(cfg: Any) None

Initialize the BERT architecture adapter.

Parameters:

cfg – The configuration object.

prepare_model(hf_model: Any) None

Adjust component mapping based on the actual HF model variant.

BertForMaskedLM has cls.predictions (MLM head). BertForNextSentencePrediction has cls.seq_relationship (NSP head) and no MLM-specific LayerNorm.

class transformer_lens.model_bridge.supported_architectures.BloomArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Bloom models.

__init__(cfg: Any) None

Initialize the Bloom architecture adapter.

split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]

Split the QKV matrix into separate linear transformations.

Parameters:

original_attention_component – The original attention layer component

Returns:

Tuple of nn.Linear modules for Q, K, and V transformations

class transformer_lens.model_bridge.supported_architectures.CodeGenArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for CodeGen models.

CodeGen uses a parallel attention+MLP block (attn and MLP share the same LayerNorm input and their outputs are summed). The attention layer uses a fused qkv_proj weight whose layout follows GPT-J’s mp_num=4 tensor-parallel partitioning: the rows are interleaved as [Q_part, V_part, K_part] within each of the 4 MP partitions.

Optional Parameters (may be absent in some CodeGen checkpoints):

  • No bias on qkv_proj (fused QKV has no bias)

  • No bias on out_proj

  • No bias on mlp.fc_in or mlp.fc_out

__init__(cfg: Any) None

Initialize the CodeGen architecture adapter.

split_qkv_matrix(attn_component: Any) tuple[Linear, Linear, Linear]

Split the fused QKV weight into separate Q, K, V linear modules.

CodeGen uses GPT-J-style tensor-parallel partitioning with mp_num=4 partitions. Within each partition the row order is [Q_part, V_part, K_part], i.e. not the conventional Q/K/V order.

The fused weight has shape [3 * n_embd, n_embd]. We reshape to [mp_num, 3, local_dim, n_embd], extract the three slices, then flatten back to [n_embd, n_embd] for each of Q, K, V.

Parameters:

attn_component – The original CodeGenAttention module.

Returns:

Tuple of (q_linear, k_linear, v_linear) — three nn.Linear modules with no bias and weight shape [n_embd, n_embd].
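The reshape described above can be illustrated on a toy fused weight (toy sizes; the real adapter operates on the checkpoint's qkv_proj):

```python
import torch

# Toy dimensions; real CodeGen uses a much larger n_embd but mp_num=4.
n_embd, mp_num = 8, 4
local_dim = n_embd // mp_num  # rows per Q/V/K slice within one partition

fused = torch.arange(3 * n_embd * n_embd, dtype=torch.float32)
fused = fused.reshape(3 * n_embd, n_embd)  # fused qkv_proj weight

# [mp_num, 3, local_dim, n_embd]: axis 1 indexes the per-partition
# [Q_part, V_part, K_part] order (note: V before K).
parts = fused.reshape(mp_num, 3, local_dim, n_embd)
w_q = parts[:, 0].reshape(n_embd, n_embd)
w_v = parts[:, 1].reshape(n_embd, n_embd)
w_k = parts[:, 2].reshape(n_embd, n_embd)
```

Within the first partition, rows 0..local_dim-1 of the fused weight land in w_q, the next local_dim rows in w_v, and the next in w_k.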

class transformer_lens.model_bridge.supported_architectures.CohereArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Cohere models (CohereForCausalLM).

Architectural quirks vs. standard decoder-only models:

  • Single input_layernorm per block; NO post_attention_layernorm. Attention and MLP both read the SAME normed hidden states (parallel).

  • CohereLayerNorm is true LayerNorm (mean-subtracting), NOT RMSNorm. It has a weight parameter but NO bias parameter.

  • Logit scale: CohereForCausalLM.forward multiplies logits by logit_scale (default 0.0625 = 1/16). Folded into unembed.weight via preprocess_weights.

  • Rotary embeddings use repeat_interleave instead of cat-split (delegated to HF).

Optional parameters (absent from state_dict by default):

  • blocks.{i}.attn.b_Q/b_K/b_V/b_O — no bias on projections (attention_bias=False)

  • blocks.{i}.mlp.b_gate/b_in/b_out — no bias on MLP projections

  • blocks.{i}.ln1.b — CohereLayerNorm has no bias

  • ln_final.b — CohereLayerNorm has no bias

__init__(cfg: Any) None

Initialize the Cohere architecture adapter.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Fold logit_scale into unembed weights before ProcessWeights runs.

bridge.py lines 726-732 clone unembed.weight before calling this, so scaling does not affect the tied embed.weight. logit_scale=1.0 is a no-op (skipped for efficiency).
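The fold is equivalent to scaling the logits after the matmul, which a quick check confirms (toy shapes and random weights, not the adapter's actual code):

```python
import torch

# Default logit_scale mentioned above: 0.0625 = 1/16.
logit_scale = 0.0625
d_model, d_vocab = 4, 6

torch.manual_seed(0)
resid = torch.randn(2, d_model)       # final residual stream
W_U = torch.randn(d_model, d_vocab)   # unembed weight

# HF applies the scale to the logits; the adapter folds it into W_U.
logits_hf_style = (resid @ W_U) * logit_scale
logits_folded = resid @ (W_U * logit_scale)
same = torch.allclose(logits_hf_style, logits_folded)
```

Because the fold touches only unembed.weight, cloning it first (as noted above) keeps the tied embed.weight unscaled.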

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set rotary embedding reference on attention bridges for component testing.

CohereRotaryEmbedding lives at hf_model.model.rotary_emb. The bridge delegates to it directly, preserving the repeat_interleave RoPE convention without re-implementing it in TL.

Pattern matches llama.py and qwen2.py.

class transformer_lens.model_bridge.supported_architectures.DeepSeekV3ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for DeepSeek V3 / R1 models.

Uses RMSNorm, MLA with compressed Q/KV projections, partial RoPE, MoE on most layers (dense MLP on first few), and no biases.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for component testing.

class transformer_lens.model_bridge.supported_architectures.FalconArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Falcon models (FalconForCausalLM).

prepare_model(hf_model: Any) None

Patch Falcon modules to avoid backward hook conflicts.

Two issues:

  1. FalconLinear does input @ self.weight.T where .T is a view — clone the transpose to break the view chain.

  2. FalconDecoderLayer does mlp_output += attention_output (inplace) — this modifies a tensor captured by mlp.hook_out’s backward hook. Patch to use non-inplace addition.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for component testing.

class transformer_lens.model_bridge.supported_architectures.GPT2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPT2 models.

Optional Parameters (may not exist in state_dict):

GPT-2 models HAVE biases on ALL linear layers:

✓ blocks.{i}.attn.b_Q - Has bias (from combined c_attn.bias)

✓ blocks.{i}.attn.b_K - Has bias (from combined c_attn.bias)

✓ blocks.{i}.attn.b_V - Has bias (from combined c_attn.bias)

✓ blocks.{i}.attn.b_O - Has bias (c_proj.bias)

✓ blocks.{i}.mlp.b_in - Has bias (c_fc.bias)

✓ blocks.{i}.mlp.b_out - Has bias (c_proj.bias)

✓ blocks.{i}.ln1.b - LayerNorm has bias

✓ blocks.{i}.ln2.b - LayerNorm has bias

✓ ln_final.b - LayerNorm has bias

No optional parameters - all biases exist in GPT-2.
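The "combined c_attn.bias" noted above is a single [3*d_model] vector holding the Q, K, and V biases concatenated in that order; splitting it is a plain chunking (toy d_model for illustration):

```python
import torch

d_model = 6
# GPT-2 stores one fused bias for Q, K, V, concatenated in that order.
c_attn_bias = torch.arange(3 * d_model, dtype=torch.float32)
b_Q, b_K, b_V = c_attn_bias.split(d_model)
```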

__init__(cfg: Any) None

Initialize the GPT2 architecture adapter.

class transformer_lens.model_bridge.supported_architectures.GPTBigCodeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPTBigCode models.

GPTBigCode is a GPT-2 variant using Multi-Query Attention (MQA): a single fused c_attn projection whose output splits asymmetrically into [embed_dim, head_dim, head_dim] for Q/K/V (rather than three equal thirds). All other structure (module paths, LayerNorm, learned pos embeddings, standard MLP) is identical to GPT-2.

All public models use multi_query=True (1 KV head). The adapter assumes MQA throughout.

All linear layers have biases (c_attn, c_proj, c_fc, mlp.c_proj). lm_head has no bias and its weight is tied to transformer.wte.weight.

Weight layout difference from GPT-2: GPTBigCode uses nn.Linear (weights stored [out, in]) rather than GPT-2’s Conv1D ([in, out]), so no unembed weight transpose is needed.
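The asymmetric MQA split described above can be sketched on a toy fused activation (toy sizes; variable names are illustrative):

```python
import torch

embed_dim, n_heads = 8, 4
head_dim = embed_dim // n_heads  # 2

# Fused c_attn output: full-width Q, then one K head, then one V head
# (rather than three equal thirds as in GPT-2).
fused = torch.randn(1, 3, embed_dim + 2 * head_dim)  # [batch, seq, ...]
q, k, v = fused.split([embed_dim, head_dim, head_dim], dim=-1)
```

All query heads share the single K/V head, which is what makes the split asymmetric.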

class transformer_lens.model_bridge.supported_architectures.GPTOSSArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPT-OSS model.

__init__(cfg: Any) None

Initialize the GPT-OSS architecture adapter.

setup_hook_compatibility(bridge_model: Any) None

Setup hook compatibility transformations for GPT-OSS models.

This configures rotary embedding references for attention layers, which is needed for models using RoPE (Rotary Position Embeddings).

This is called during Bridge.__init__ and should always be run.

Parameters:

bridge_model – The TransformerBridge instance

setup_no_processing_hooks(bridge_model: Any) None

Backward compatibility alias for setup_hook_compatibility.

class transformer_lens.model_bridge.supported_architectures.Gemma1ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma1 models.

__init__(cfg: Any) None

Initialize the Gemma1 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Gemma1 component testing.

Gemma1 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Gemma1 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma1 models.

Gemma1 scales embeddings by sqrt(d_model) in its forward pass, but the HuggingFace embed_tokens layer doesn’t include this scaling. We need to apply it to hook_embed to match HookedTransformer behavior.

Parameters:

bridge – The TransformerBridge instance
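A minimal sketch of applying the sqrt(d_model) scaling via a forward hook, in plain PyTorch (the bridge's actual hook machinery differs; this only shows the scaling itself):

```python
import math

import torch

d_model = 16
embed = torch.nn.Embedding(10, d_model)

def scale_embeddings(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module output,
    # reproducing the sqrt(d_model) scaling Gemma1 applies after embedding.
    return output * math.sqrt(d_model)

embed.register_forward_hook(scale_embeddings)

tokens = torch.tensor([[1, 2, 3]])
scaled = embed(tokens)
```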

class transformer_lens.model_bridge.supported_architectures.Gemma2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma2 models.

__init__(cfg: Any) None

Initialize the Gemma2 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references and attention implementation for Gemma-2 component testing.

Gemma-2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks, while HF defaults to “sdpa”. These produce mathematically equivalent results but with small numerical differences due to different implementations.

Parameters:
  • hf_model – The HuggingFace Gemma-2 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma2 models.

Gemma2 scales embeddings by sqrt(d_model). The weights are pre-scaled via preprocess_weights(), but we still need to apply the scaling conversion to the hook output for proper hook functionality (so user modifications are correctly scaled/unscaled).

Parameters:

bridge – The TransformerBridge instance

class transformer_lens.model_bridge.supported_architectures.Gemma3ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma3 models.

__init__(cfg: Any) None

Initialize the Gemma3 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references and native autograd for Gemma-3 component testing.

Gemma-3 uses dual RoPE (global + local). We set local RoPE (used by 85% of layers) on all attention bridge instances for component testing.

We also enable use_native_layernorm_autograd on all normalization bridges to ensure they delegate to HuggingFace’s exact implementation instead of using manual computation.

Additionally, we force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks, while HF defaults to “sdpa”. These produce mathematically equivalent results but with small numerical differences due to different implementations.

Note: Layers 5, 11, 17, 23 use global RoPE but will use local in component tests. This is an acceptable tradeoff given the shared-instance constraint.

Parameters:
  • hf_model – The HuggingFace Gemma-3 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma3 models.

Unlike Gemma1/Gemma2, Gemma3 uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) INSIDE the embedding layer’s forward(). Therefore we do NOT need a hook_conversion — the embed.hook_out already captures the scaled output. Adding a conversion would double-scale.

(Gemma1/Gemma2 scale in GemmaModel.forward() AFTER the embedding layer, so their adapters correctly use EmbeddingScaleConversion to match HT.)

Parameters:

bridge – The TransformerBridge instance

class transformer_lens.model_bridge.supported_architectures.Gemma3MultimodalArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma3 multimodal models (Gemma3ForConditionalGeneration).

This adapter handles vision-language models like Gemma 3 4B/12B/27B and MedGemma. The model structure is:

  • model.vision_tower: SigLIP vision encoder

  • model.multi_modal_projector: Projects vision embeddings to language space

  • model.language_model: Gemma3TextModel (same as text-only Gemma 3)

  • lm_head: Output projection

The language model component follows the same patterns as Gemma3ArchitectureAdapter.

__init__(cfg: Any) None

Initialize the Gemma3 multimodal architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Gemma-3 multimodal component testing.

The language model uses dual RoPE (global + local) like text-only Gemma 3.

Parameters:
  • hf_model – The HuggingFace Gemma-3 multimodal model instance

  • bridge_model – The TransformerBridge model (if available)

setup_hook_compatibility(bridge: Any) None

Setup hook compatibility for Gemma3 multimodal models.

Like text-only Gemma 3, the multimodal model uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) internally in its forward() method. No additional hook conversion is needed — adding one would double-scale the embeddings.

Parameters:

bridge – The TransformerBridge instance

class transformer_lens.model_bridge.supported_architectures.Gpt2LmHeadCustomArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPT-2 LM Head Custom models.

__init__(cfg: Any) None

Initialize the GPT-2 LM Head Custom architecture adapter.

class transformer_lens.model_bridge.supported_architectures.GptjArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for GPTJ models.

__init__(cfg: Any) None

Initialize the GPTJ architecture adapter.

class transformer_lens.model_bridge.supported_architectures.GraniteArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for IBM Granite models (dense).

Granite is a Llama-like architecture with RMSNorm, rotary position embeddings (RoPE), GQA, and a gated MLP (SiLU activation). Granite-specific scaling multipliers are handled by the HF model’s native forward pass.

Optional Parameters (may not exist in state_dict):

Granite models do NOT have biases on attention and MLP projections:

  • blocks.{i}.attn.b_Q/b_K/b_V/b_O - No bias on attention projections

  • blocks.{i}.mlp.b_in/b_gate/b_out - No bias on MLP projections

  • blocks.{i}.ln1.b, blocks.{i}.ln2.b, ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the Granite architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Granite component testing.

Parameters:
  • hf_model – The HuggingFace Granite model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.GraniteMoeArchitectureAdapter(cfg: Any)

Bases: GraniteArchitectureAdapter

Architecture adapter for IBM Granite MoE models.

Identical to dense Granite but replaces the gated MLP with a Sparse Mixture of Experts block (block_sparse_moe) using batched expert parameters and top-k routing.

class transformer_lens.model_bridge.supported_architectures.GraniteMoeHybridArchitectureAdapter(cfg: Any)

Bases: GraniteArchitectureAdapter

Hybrid Mamba2 + Attention with Sparse MoE.

Attention is optional (absent on Mamba layers). shared_mlp and MoE are universal. Inherits Granite config and attention bridge construction.

class transformer_lens.model_bridge.supported_architectures.HubertArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for HuBERT audio models.

HubertForCTC nests HubertModel under a ‘hubert.’ prefix; prepare_model() detects this and adjusts component paths.

prepare_loading(model_name: str, model_kwargs: dict) None

Propagate HuBERT-specific HF config attributes to bridge config.

Prevents silent-default bugs where adapter reads from bridge config but the attribute was never propagated from HF config.

prepare_model(hf_model: Any) None

Detect HubertForCTC (has ‘hubert.’ prefix) and add CTC head.

class transformer_lens.model_bridge.supported_architectures.InternLM2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for InternLM2 models.

InternLM2 uses remote code (trust_remote_code=True) and differs from Llama in:

  • Fused interleaved GQA wqkv weight (not standard [Q|K|V] split)

  • Non-standard module names: tok_embeddings, output, attention, feed_forward, wqkv/wo, w1(gate)/w3(up)/w2(down), attention_norm, ffn_norm

  • Per-layer rotary_emb (no model-level shared instance)

  • supports_fold_ln=False: fold_ln is done manually in preprocess_weights because the bridge state dict has the fused qkv key, not split q/k/v keys, so fold_layer_norm’s extract_attention_tensors_for_folding would silently skip attn.

Optional parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q / b_K / b_V / b_O — config.bias=False on shipped models

  • blocks.{i}.mlp.b_gate / b_in / b_out — MLP always bias=False

  • blocks.{i}.ln1.b / ln2.b / ln_final.b — RMSNorm has no bias

prepare_loading(model_name: str, model_kwargs: dict) None

Patch transformers v5 incompatibilities before from_pretrained runs.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Fold layer norms into QKV and MLP weights.

Standard fold_ln can’t reach split Q/K/V when wqkv is fused in the bridge state dict. We extract and fold here, then write split keys so RearrangeTensorConversion can follow. MLP projections (w1/w2/w3) are separate linears so they fold normally. Mirrors phi3.py.preprocess_weights, adapted for InternLM2’s layout.
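The underlying identity, folding an RMSNorm scale g into a downstream weight W so the norm can be treated as unit-weighted, can be checked directly (generic sketch, not the adapter's actual code):

```python
import torch

torch.manual_seed(0)
d_model, d_out = 4, 6
g = torch.randn(d_model)         # RMSNorm scale
W = torch.randn(d_out, d_model)  # downstream linear weight, [out, in]
x = torch.randn(3, d_model)

rms = x / torch.sqrt((x ** 2).mean(dim=-1, keepdim=True) + 1e-6)

out_unfolded = (rms * g) @ W.T  # scale applied by the norm
out_folded = rms @ (W * g).T    # scale folded into the weight
same = torch.allclose(out_unfolded, out_folded, atol=1e-5)
```

The adapter applies this per projection after splitting the fused wqkv, since standard fold_ln only sees split q/k/v keys.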

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Inject per-layer rotary embedding for component testing.

class transformer_lens.model_bridge.supported_architectures.LlamaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Llama models.

Optional Parameters (may not exist in state_dict):

LLaMA models do NOT have biases on attention and MLP projections:

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP input (up_proj)

  • blocks.{i}.mlp.b_gate - No bias on MLP gate projection

  • blocks.{i}.mlp.b_out - No bias on MLP output (down_proj)

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

Weight processing must handle these missing biases gracefully using ProcessWeights._safe_get_tensor() or by checking for None values.

__init__(cfg: Any) None

Initialize the Llama architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Llama component testing.

Llama uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Llama model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.LlavaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for LLaVA multimodal models (LlavaForConditionalGeneration).

This adapter handles vision-language models like LLaVA 1.5. The model structure is:

  • model.vision_tower: CLIP vision encoder

  • model.multi_modal_projector: 2-layer MLP (Linear -> GELU -> Linear)

  • model.language_model: LlamaForCausalLM

  • model.language_model.model.embed_tokens

  • model.language_model.model.layers[]: LLaMA transformer blocks

  • model.language_model.model.norm

  • model.language_model.lm_head

The language model component follows the same patterns as LlamaArchitectureAdapter.

__init__(cfg: Any) None

Initialize the LLaVA architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for LLaVA component testing.

LLaVA uses a LLaMA language backbone with RoPE. We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace LLaVA model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.LlavaNextArchitectureAdapter(cfg: Any)

Bases: LlavaArchitectureAdapter

Architecture adapter for LLaVA-NeXT (1.6) models.

class transformer_lens.model_bridge.supported_architectures.LlavaOnevisionArchitectureAdapter(cfg: Any)

Bases: LlavaArchitectureAdapter

Architecture adapter for LLaVA-OneVision models.

prepare_model(hf_model: Any) None

Fix weight tying when text_config and top-level config disagree.

Some checkpoints have tie_word_embeddings=True in text_config but False at the top level, leaving lm_head randomly initialized.

class transformer_lens.model_bridge.supported_architectures.MPTArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

MPT adapter: ALiBi bias; all layers bias-free (no b_Q/b_K/b_V/b_O/b_in/b_out/ln bias).

class transformer_lens.model_bridge.supported_architectures.Mamba2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Wraps HF’s Mamba2ForCausalLM.

Differs from Mamba-1 at the mixer level: fused in_proj (no x_proj/dt_proj), two-input inner norm, multi-head structure with num_heads/head_dim/n_groups, and a [num_heads]-shaped dt_bias. Shares SSMBlockBridge, DepthwiseConv1DBridge, and the stateful generation loop with Mamba-1.

applicable_phases: list[int] = []

create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: dtype) Any

Build a Mamba2Cache for the stateful generation loop.

class transformer_lens.model_bridge.supported_architectures.MambaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Wraps HF’s MambaForCausalLM. No attention, no positional embeddings.

SSM config fields (state_size, conv_kernel, expand, time_step_rank, intermediate_size) are propagated from the HF config via _HF_PASSTHROUGH_ATTRS in sources/transformers.py.

applicable_phases: list[int] = []

create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: dtype) Any

Build a MambaCache for the stateful generation loop.

class transformer_lens.model_bridge.supported_architectures.MingptArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for MinGPT models.

__init__(cfg: Any) None

Initialize the MinGPT architecture adapter.

Parameters:

cfg – The configuration object.

class transformer_lens.model_bridge.supported_architectures.MistralArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Mistral models.

__init__(cfg: Any) None

Initialize the Mistral architecture adapter.

class transformer_lens.model_bridge.supported_architectures.MixtralArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Mixtral models.

Mixtral uses a pre-norm architecture with RMSNorm, rotary position embeddings (RoPE), and a Sparse Mixture of Experts MLP. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Rotary embeddings: stored at model.rotary_emb and passed per-forward-call.

  • Sparse MoE: batched expert parameters (gate_up_proj, down_proj as 3D tensors).

  • MixtralAttention.forward() requires position_embeddings and attention_mask args.

  • Optional GQA (n_key_value_heads may differ from n_heads).

__init__(cfg: Any) None

Initialize the Mixtral architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Mixtral component testing.

Mixtral uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Mixtral model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.NanogptArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for NanoGPT models.

__init__(cfg: Any) None

Initialize the NanoGPT architecture adapter.

Parameters:

cfg – The configuration object.

convert_weights(remote_module: Any) dict[str, Tensor]

class transformer_lens.model_bridge.supported_architectures.NeelSoluOldArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Neel’s SOLU models (old style).

__init__(cfg: Any) None

Initialize the Neel SOLU old-style architecture adapter.

Parameters:

cfg – The configuration object.

class transformer_lens.model_bridge.supported_architectures.NeoArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Neo models.

__init__(cfg: Any) None

Initialize the Neo architecture adapter.

class transformer_lens.model_bridge.supported_architectures.NeoxArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for NeoX models.

__init__(cfg: Any) None

Initialize the NeoX architecture adapter.

Parameters:

cfg – The configuration object.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for GPT-NeoX/StableLM component testing.

GPT-NeoX models use RoPE (Rotary Position Embeddings) which need to be set on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace GPT-NeoX model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]

Split the QKV matrix into separate linear transformations.

GPT-NeoX/StableLM uses an interleaved QKV format where the weights are stored as [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, …] - i.e., Q, K, V are interleaved per head.

The weight shape is [n_heads * 3 * d_head, d_model] and the output is reshaped by HuggingFace as [batch, seq, n_heads, 3*d_head] then split on the last dim.

Parameters:

original_attention_component – The original attention layer component

Returns:

Tuple of nn.Linear modules for Q, K, and V transformations
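The per-head interleaving can be illustrated with a toy fused weight (toy sizes; the real method builds nn.Linear modules from the checkpoint weight):

```python
import torch

n_heads, d_head, d_model = 4, 2, 8

# Fused weight rows: [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, ...]
W_qkv = torch.arange(n_heads * 3 * d_head * d_model, dtype=torch.float32)
W_qkv = W_qkv.reshape(n_heads * 3 * d_head, d_model)

# [n_heads, 3, d_head, d_model]: axis 1 indexes Q/K/V within each head.
per_head = W_qkv.reshape(n_heads, 3, d_head, d_model)
W_Q = per_head[:, 0].reshape(n_heads * d_head, d_model)
W_K = per_head[:, 1].reshape(n_heads * d_head, d_model)
W_V = per_head[:, 2].reshape(n_heads * d_head, d_model)
```

Head h's query rows sit at offset 3*h*d_head in the fused weight, its key rows d_head further on, and its value rows d_head after that.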

class transformer_lens.model_bridge.supported_architectures.Olmo2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMo 2 models.

OLMo 2 uses a post-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and gated MLP (SwiGLU). Key differences from pre-norm models like Llama:

  • Post-norm: RMSNorm is applied AFTER attention and AFTER MLP, not before. ln1 maps to post_attention_layernorm, ln2 maps to post_feedforward_layernorm.

  • Q/K normalization: Per-head RMSNorm applied to queries and keys after projection.

  • No biases on any projections.

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP up_proj

  • blocks.{i}.mlp.b_gate - No bias on MLP gate_proj

  • blocks.{i}.mlp.b_out - No bias on MLP down_proj

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the OLMo 2 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for OLMo 2 component testing.

OLMo 2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

We also force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks.

Parameters:
  • hf_model – The HuggingFace OLMo 2 model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.Olmo3ArchitectureAdapter(cfg: Any)

Bases: Olmo2ArchitectureAdapter

Architecture adapter for OLMo 3 / OLMo 3.1 models.

OLMo 3 is architecturally identical to OLMo 2 at the weight and component level. The only difference is sliding window attention on some layers (configurable via layer_types), which is handled by the HF model’s forward pass (mask creation) and does not affect weight structure or component mapping.

class transformer_lens.model_bridge.supported_architectures.OlmoArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMo (v1) models.

OLMo v1 uses a pre-norm architecture with a custom non-learnable LayerNorm (fixed weight=1, bias=0), rotary position embeddings (RoPE), and gated MLP (SwiGLU). Key differences from later OLMo variants:

  • Pre-norm: LayerNorm is applied BEFORE attention and BEFORE MLP.

  • Non-learnable LayerNorm: Weight and bias are not trainable parameters. Delegating to HF’s native forward via NormalizationBridge handles this correctly.

  • No Q/K normalization in attention.

  • Optional QKV clipping (handled by HF’s native attention forward).

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP up_proj

  • blocks.{i}.mlp.b_gate - No bias on MLP gate_proj

  • blocks.{i}.mlp.b_out - No bias on MLP down_proj

__init__(cfg: Any) None

Initialize the OLMo architecture adapter.

prepare_model(hf_model: Any) None

Patch OLMo’s in-place clamp_ to avoid backward hook conflicts.

OLMo v1 uses query_states.clamp_() when config.clip_qkv is set. In-place ops on tensors that pass through register_full_backward_hook trigger PyTorch’s “view modified inplace” error. This patch disables the in-place clamp branch during attention forward passes.

Note: clip_qkv clamping is skipped in the patched forward. In practice clip_qkv values (typically 100+) rarely activate. If exact clamping is needed, add out-of-place clamp hooks on hook_q/hook_k/hook_v.
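The out-of-place alternative mentioned in the note can be sketched as a hook function. The hook names and the add_hook call follow TransformerLens conventions and are illustrative, not taken from this adapter:

```python
import torch

def make_clip_hook(clip_qkv: float):
    # torch.clamp returns a new tensor, unlike tensor.clamp_(), so it is
    # safe to combine with register_full_backward_hook.
    def clip_hook(value, hook=None):
        return torch.clamp(value, min=-clip_qkv, max=clip_qkv)
    return clip_hook

# Illustrative usage with the hook names from the note above:
# for name in ("hook_q", "hook_k", "hook_v"):
#     model.add_hook(f"blocks.0.attn.{name}", make_clip_hook(100.0))
clipped = make_clip_hook(8.0)(torch.randn(2, 3, 4) * 100)
```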

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for OLMo component testing.

OLMo uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace OLMo model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.OlmoeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMoE (Mixture of Experts) models.

OLMoE uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and sparse Mixture of Experts MLP. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Q/K normalization: RMSNorm applied to queries and keys after projection.

  • Sparse MoE: 64 experts with top-8 routing (configurable).

  • Batched expert parameters: gate_up_proj [num_experts, 2*d_mlp, d_model] and down_proj [num_experts, d_model, d_mlp] as single tensors, not a ModuleList.

  • Optional QKV clipping (handled by HF’s native attention forward).

  • No biases on any projections.
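The batched-expert layout can be exercised with a small top-k routing sketch. The gate/up ordering, SiLU activation, and renormalization of the top-k probabilities are assumptions here; check the actual OLMoE configuration:

```python
import torch
import torch.nn.functional as F

def batched_moe_forward(x, router_w, gate_up_proj, down_proj, top_k=2):
    # x: [tokens, d_model]            router_w: [num_experts, d_model]
    # gate_up_proj: [num_experts, 2*d_mlp, d_model]
    # down_proj:    [num_experts, d_model, d_mlp]
    probs = F.softmax(x @ router_w.T, dim=-1)          # [tokens, num_experts]
    weights, idx = torch.topk(probs, top_k, dim=-1)    # top-k routing
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize (assumed)
    out = torch.zeros_like(x)
    for k in range(top_k):
        e = idx[:, k]                                  # expert id per token
        gu = torch.einsum("tmd,td->tm", gate_up_proj[e], x)
        gate, up = gu.chunk(2, dim=-1)                 # gate/up order assumed
        h = F.silu(gate) * up                          # SwiGLU-style activation
        out += weights[:, k : k + 1] * torch.einsum("tdm,tm->td", down_proj[e], h)
    return out

torch.manual_seed(0)
y = batched_moe_forward(
    torch.randn(5, 8), torch.randn(4, 8),
    torch.randn(4, 12, 8), torch.randn(4, 8, 6), top_k=2,
)
```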

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the OLMoE architecture adapter.

prepare_model(hf_model: Any) None

Patch OLMoE’s in-place clamp_ to avoid backward hook conflicts.

Same issue as OLMo v1 — see OlmoArchitectureAdapter.prepare_model.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for OLMoE component testing.

OLMoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace OLMoE model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.OpenElmArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Apple OpenELM models.

OpenELM uses a unique architecture with per-layer varying head counts and FFN dimensions. Key characteristics:

  • Combined QKV projection (qkv_proj) with per-layer varying Q/KV head counts

  • Gated MLP with combined gate+up projection (proj_1) and per-layer FFN sizes

  • RMSNorm normalization

  • Full rotary embeddings (per-layer, not shared)

  • Optional Q/K RMSNorm (normalize_qk_projections=True)

  • Weight tying (share_input_output_layers=True typically)

  • Model root is ‘transformer’ (not ‘model’)

  • Requires trust_remote_code=True (custom HF code)

The native HF attention handles all per-layer dimension variations, RoPE, GQA group repeat, and Q/K normalization internally. The bridge delegates to the native forward for correct computation.

Note: Individual Q/K/V hooks are not available since the model uses a combined QKV projection. Attention-level hooks (hook_attn_in, hook_attn_out) are provided.

__init__(cfg: Any) None

Initialize the OpenELM architecture adapter.

prepare_loading(model_name: str, model_kwargs: dict) None

Patch OpenELM for compatibility with transformers v5.

Two patches are needed:

  1. RotaryEmbedding: Custom _compute_sin_cos_embeddings fails on meta device because it calls .cos() on meta tensors. We wrap it to catch NotImplementedError.

  2. Weight re-initialization: OpenELM’s _init_weights re-randomizes ALL weights after they’ve been loaded from safetensors, because transformers v5’s _finalize_load_state_dict calls initialize_weights() on modules lacking the _is_hf_initialized flag. We patch _init_weights to skip real (non-meta) tensors.

Parameters:
  • model_name – The HuggingFace model name/path

  • model_kwargs – The kwargs dict for from_pretrained()

prepare_model(hf_model: Any) None

Post-load fixes for non-persistent buffers zeroed during meta materialization.

Transformers v5 creates models on meta device then materializes weights from checkpoint. Non-persistent buffers (registered with persistent=False) are NOT in the checkpoint, so they materialize as zeros. OpenELM has two critical non-persistent buffers that must be recomputed:

  1. RoPE inv_freq — zeroed inv_freq produces cos=1, sin=0 for all positions, destroying positional information entirely.

  2. causal_mask — zeroed mask means no causal masking, allowing all positions to attend to future tokens. Single forward passes appear correct (no future tokens to leak) but autoregressive generation degenerates immediately.

We also create a synthetic lm_head for weight-tied models.

Note: We intentionally do NOT restore the original _compute_sin_cos_embeddings. The safe_compute wrapper is functionally equivalent for real (non-meta) tensors, and keeping it avoids issues when multiple models are loaded in the same process (e.g., benchmark suite loading both HF reference and bridge models).

Parameters:

hf_model – The loaded HuggingFace OpenELM model
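The two recomputations can be sketched generically (exact shapes, dtypes, and base values in OpenELM may differ; this only illustrates why zeroed buffers are destructive):

```python
import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies. A zeroed buffer would instead give
    # cos=1 / sin=0 at every position, erasing positional information.
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where position j > i, i.e. entries that must be masked out.
    # A zeroed (all-False) buffer masks nothing, so every position could
    # attend to future tokens.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

freqs = rope_inv_freq(8)
mask = causal_mask(4)
```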

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up references for OpenELM component testing.

Parameters:
  • hf_model – The HuggingFace OpenELM model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.OptArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OPT models.

__init__(cfg: Any) None

Initialize the OPT architecture adapter.

class transformer_lens.model_bridge.supported_architectures.Phi3ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Phi-3 models.

__init__(cfg: Any) None

Initialize the Phi-3 architecture adapter.

Parameters:

cfg – The configuration object.

prepare_loading(model_name: str, model_kwargs: dict) None

Patch cached Phi-3 remote code for transformers v5 compatibility.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Fold layer norms into joint QKV/gate_up projections.

Standard fold_ln can’t handle joint projections (shape mismatch on round-trip), so we scale the full joint weights directly.
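The direct scaling can be sketched as follows: a generic illustration of folding an RMSNorm scale into a fused projection, not the adapter’s actual code:

```python
import torch

def fold_rmsnorm_into_joint(w_joint: torch.Tensor, ln_w: torch.Tensor) -> torch.Tensor:
    # For a fused projection W applied to ln_w * rmsnorm(x), we have
    # W @ (ln_w * h) == (W * ln_w[None, :]) @ h, so the norm scale folds in
    # by scaling each input column of the joint weight.
    return w_joint * ln_w.unsqueeze(0)  # [out_dim, d_model] * [1, d_model]

torch.manual_seed(0)
w = torch.randn(10, 4)   # e.g. a joint QKV weight (out_dim covers Q, K, V)
ln_w = torch.randn(4)
h = torch.randn(4)       # stands in for rmsnorm(x)
folded = fold_rmsnorm_into_joint(w, ln_w)
```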

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Phi-3 component testing.

Parameters:
  • hf_model – The HuggingFace Phi-3 model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.PhiArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Phi models.

__init__(cfg: Any) None

Initialize the Phi architecture adapter.

Parameters:

cfg – The configuration object.

default_cfg: dict[str, Any] = {'use_fast': False}

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Phi component testing.

Phi uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Phi model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.PythiaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Pythia models.

__init__(cfg: Any) None

Initialize the Pythia architecture adapter.

Parameters:

cfg – The configuration object.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Pythia component testing.

Pythia uses RoPE (Rotary Position Embeddings) in the GPT-NeoX architecture. We need to set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Pythia model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

split_qkv_matrix(original_attention_component: Any) tuple[Linear, Linear, Linear]

Split the QKV matrix into separate linear transformations.

GPT-NeoX/Pythia uses an interleaved QKV format where the weights are stored as [Q_h0, K_h0, V_h0, Q_h1, K_h1, V_h1, …] - i.e., Q, K, V are interleaved per head.

The weight shape is [n_heads * 3 * d_head, d_model] and the output is reshaped by HuggingFace as [batch, seq, n_heads, 3*d_head] then split on the last dim.

Parameters:

original_attention_component – The original attention layer component

Returns:

Tuple of nn.Linear modules for Q, K, and V transformations
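The interleaved split can be sketched self-containedly (this mirrors the layout described above, not the adapter’s implementation):

```python
import torch
from torch import nn

def split_interleaved_qkv(qkv: nn.Linear, n_heads: int):
    # GPT-NeoX stores the fused weight as [Q_h0, K_h0, V_h0, Q_h1, ...]:
    # reshape to [n_heads, 3, d_head, d_model] and index the middle axis.
    d_model = qkv.weight.shape[1]
    d_head = qkv.weight.shape[0] // (3 * n_heads)
    w = qkv.weight.detach().reshape(n_heads, 3, d_head, d_model)
    b = qkv.bias.detach().reshape(n_heads, 3, d_head) if qkv.bias is not None else None
    layers = []
    for i in range(3):  # 0 = Q, 1 = K, 2 = V
        lin = nn.Linear(d_model, n_heads * d_head, bias=b is not None)
        with torch.no_grad():
            lin.weight.copy_(w[:, i].reshape(n_heads * d_head, d_model))
            if b is not None:
                lin.bias.copy_(b[:, i].reshape(-1))
        layers.append(lin)
    return tuple(layers)

# Check against the HF reshape-then-split convention described above:
torch.manual_seed(0)
n_heads, d_head, d_model = 2, 3, 6
qkv = nn.Linear(d_model, n_heads * 3 * d_head)
q, k, v = split_interleaved_qkv(qkv, n_heads)
x = torch.randn(1, 4, d_model)
qf, kf, vf = qkv(x).reshape(1, 4, n_heads, 3 * d_head).split(d_head, dim=-1)
```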

class transformer_lens.model_bridge.supported_architectures.Qwen2ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Qwen2 models.

Optional Parameters (may not exist in state_dict):

Qwen2 models do NOT have biases on any linear layers:

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP input (up_proj)

  • blocks.{i}.mlp.b_gate - No bias on MLP gate projection

  • blocks.{i}.mlp.b_out - No bias on MLP output (down_proj)

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

Weight processing must handle these missing biases gracefully using ProcessWeights._safe_get_tensor() or by checking for None values.
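The graceful handling might look like the following stand-in. The real helper is ProcessWeights._safe_get_tensor, whose exact signature may differ; this is only an illustration of the pattern:

```python
import torch

def safe_get_tensor(state_dict, key, default_shape=None):
    # Return the tensor if present; otherwise a zero tensor of the given
    # shape (so folding code can treat missing biases as zeros), or None.
    if key in state_dict:
        return state_dict[key]
    if default_shape is not None:
        return torch.zeros(default_shape)
    return None

sd = {"blocks.0.attn.b_Q": torch.ones(4)}
present = safe_get_tensor(sd, "blocks.0.attn.b_Q")
missing = safe_get_tensor(sd, "blocks.0.attn.b_K", default_shape=(4,))
```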

__init__(cfg: Any) None

Initialize the Qwen2 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Qwen2 component testing.

Qwen2 uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Qwen2 model instance

  • bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)

class transformer_lens.model_bridge.supported_architectures.Qwen3ArchitectureAdapter(cfg: Any, *, hybrid: bool = False)

Bases: ArchitectureAdapter

Architecture adapter for Qwen3 dense models.

RMSNorm, RoPE, GQA, Q/K head norms, gated MLP. No biases. Serves as base class for Qwen3.5 and Qwen3Next hybrid variants.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set eager attn on HF model and rotary_emb on attention bridges.

class transformer_lens.model_bridge.supported_architectures.Qwen3MoeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Qwen3MoE (Mixture of Experts) models.

Qwen3MoE is a sparse MoE decoder-only Transformer, structurally close to OLMoE. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Q/K normalization: RMSNorm applied to queries and keys after projection.

  • Sparse MoE: 128 experts with top-8 routing (public 30B-A3B checkpoints).

  • Batched expert parameters: gate_up_proj and down_proj as single 3D tensors, not a ModuleList.

  • final_rms=True (Qwen3-style; OLMoE uses False).

  • No biases on any projections.

  • GQA: n_key_value_heads < n_heads in all public checkpoints.

Only the all-MoE configuration is supported (decoder_sparse_step=1, mlp_only_layers=[]). Models with dense fallback layers cannot be wrapped because MoEBridge does not handle the dense Qwen3MoeMLP path.

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

__init__(cfg: Any) None

Initialize the Qwen3MoE architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for Qwen3MoE component testing.

Qwen3MoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Qwen3MoE model instance

  • bridge_model – The TransformerBridge model (if available)

class transformer_lens.model_bridge.supported_architectures.Qwen3NextArchitectureAdapter(cfg: Any)

Bases: Qwen3ArchitectureAdapter

Hybrid linear-attention + full-attention with sparse MoE MLP.

Same hybrid design as Qwen3.5 but with MoE instead of dense MLP.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Slice query half from gated q_proj.weight for weight-space analysis.

class transformer_lens.model_bridge.supported_architectures.Qwen3_5ArchitectureAdapter(cfg: Any)

Bases: Qwen3ArchitectureAdapter

Hybrid linear-attention + full-attention with dense gated MLP.

Inherits Qwen3 config/attention/MLP structure. Differences:

  • Attention + linear_attn are optional (per-layer type)

  • Gated q_proj (2x wide) sliced by preprocess_weights for weight analysis

prepare_loading(model_name: str, model_kwargs: dict) None

Swap multimodal Qwen3_5Config for text-only Qwen3_5TextConfig.

Published checkpoints carry architectures=[‘Qwen3_5ForConditionalGeneration’]. We replace config with text_config so AutoModelForCausalLM loads the text-only Qwen3_5ForCausalLM.

preprocess_weights(state_dict: dict[str, Tensor]) dict[str, Tensor]

Slice query half from gated q_proj.weight for weight-space analysis.

In processed mode, W_Q is the pure query projection (for composition scores, logit lens). Gate signal available in unprocessed mode on full-attention layers via blocks.N.attn.hook_q_gate.
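A minimal sketch of the slicing. The query-first row ordering is an assumption here; verify it against the actual checkpoint layout:

```python
import torch

def slice_query_half(q_proj_weight: torch.Tensor) -> torch.Tensor:
    # The gated q_proj is 2x wide; take the query half for weight-space
    # analysis. Assumes query rows come first and gate rows second.
    out_dim = q_proj_weight.shape[0]
    return q_proj_weight[: out_dim // 2]

w = torch.arange(24, dtype=torch.float32).reshape(8, 3)  # [2 * q_out, d_model]
w_q = slice_query_half(w)
```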

class transformer_lens.model_bridge.supported_architectures.QwenArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Qwen models.

__init__(cfg: Any) None

Initialize the Qwen architecture adapter.

class transformer_lens.model_bridge.supported_architectures.StableLmArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for StableLM models.

StableLM uses a Llama-like architecture with separate Q/K/V projections and gated MLP, but differs in using standard LayerNorm (not RMSNorm) and partial rotary embeddings (25% of head dimensions by default).

Supports optional features:

  • Grouped Query Attention (num_key_value_heads != num_attention_heads)

  • QKV bias (use_qkv_bias=True on some models like stable-code-3b)

  • Parallel residual connections (use_parallel_residual=True)

  • Per-head QK LayerNorm (qk_layernorm=True)

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - Only present when use_qkv_bias=True

  • blocks.{i}.attn.b_K - Only present when use_qkv_bias=True

  • blocks.{i}.attn.b_V - Only present when use_qkv_bias=True

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP up_proj

  • blocks.{i}.mlp.b_gate - No bias on MLP gate_proj

  • blocks.{i}.mlp.b_out - No bias on MLP down_proj

__init__(cfg: Any) None

Initialize the StableLM architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) None

Set up rotary embedding references for StableLM component testing.

StableLM uses RoPE (Rotary Position Embeddings) with partial rotation. We set the rotary_emb reference on all attention bridge instances and force eager attention for numerical consistency.

Parameters:
  • hf_model – The HuggingFace StableLM model instance

  • bridge_model – The TransformerBridge model (if available)

setup_hook_compatibility(bridge: Any) None

Inject hook points for QK LayerNorm on models with qk_layernorm=True.

StableLM v2 models (e.g., stablelm-2-12b) apply per-head LayerNorm to Q and K after projection but before rotary embedding. The native HF attention handles this internally, but we inject hooks so researchers can observe/intervene on the post-norm Q/K values.

Adds to each attention bridge:
  • hook_q_layernorm: fires after q_layernorm(query_states)

  • hook_k_layernorm: fires after k_layernorm(key_states)

This runs during bridge __init__ via _setup_hook_compatibility(), after component setup but before hook registry finalization. The hook registry scanner skips _original_component subtrees, so we register hooks directly in bridge._hook_registry with canonical TL-style names.

Parameters:

bridge – The TransformerBridge instance (fully initialized)
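What the injected hooks observe can be illustrated with per-head LayerNorm applied to projected queries (shapes and tensor layout are illustrative, not taken from the bridge):

```python
import torch
from torch import nn

# StableLM v2 applies LayerNorm over each head's d_head dimensions, after
# the Q/K projection and before rotary embedding. hook_q_layernorm would
# observe the normalized queries computed below.
batch, seq, n_heads, d_head = 2, 5, 4, 8
q_layernorm = nn.LayerNorm(d_head)
q = torch.randn(batch, seq, n_heads, d_head)
q_normed = q_layernorm(q)  # the value exposed at hook_q_layernorm
```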

class transformer_lens.model_bridge.supported_architectures.T5ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for T5 models.

T5 is an encoder-decoder model with:

  • Shared embeddings

  • Encoder stack (self-attention + FFN)

  • Decoder stack (self-attention + cross-attention + FFN)

  • Language modeling head

Supports both standard T5 (DenseReluDense with wi/wo) and gated variants like Flan-T5 (T5DenseGatedActDense with wi_0/wi_1/wo).

__init__(cfg: Any) None

Initialize the T5 architecture adapter.

Parameters:

cfg – The configuration object.

class transformer_lens.model_bridge.supported_architectures.XGLMArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for XGLM models.

XGLM uses pre-norm LayerNorm, sinusoidal positional embeddings (no learnable weights), standard MHA with separate q/k/v/out_proj, and a 2-layer MLP (fc1/fc2) that lives directly on the decoder block rather than inside an mlp sub-module.

All attention projections and fc1/fc2 carry biases. lm_head has no bias. Embeddings are scaled by sqrt(d_model) at runtime in XGLMScaledWordEmbedding.

Optional Parameters (may not exist in state_dict):

None — all published XGLM checkpoints include all parameters listed above.

__init__(cfg: Any) None

Initialize the XGLM architecture adapter.

setup_hook_compatibility(bridge: Any) None

Scale hook_embed by sqrt(d_model) to match XGLMScaledWordEmbedding.forward().

XGLMScaledWordEmbedding multiplies the embedding lookup by embed_scale = sqrt(d_model) at runtime. Without this override, hook_embed would capture the raw (unscaled) table output, diverging from actual model activations.
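The override can be sketched as a simple scaling hook (the hook signature follows TransformerLens conventions; this is illustrative, not the bridge’s actual code):

```python
import math
import torch

def make_embed_scale_hook(d_model: int):
    # Multiply the raw embedding lookup by sqrt(d_model), mirroring
    # XGLMScaledWordEmbedding's runtime embed_scale, so hook_embed matches
    # the activations the model actually computes.
    scale = math.sqrt(d_model)
    def scale_hook(embed, hook=None):
        return embed * scale
    return scale_hook

scaled = make_embed_scale_hook(1024)(torch.ones(1, 3, 1024))
```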