transformer_lens.model_bridge.supported_architectures.olmoe module

OLMoE (Mixture of Experts) architecture adapter.

class transformer_lens.model_bridge.supported_architectures.olmoe.OlmoeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMoE (Mixture of Experts) models.

OLMoE uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and a sparse Mixture-of-Experts (MoE) MLP. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Q/K normalization: RMSNorm applied to queries and keys after projection.

  • Sparse MoE: 64 experts with top-8 routing (configurable).

  • Batched expert parameters: gate_up_proj [num_experts, 2*d_mlp, d_model] and down_proj [num_experts, d_model, d_mlp] as single tensors, not a ModuleList.

  • Optional QKV clipping (handled by HF’s native attention forward).

  • No biases on any projections.
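The routing over batched expert parameters can be sketched as follows. This is a minimal illustration, not the adapter's implementation: the SwiGLU-style gate/up split and SiLU activation follow the OLMo family's convention, and renormalizing the top-k router weights is a configurable choice in OLMoE-style models.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, gate_up_proj, down_proj, top_k=8):
    """Sketch of a sparse MoE forward over batched expert weights.

    x:            [n_tokens, d_model]
    router_w:     [num_experts, d_model]   (assumed router layout)
    gate_up_proj: [num_experts, 2*d_mlp, d_model]  (single tensor, not a ModuleList)
    down_proj:    [num_experts, d_model, d_mlp]
    """
    num_experts = router_w.shape[0]
    probs = (x @ router_w.T).softmax(dim=-1)          # [n_tokens, num_experts]
    weights, idx = torch.topk(probs, top_k, dim=-1)   # each token picks top_k experts
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k

    out = torch.zeros_like(x)
    for e in range(num_experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)
        if rows.numel() == 0:
            continue
        h = x[rows] @ gate_up_proj[e].T               # [n, 2*d_mlp]
        gate, up = h.chunk(2, dim=-1)                 # SwiGLU-style split
        y = (F.silu(gate) * up) @ down_proj[e].T      # [n, d_model]
        out[rows] += weights[rows, slots].unsqueeze(-1) * y
    return out
```

Looping over experts and gathering the tokens routed to each keeps the batched weight tensors intact, which mirrors how the adapter exposes gate_up_proj and down_proj as single tensors.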

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias
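Since OLMoE stores none of these bias tensors, code that assumes their presence needs a fallback. A hypothetical helper below materializes zeros for the keys listed above; the key names follow the list, but the zero-fill policy and the shape choices are assumptions for illustration.

```python
import torch

def fill_missing_biases(state_dict, n_layers, d_model, n_heads, d_head):
    """Insert zero tensors for optional bias keys absent from a bias-free
    checkpoint (hypothetical helper; shapes are illustrative assumptions)."""
    per_block = {
        "attn.b_Q": (n_heads, d_head),
        "attn.b_K": (n_heads, d_head),
        "attn.b_V": (n_heads, d_head),
        "attn.b_O": (d_model,),
        "ln1.b": (d_model,),
        "ln2.b": (d_model,),
    }
    for i in range(n_layers):
        for suffix, shape in per_block.items():
            state_dict.setdefault(f"blocks.{i}.{suffix}", torch.zeros(shape))
    state_dict.setdefault("ln_final.b", torch.zeros(d_model))
    return state_dict
```

A zero bias is a no-op under addition, so downstream code can treat the filled state_dict exactly like one from a biased architecture.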

__init__(cfg: Any) → None

Initialize the OLMoE architecture adapter.

prepare_model(hf_model: Any) → None

Patch OLMoE’s in-place clamp_ to avoid backward hook conflicts.

Same issue as OLMo v1; see OlmoArchitectureAdapter.prepare_model.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set up rotary embedding references for OLMoE component testing.

OLMoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace OLMoE model instance

  • bridge_model – The TransformerBridge model (if available)
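The wiring this method performs can be sketched with stand-in objects. The attribute paths below (hf_model.model.rotary_emb, bridge_model.blocks[i].attn) are assumptions for illustration, not guaranteed to match the real class layout.

```python
from types import SimpleNamespace

def set_rotary_refs(hf_model, bridge_model):
    """Hypothetical sketch: point every attention bridge at the HF model's
    shared rotary embedding module so component tests can apply RoPE."""
    rotary = hf_model.model.rotary_emb        # assumed HF attribute path
    for block in bridge_model.blocks:
        block.attn.rotary_emb = rotary        # assumed bridge layout

# Usage with stand-in objects in place of real models:
hf = SimpleNamespace(model=SimpleNamespace(rotary_emb="rope-module"))
bridge = SimpleNamespace(blocks=[SimpleNamespace(attn=SimpleNamespace())])
set_rotary_refs(hf, bridge)
```

Sharing one rotary module by reference (rather than copying it per layer) matches how HF decoder models compute RoPE once per forward and reuse it across layers.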