transformer_lens.model_bridge.supported_architectures.olmoe module

OLMoE (Mixture of Experts) architecture adapter.

class transformer_lens.model_bridge.supported_architectures.olmoe.OlmoeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for OLMoE (Mixture of Experts) models.

OLMoE uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and a sparse Mixture-of-Experts (MoE) MLP. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Q/K normalization: RMSNorm applied to queries and keys after projection.

  • Sparse MoE: 64 experts with top-8 routing (configurable).

  • Batched expert parameters: gate_up_proj [num_experts, 2*d_mlp, d_model] and down_proj [num_experts, d_model, d_mlp] as single tensors, not a ModuleList.

  • Optional QKV clipping (handled by HF’s native attention forward).

  • No biases on any projections.
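The routing over batched expert parameters can be sketched as follows. This is a minimal illustration, not the adapter's implementation: the SwiGLU-style gate/up split and SiLU activation follow the OLMo family's convention, and renormalizing the top-k router weights is a configurable choice in OLMoE-style models.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, gate_up_proj, down_proj, top_k=8):
    """Sketch of a sparse MoE forward over batched expert weights.

    x:            [n_tokens, d_model]
    router_w:     [num_experts, d_model]   (assumed router layout)
    gate_up_proj: [num_experts, 2*d_mlp, d_model]  (single tensor, not a ModuleList)
    down_proj:    [num_experts, d_model, d_mlp]
    """
    num_experts = router_w.shape[0]
    probs = (x @ router_w.T).softmax(dim=-1)          # [n_tokens, num_experts]
    weights, idx = torch.topk(probs, top_k, dim=-1)   # each token picks top_k experts
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k

    out = torch.zeros_like(x)
    for e in range(num_experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)
        if rows.numel() == 0:
            continue
        h = x[rows] @ gate_up_proj[e].T               # [n, 2*d_mlp]
        gate, up = h.chunk(2, dim=-1)                 # SwiGLU-style split
        y = (F.silu(gate) * up) @ down_proj[e].T      # [n, d_model]
        out[rows] += weights[rows, slots].unsqueeze(-1) * y
    return out
```

Looping over experts and gathering the tokens routed to each keeps the batched weight tensors intact, which mirrors how the adapter exposes gate_up_proj and down_proj as single tensors.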

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias
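Since OLMoE stores none of these bias tensors, code that assumes their presence needs a fallback. A hypothetical helper below materializes zeros for the keys listed above; the key names follow the list, but the zero-fill policy and the shape choices are assumptions for illustration.

```python
import torch

def fill_missing_biases(state_dict, n_layers, d_model, n_heads, d_head):
    """Insert zero tensors for optional bias keys absent from a bias-free
    checkpoint (hypothetical helper; shapes are illustrative assumptions)."""
    per_block = {
        "attn.b_Q": (n_heads, d_head),
        "attn.b_K": (n_heads, d_head),
        "attn.b_V": (n_heads, d_head),
        "attn.b_O": (d_model,),
        "ln1.b": (d_model,),
        "ln2.b": (d_model,),
    }
    for i in range(n_layers):
        for suffix, shape in per_block.items():
            state_dict.setdefault(f"blocks.{i}.{suffix}", torch.zeros(shape))
    state_dict.setdefault("ln_final.b", torch.zeros(d_model))
    return state_dict
```

A zero bias is a no-op under addition, so downstream code can treat the filled state_dict exactly like one from a biased architecture.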

__init__(cfg: Any) → None

Initialize the OLMoE architecture adapter.

prepare_model(hf_model: Any) → None

Patch OLMoE’s in-place clamp_ to avoid backward hook conflicts.

Same issue as OLMo v1; see OlmoArchitectureAdapter.prepare_model.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set up rotary embedding references for OLMoE component testing.

OLMoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace OLMoE model instance

  • bridge_model – The TransformerBridge model (if available)
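The wiring this method performs can be sketched with stand-in objects. The attribute paths below (hf_model.model.rotary_emb, bridge_model.blocks[i].attn) are assumptions for illustration, not guaranteed to match the real class layout.

```python
from types import SimpleNamespace

def set_rotary_refs(hf_model, bridge_model):
    """Hypothetical sketch: point every attention bridge at the HF model's
    shared rotary embedding module so component tests can apply RoPE."""
    rotary = hf_model.model.rotary_emb        # assumed HF attribute path
    for block in bridge_model.blocks:
        block.attn.rotary_emb = rotary        # assumed bridge layout

# Usage with stand-in objects in place of real models:
hf = SimpleNamespace(model=SimpleNamespace(rotary_emb="rope-module"))
bridge = SimpleNamespace(blocks=[SimpleNamespace(attn=SimpleNamespace())])
set_rotary_refs(hf, bridge)
```

Sharing one rotary module by reference (rather than copying it per layer) matches how HF decoder models compute RoPE once per forward and reuse it across layers.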