transformer_lens.model_bridge.supported_architectures.olmoe module¶
OLMoE (Mixture of Experts) architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.olmoe.OlmoeArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for OLMoE (Mixture of Experts) models.
OLMoE uses a pre-norm architecture with RMSNorm, Q/K normalization in attention, rotary position embeddings (RoPE), and a sparse Mixture of Experts (MoE) MLP. Key features:
Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.
Q/K normalization: RMSNorm applied to queries and keys after projection.
Sparse MoE: 64 experts with top-8 routing (configurable).
Batched expert parameters: gate_up_proj [num_experts, 2*d_mlp, d_model] and down_proj [num_experts, d_model, d_mlp] as single tensors, not a ModuleList.
Optional QKV clipping (handled by HF’s native attention forward).
No biases on any projections.
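The batched-expert layout and top-k routing described above can be sketched as follows. This is a minimal illustration, not OLMoE's actual forward pass: the dimensions are toy values (real OLMoE uses 64 experts with top-8 routing), and the assumption that the fused gate_up projection stores the gate half first is illustrative.

```python
import torch

# Toy dimensions for illustration (OLMoE uses num_experts=64, top_k=8).
num_experts, top_k = 4, 2
d_model, d_mlp = 8, 16

# Batched expert parameters as single tensors, mirroring the documented
# [num_experts, 2*d_mlp, d_model] and [num_experts, d_model, d_mlp] layout.
gate_up_proj = torch.randn(num_experts, 2 * d_mlp, d_model)
down_proj = torch.randn(num_experts, d_model, d_mlp)
router = torch.randn(num_experts, d_model)  # router weight, no bias

x = torch.randn(3, d_model)  # 3 tokens

# Route each token to its top-k experts; normalize the selected scores.
scores = x @ router.T                           # [tokens, num_experts]
top_vals, top_idx = scores.topk(top_k, dim=-1)  # [tokens, top_k]
weights = top_vals.softmax(dim=-1)

out = torch.zeros_like(x)
for t in range(x.shape[0]):
    for slot in range(top_k):
        e = top_idx[t, slot]
        gate_up = gate_up_proj[e] @ x[t]        # [2*d_mlp]
        gate, up = gate_up.chunk(2)             # assumed order: gate, then up
        h = torch.nn.functional.silu(gate) * up # SwiGLU-style activation
        out[t] += weights[t, slot] * (down_proj[e] @ h)

print(out.shape)  # torch.Size([3, 8])
```

Because the expert weights are plain batched tensors rather than a ModuleList, indexing `gate_up_proj[e]` selects one expert's weight matrix directly.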
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
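Code consuming an OLMoE state_dict should therefore not assume these keys exist. A minimal sketch of defensive access, with hypothetical key names and toy shapes (n_heads=2, d_model=4, d_head=3):

```python
import torch

# Hypothetical partial state_dict: OLMoE ships no attention biases and no
# RMSNorm biases, so the b_* and ln*.b keys are simply absent.
state_dict = {
    "blocks.0.attn.W_Q": torch.zeros(2, 4, 3),  # [n_heads, d_model, d_head]
    "blocks.0.ln1.w": torch.ones(4),
}

# Fall back gracefully when an optional parameter is missing.
W_Q = state_dict["blocks.0.attn.W_Q"]
b_Q = state_dict.get("blocks.0.attn.b_Q")
if b_Q is None:
    # Substitute a zero bias of shape [n_heads, d_head].
    b_Q = torch.zeros(W_Q.shape[0], W_Q.shape[2])

print(b_Q.shape)  # torch.Size([2, 3])
```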
- __init__(cfg: Any) → None¶
Initialize the OLMoE architecture adapter.
- prepare_model(hf_model: Any) → None¶
Patch OLMoE’s in-place clamp_ to avoid backward hook conflicts.
Same issue as OLMo v1 — see OlmoArchitectureAdapter.prepare_model.
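The general patching pattern can be sketched as below. This is a toy illustration of replacing an in-place `clamp_` with out-of-place `clamp`, not the adapter's actual patch (the class and function names here are hypothetical):

```python
import torch

class ClampBlock(torch.nn.Module):
    """Toy module that clamps activations in place, mimicking the pattern
    that conflicts with backward hooks."""
    def forward(self, x):
        return x.clamp_(-1.0, 1.0)  # in-place mutation of the input

def patch_clamp(module):
    # Swap in an out-of-place clamp so autograd bookkeeping and backward
    # hooks see an unmodified input tensor.
    def forward(x):
        return x.clamp(-1.0, 1.0)
    module.forward = forward

block = ClampBlock()
patch_clamp(block)

x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
y = block(x)
print(y.tolist())  # [-1.0, 0.5, 1.0]; x itself is left untouched
```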
- setup_component_testing(hf_model: Any, bridge_model: Any = None) → None¶
Set up rotary embedding references for OLMoE component testing.
OLMoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace OLMoE model instance
bridge_model – The TransformerBridge model (if available)
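For readers unfamiliar with RoPE, a minimal sketch of a rotary embedding in the rotate-half convention follows. This is illustrative only; the actual rotary module used for component testing is the one taken from the HuggingFace model, and the `rope` helper here is hypothetical.

```python
import torch

def rope(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape [seq, d_head]."""
    d = x.shape[-1]
    # One inverse frequency per dimension pair.
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)  # [d/2]
    angles = pos[:, None] * inv_freq[None, :]                # [seq, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x = torch.randn(5, 8)
pos = torch.arange(5).float()
out = rope(x, pos)
print(out.shape)  # torch.Size([5, 8]); position 0 is left unrotated
```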