transformer_lens.model_bridge.supported_architectures.qwen3_moe module¶
Qwen3MoE (Mixture of Experts) architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.qwen3_moe.Qwen3MoeArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Qwen3MoE (Mixture of Experts) models.
Qwen3MoE is a sparse MoE decoder-only Transformer, structurally close to OLMoE. Key features:
Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.
Q/K normalization: RMSNorm applied to queries and keys after projection.
Sparse MoE: 128 experts with top-8 routing (public 30B-A3B checkpoints).
Batched expert parameters: gate_up_proj and down_proj as single 3D tensors, not a ModuleList.
final_rms=True (Qwen3-style; OLMoE uses False).
No biases on any projections.
GQA: n_key_value_heads < n_heads in all public checkpoints.
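The batched-expert, top-k routing described above can be sketched as follows. This is an illustrative sketch, not the library's implementation: the per-token loop, the SwiGLU gate/up split, and the unnormalized softmax weights are assumptions made for clarity.

```python
# Illustrative top-k MoE forward pass with batched 3D expert tensors
# (gate_up_proj: [n_experts, d_model, 2*d_ff], down_proj: [n_experts, d_ff, d_model]).
# Hypothetical sketch; the real adapter routes tokens without a Python loop.
import torch

def moe_forward(x, router_w, gate_up_proj, down_proj, top_k=8):
    """x: [tokens, d_model]; router_w: [d_model, n_experts]."""
    logits = x @ router_w                               # [tokens, n_experts]
    weights, experts = logits.softmax(-1).topk(top_k)   # both [tokens, top_k]
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], experts[t]):
            gate_up = x[t] @ gate_up_proj[e]            # [2*d_ff], one expert's slice
            gate, up = gate_up.chunk(2)                 # SwiGLU: gate and up halves
            out[t] += w * (torch.nn.functional.silu(gate) * up) @ down_proj[e]
    return out
```

Because the experts live in single 3D tensors rather than a ModuleList, selecting expert `e` is plain tensor indexing (`gate_up_proj[e]`) instead of a submodule lookup.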
Only the all-MoE configuration is supported (decoder_sparse_step=1, mlp_only_layers=[]). Models with dense fallback layers cannot be wrapped because MoEBridge does not handle the dense Qwen3MoeMLP path.
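The supported-configuration constraint above can be expressed as a small predicate. This is a hedged sketch of the check implied by the text; the attribute names `decoder_sparse_step` and `mlp_only_layers` come from the HuggingFace Qwen3MoE config, but the helper itself is hypothetical.

```python
# Hypothetical helper: accept only the all-MoE configuration described above.
# Models with dense fallback layers (sparse_step > 1 or non-empty
# mlp_only_layers) cannot be wrapped by MoEBridge.
def supports_all_moe(hf_config) -> bool:
    return (
        getattr(hf_config, "decoder_sparse_step", 1) == 1
        and not getattr(hf_config, "mlp_only_layers", [])
    )
```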
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
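Since these keys may be absent from the state_dict, downstream code that assumes they exist can fill them with zeros. The helper below is a hypothetical sketch, not library code; the key names follow the list above, and the shapes are assumptions.

```python
# Hypothetical sketch: fill absent optional bias keys with zeros so later code
# can index them unconditionally. Shapes assumed; with GQA, b_K/b_V would use
# n_key_value_heads rather than n_heads.
import torch

def fill_missing_biases(state_dict, n_layers, n_heads, d_head, d_model):
    for i in range(n_layers):
        for name, shape in [
            (f"blocks.{i}.attn.b_Q", (n_heads, d_head)),
            (f"blocks.{i}.attn.b_K", (n_heads, d_head)),
            (f"blocks.{i}.attn.b_V", (n_heads, d_head)),
            (f"blocks.{i}.attn.b_O", (d_model,)),
        ]:
            state_dict.setdefault(name, torch.zeros(shape))
    return state_dict
```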
- __init__(cfg: Any) → None¶
Initialize the Qwen3MoE architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) → None¶
Set up rotary embedding references for Qwen3MoE component testing.
Qwen3MoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Qwen3MoE model instance
bridge_model – The TransformerBridge model (if available)
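The reference-sharing pattern this method describes, one rotary-embedding module whose cos/sin tables every attention block points at, can be sketched minimally. All class and attribute names below are hypothetical stand-ins, not the bridge's actual classes.

```python
# Minimal sketch of sharing one RoPE module across attention blocks.
# RotaryEmbedding and AttentionBridge are hypothetical stand-ins.
import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, d_head, base=10000.0):
        super().__init__()
        # Standard RoPE inverse frequencies, one per rotated coordinate pair.
        inv_freq = 1.0 / base ** (torch.arange(0, d_head, 2) / d_head)
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, positions):
        freqs = positions[:, None] * self.inv_freq[None, :]
        return freqs.cos(), freqs.sin()

class AttentionBridge(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rotary_emb = None  # filled in during setup, shared across blocks

rot = RotaryEmbedding(d_head=8)
blocks = [AttentionBridge() for _ in range(4)]
for b in blocks:
    b.rotary_emb = rot  # every block references the same table generator
```

Sharing a single module (rather than one per block) keeps the cos/sin tables identical across layers and lets component tests compare bridge attention against HuggingFace attention with the same rotary inputs.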