transformer_lens.model_bridge.supported_architectures.qwen3_moe module¶
Qwen3MoE (Mixture of Experts) architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.qwen3_moe.Qwen3MoeArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Qwen3MoE (Mixture of Experts) models.
Qwen3MoE is a sparse MoE decoder-only Transformer, structurally close to OLMoE. Key features:
Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.
Q/K normalization: RMSNorm applied to queries and keys after projection.
Sparse MoE: 128 experts with top-8 routing (public 30B-A3B checkpoints).
Batched expert parameters: gate_up_proj and down_proj as single 3D tensors, not a ModuleList.
final_rms=True (Qwen3-style; OLMoE uses False).
No biases on any projections.
GQA: n_key_value_heads < n_heads in all public checkpoints.
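The batched-expert, top-k routing described above can be sketched as follows. This is an illustrative sketch, not the library's implementation: the per-token loop, the SwiGLU gate/up split, and the unnormalized softmax weights are assumptions made for clarity.

```python
# Illustrative top-k MoE forward pass with batched 3D expert tensors
# (gate_up_proj: [n_experts, d_model, 2*d_ff], down_proj: [n_experts, d_ff, d_model]).
# Hypothetical sketch; the real adapter routes tokens without a Python loop.
import torch

def moe_forward(x, router_w, gate_up_proj, down_proj, top_k=8):
    """x: [tokens, d_model]; router_w: [d_model, n_experts]."""
    logits = x @ router_w                               # [tokens, n_experts]
    weights, experts = logits.softmax(-1).topk(top_k)   # both [tokens, top_k]
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], experts[t]):
            gate_up = x[t] @ gate_up_proj[e]            # [2*d_ff], one expert's slice
            gate, up = gate_up.chunk(2)                 # SwiGLU: gate and up halves
            out[t] += w * (torch.nn.functional.silu(gate) * up) @ down_proj[e]
    return out
```

Because the experts live in single 3D tensors rather than a ModuleList, selecting expert `e` is plain tensor indexing (`gate_up_proj[e]`) instead of a submodule lookup.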
Only the all-MoE configuration is supported (decoder_sparse_step=1, mlp_only_layers=[]). Models with dense fallback layers cannot be wrapped because MoEBridge does not handle the dense Qwen3MoeMLP path.
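The supported-configuration constraint above can be expressed as a small predicate. This is a hedged sketch of the check implied by the text; the attribute names `decoder_sparse_step` and `mlp_only_layers` come from the HuggingFace Qwen3MoE config, but the helper itself is hypothetical.

```python
# Hypothetical helper: accept only the all-MoE configuration described above.
# Models with dense fallback layers (sparse_step > 1 or non-empty
# mlp_only_layers) cannot be wrapped by MoEBridge.
def supports_all_moe(hf_config) -> bool:
    return (
        getattr(hf_config, "decoder_sparse_step", 1) == 1
        and not getattr(hf_config, "mlp_only_layers", [])
    )
```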
Optional Parameters (may not exist in state_dict):¶
blocks.{i}.attn.b_Q - No bias on query projection
blocks.{i}.attn.b_K - No bias on key projection
blocks.{i}.attn.b_V - No bias on value projection
blocks.{i}.attn.b_O - No bias on output projection
blocks.{i}.ln1.b - RMSNorm has no bias
blocks.{i}.ln2.b - RMSNorm has no bias
ln_final.b - RMSNorm has no bias
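Since these keys may be absent from the state_dict, downstream code that assumes they exist can fill them with zeros. The helper below is a hypothetical sketch, not library code; the key names follow the list above, and the shapes are assumptions.

```python
# Hypothetical sketch: fill absent optional bias keys with zeros so later code
# can index them unconditionally. Shapes assumed; with GQA, b_K/b_V would use
# n_key_value_heads rather than n_heads.
import torch

def fill_missing_biases(state_dict, n_layers, n_heads, d_head, d_model):
    for i in range(n_layers):
        for name, shape in [
            (f"blocks.{i}.attn.b_Q", (n_heads, d_head)),
            (f"blocks.{i}.attn.b_K", (n_heads, d_head)),
            (f"blocks.{i}.attn.b_V", (n_heads, d_head)),
            (f"blocks.{i}.attn.b_O", (d_model,)),
        ]:
            state_dict.setdefault(name, torch.zeros(shape))
    return state_dict
```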
- __init__(cfg: Any) → None¶
Initialize the Qwen3MoE architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) → None¶
Set up rotary embedding references for Qwen3MoE component testing.
Qwen3MoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace Qwen3MoE model instance
bridge_model – The TransformerBridge model (if available)
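The reference-sharing pattern this method describes, one rotary-embedding module whose cos/sin tables every attention block points at, can be sketched minimally. All class and attribute names below are hypothetical stand-ins, not the bridge's actual classes.

```python
# Minimal sketch of sharing one RoPE module across attention blocks.
# RotaryEmbedding and AttentionBridge are hypothetical stand-ins.
import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, d_head, base=10000.0):
        super().__init__()
        # Standard RoPE inverse frequencies, one per rotated coordinate pair.
        inv_freq = 1.0 / base ** (torch.arange(0, d_head, 2) / d_head)
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, positions):
        freqs = positions[:, None] * self.inv_freq[None, :]
        return freqs.cos(), freqs.sin()

class AttentionBridge(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rotary_emb = None  # filled in during setup, shared across blocks

rot = RotaryEmbedding(d_head=8)
blocks = [AttentionBridge() for _ in range(4)]
for b in blocks:
    b.rotary_emb = rot  # every block references the same table generator
```

Sharing a single module (rather than one per block) keeps the cos/sin tables identical across layers and lets component tests compare bridge attention against HuggingFace attention with the same rotary inputs.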