transformer_lens.model_bridge.supported_architectures.qwen3_moe module

Qwen3MoE (Mixture of Experts) architecture adapter.

class transformer_lens.model_bridge.supported_architectures.qwen3_moe.Qwen3MoeArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Qwen3MoE (Mixture of Experts) models.

Qwen3MoE is a sparse MoE decoder-only Transformer, structurally close to OLMoE. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Q/K normalization: RMSNorm applied to queries and keys after projection.

  • Sparse MoE: 128 experts with top-8 routing (as in the public 30B-A3B checkpoints).

  • Batched expert parameters: gate_up_proj and down_proj are stored as single 3D tensors rather than a ModuleList of per-expert modules.

  • final_rms=True (Qwen3-style; OLMoE uses False).

  • No biases on any projections.

  • GQA: n_key_value_heads < n_heads in all public checkpoints.

Only the all-MoE configuration is supported (decoder_sparse_step=1, mlp_only_layers=[]). Models with dense fallback layers cannot be wrapped because MoEBridge does not handle the dense Qwen3MoeMLP path.
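The top-k routing described above can be sketched in plain Python. This is a minimal, hypothetical illustration of the routing math only (router logits pick the top experts, routing weights are softmaxed over the selected subset, outputs are summed); the expert callables stand in for slices of the batched 3D gate_up_proj / down_proj tensors and nothing here reflects the actual MoEBridge implementation:

```python
import math

def _softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, router_logits, expert_fns, k=2):
    """Route one token's hidden vector x to its top-k experts.

    expert_fns: one callable per expert, standing in for slices of the
    batched 3D expert parameter tensors. Qwen3MoE uses k=8 over 128 experts.
    """
    # Pick the k experts with the highest router logits.
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    # Normalize routing weights over the selected experts only.
    weights = _softmax([router_logits[i] for i in topk])
    # Weighted sum of the selected experts' outputs.
    out = [0.0] * len(x)
    for w, i in zip(weights, topk):
        y = expert_fns[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, topk
```

For example, with four toy experts that scale their input by 1x–4x and logits favoring experts 1 and 2, the output is the softmax-weighted blend of those two experts' results.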

Optional Parameters (may not exist in state_dict):

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias
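Code that consumes a converted state_dict should therefore treat these keys as absent rather than required. A minimal sketch of such a check, using a hypothetical toy state_dict (the key templates match the list above; the helper itself is illustrative, not part of the adapter):

```python
# Per-block keys that Qwen3MoE never provides (no projection or norm biases).
OPTIONAL_BIAS_KEYS = [
    "blocks.{i}.attn.b_Q",
    "blocks.{i}.attn.b_K",
    "blocks.{i}.attn.b_V",
    "blocks.{i}.attn.b_O",
    "blocks.{i}.ln1.b",
    "blocks.{i}.ln2.b",
]

def missing_optional_keys(state_dict, n_layers):
    """Return every optional key absent from state_dict, including ln_final.b."""
    missing = []
    for i in range(n_layers):
        for tmpl in OPTIONAL_BIAS_KEYS:
            key = tmpl.format(i=i)
            if key not in state_dict:
                missing.append(key)
    if "ln_final.b" not in state_dict:
        missing.append("ln_final.b")
    return missing
```

For a Qwen3MoE checkpoint, all 6 * n_layers + 1 of these keys are expected to be reported missing.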

__init__(cfg: Any) → None

Initialize the Qwen3MoE architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set up rotary embedding references for Qwen3MoE component testing.

Qwen3MoE uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Qwen3MoE model instance

  • bridge_model – The TransformerBridge model (if available)