transformer_lens.model_bridge.supported_architectures.mixtral module

Mixtral architecture adapter.

class transformer_lens.model_bridge.supported_architectures.mixtral.MixtralArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Mixtral models.

Mixtral uses a pre-norm architecture with RMSNorm, rotary position embeddings (RoPE), and a sparse Mixture-of-Experts (MoE) MLP. Key features:

  • Pre-norm: RMSNorm applied BEFORE attention and BEFORE MLP.

  • Rotary embeddings: stored at model.rotary_emb and passed per-forward-call.

  • Sparse MoE: batched expert parameters (gate_up_proj, down_proj as 3D tensors).

  • MixtralAttention.forward() requires position_embeddings and attention_mask args.

  • Optional GQA (n_key_value_heads may differ from n_heads).
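The sparse-MoE forward pass summarized above can be sketched in NumPy. This is an illustrative sketch only: the top-2 routing, the SiLU-gated expert MLP, and all tensor shapes are assumptions for demonstration, not the exact HuggingFace Mixtral implementation. It does, however, use batched 3D expert weights named after the gate_up_proj / down_proj parameters mentioned in the bullet list.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, router_w, gate_up_proj, down_proj, top_k=2):
    """Sketch of a sparse top-k MoE MLP with batched expert parameters.

    x:            (tokens, d_model)
    router_w:     (d_model, n_experts)
    gate_up_proj: (n_experts, d_model, 2 * d_ff)  -- 3D batched expert weights
    down_proj:    (n_experts, d_ff, d_model)      -- 3D batched expert weights
    """
    n_experts = router_w.shape[1]
    logits = x @ router_w                          # (tokens, n_experts)
    topk = np.argsort(-logits, axis=1)[:, :top_k]  # indices of selected experts
    # Renormalize router probabilities over the selected experts only.
    sel = np.take_along_axis(logits, topk, axis=1)
    weights = softmax(sel, axis=1)                 # (tokens, top_k)

    out = np.zeros_like(x)
    for e in range(n_experts):
        # Find which tokens routed to expert e, and in which top-k slot.
        token_idx, slot_idx = np.nonzero(topk == e)
        if token_idx.size == 0:
            continue
        h = x[token_idx] @ gate_up_proj[e]         # (n, 2 * d_ff)
        gate, up = np.split(h, 2, axis=-1)
        act = (gate / (1.0 + np.exp(-gate))) * up  # SiLU-gated (assumed)
        y = act @ down_proj[e]                     # (n, d_model)
        out[token_idx] += weights[token_idx, slot_idx, None] * y
    return out
```

Because each expert's weights live along the leading axis of a single 3D tensor, indexing `gate_up_proj[e]` selects one expert without any per-expert submodules.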

__init__(cfg: Any) → None

Initialize the Mixtral architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set up rotary embedding references for Mixtral component testing.

Mixtral uses RoPE (Rotary Position Embeddings). We set the rotary_emb reference on all attention bridge instances for component testing.

Parameters:
  • hf_model – The HuggingFace Mixtral model instance

  • bridge_model – The TransformerBridge model (if available)
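Conceptually, this setup step shares one rotary-embedding object across every attention bridge so each can be tested in isolation. The sketch below illustrates that reference-sharing pattern only; the class names (DummyRotary, DummyAttentionBridge) and the helper are invented stand-ins, not the actual TransformerLens API.

```python
class DummyRotary:
    """Stand-in for a rotary embedding module (e.g. model.rotary_emb)."""
    def __init__(self, dim):
        self.dim = dim

class DummyAttentionBridge:
    """Stand-in for an attention bridge that needs a rotary_emb reference."""
    def __init__(self):
        self.rotary_emb = None

def setup_rotary_refs(rotary_emb, attention_bridges):
    # Point every bridge at the SAME rotary module (a shared reference,
    # not a copy), mirroring how position embeddings are computed once
    # and passed to each attention layer per forward call.
    for bridge in attention_bridges:
        bridge.rotary_emb = rotary_emb

bridges = [DummyAttentionBridge() for _ in range(3)]
rope = DummyRotary(dim=128)
setup_rotary_refs(rope, bridges)
assert all(b.rotary_emb is rope for b in bridges)
```

Sharing a single reference (rather than copying the module per layer) keeps the rotary cache consistent across layers during component testing.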