transformer_lens.pretrained.weight_conversions.olmo3 module

Weight conversion functions for OLMo 3/3.1 models.

OLMo 3/3.1 architecture features:

- Q/K normalization (RMSNorm applied to queries and keys before attention)
- Grouped Query Attention (GQA) with n_key_value_heads < n_heads
- Mixed sliding-window and full attention layers (selected via layer_types)
- RMSNorm throughout (without the +1 weight modification used by Gemma)
- Rotary Position Embeddings (RoPE) with YaRN scaling
- Gated MLP (SwiGLU-style)
- Post-normalization pattern (RMSNorm after attention and MLP)
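The GQA point above is the main shape bookkeeping a weight conversion has to handle: the K/V projections have fewer heads than the Q projection. A minimal sketch of that per-head split, using illustrative dimensions and numpy in place of the actual transformer_lens/PyTorch code:

```python
import numpy as np

# Hypothetical OLMo 3-like dimensions (illustrative only).
d_model = 16
n_heads = 4
n_key_value_heads = 2   # GQA: fewer K/V heads than query heads
d_head = d_model // n_heads

# Hugging Face stores projections as flat 2-D matrices of shape
# [out_features, in_features]; a TransformerLens-style layout wants
# per-head tensors of shape [n, d_model, d_head].
hf_w_q = np.random.randn(n_heads * d_head, d_model)
hf_w_k = np.random.randn(n_key_value_heads * d_head, d_model)

def split_heads(w: np.ndarray, n: int, d_head: int) -> np.ndarray:
    """Rearrange [n * d_head, d_model] -> [n, d_model, d_head]."""
    return w.reshape(n, d_head, -1).transpose(0, 2, 1)

W_Q = split_heads(hf_w_q, n_heads, d_head)
W_K = split_heads(hf_w_k, n_key_value_heads, d_head)

print(W_Q.shape)  # (4, 16, 4)
print(W_K.shape)  # (2, 16, 4) -- fewer K heads under GQA
```

During attention the two K/V heads are shared across the four query heads; the conversion only has to preserve the per-head layout, not replicate them.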

transformer_lens.pretrained.weight_conversions.olmo3.convert_olmo3_weights(olmo3, cfg: HookedTransformerConfig)
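The Q/K normalization listed among the architecture features can be sketched in isolation. This is a plain RMSNorm with no +1 weight offset (unlike Gemma's variant); the function name and numpy implementation are illustrative, not the transformer_lens API:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard RMSNorm: rescale by the root-mean-square of the last axis.

    The learned weight is applied directly (no ``weight + 1`` offset,
    in contrast to Gemma's RMSNorm variant).
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

d_head = 4
q = np.array([[3.0, -1.0, 2.0, 0.5]])       # one query vector
q_normed = rms_norm(q, np.ones(d_head))     # unit weight for illustration

# With unit weight, each normalized row has RMS ~= 1.
print(np.sqrt(np.mean(q_normed ** 2, axis=-1)))
```

In OLMo 3 this normalization is applied to queries and keys per head before rotary embeddings and the attention score computation, which is why the conversion must also carry over the q_norm/k_norm weights.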