transformer_lens.model_bridge.supported_architectures.gemma3 module¶
Gemma3 architecture adapter.
- class transformer_lens.model_bridge.supported_architectures.gemma3.Gemma3ArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for Gemma3 models.
- __init__(cfg: Any) → None¶
Initialize the Gemma3 architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) → None¶
Set up rotary embedding references and native autograd for Gemma-3 component testing.
Gemma-3 uses dual RoPE (global + local). We set local RoPE (used by 85% of layers) on all attention bridge instances for component testing.
We also enable use_native_layernorm_autograd on all normalization bridges to ensure they delegate to HuggingFace’s exact implementation instead of using manual computation.
Additionally, we force the HF model to use “eager” attention to match the bridge’s implementation. The bridge uses “eager” to support output_attentions for hooks, while HF defaults to “sdpa”. These produce mathematically equivalent results but with small numerical differences due to different implementations.
Note: Layers 5, 11, 17, 23 use global RoPE but will use local in component tests. This is an acceptable tradeoff given the shared-instance constraint.
- Parameters:
hf_model – The HuggingFace Gemma-3 model instance
bridge_model – The TransformerBridge model (if available, set rotary_emb on actual instances)
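The eager-vs-sdpa equivalence mentioned above can be illustrated in plain PyTorch. This is a standalone sketch, not the adapter's code: the tensor shapes and names are invented for the demonstration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy tensors with shape (batch, heads, seq, head_dim).
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# "Eager" attention: explicit softmax(Q K^T / sqrt(d)) V. Because the
# attention weights are materialized, they can be exposed to hooks
# (this is why the bridge forces eager to support output_attentions).
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
weights = scores.softmax(dim=-1)
eager_out = weights @ v

# "sdpa" attention: fused kernel; faster, but no weights returned.
sdpa_out = F.scaled_dot_product_attention(q, k, v)

# Mathematically equivalent, up to small floating-point differences.
print(torch.allclose(eager_out, sdpa_out, atol=1e-6))
```

The two paths compute the same function; only kernel-level floating-point ordering differs, which is why component tests compare with a tolerance rather than exact equality.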
- setup_hook_compatibility(bridge: Any) → None¶
Setup hook compatibility for Gemma3 models.
Unlike Gemma1/Gemma2, Gemma3 uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) INSIDE the embedding layer’s forward(). Therefore we do NOT need a hook_conversion — the embed.hook_out already captures the scaled output. Adding a conversion would double-scale.
(Gemma1/Gemma2 scale in GemmaModel.forward() AFTER the embedding layer, so their adapters correctly use EmbeddingScaleConversion to match HookedTransformer's scaled embedding output.)
- Parameters:
bridge – The TransformerBridge instance
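The scale-inside-forward behavior described above can be sketched with a stand-in module (not the actual Gemma3TextScaledWordEmbedding class; the names here are illustrative): when scaling happens inside the embedding's forward(), a forward hook on its output already sees scaled values, so an extra conversion would double-scale.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab = 16, 32

class ScaledEmbedding(nn.Embedding):
    """Stand-in for an embedding that scales by sqrt(d_model) INSIDE forward()."""
    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return super().forward(ids) * math.sqrt(self.embedding_dim)

embed = ScaledEmbedding(vocab, d_model)

# Capture the embedding's output, analogous to embed.hook_out.
captured = {}
embed.register_forward_hook(lambda mod, args, out: captured.update(out=out))

ids = torch.tensor([[1, 2, 3]])
out = embed(ids)

# Calling the parent class's forward directly bypasses the subclass
# scaling (and hooks), giving the raw table lookup for comparison.
raw = nn.Embedding.forward(embed, ids)

# The hook already saw the scaled output; no conversion is needed.
print(torch.allclose(captured["out"], raw * math.sqrt(d_model)))
```

By contrast, a model that scales after the embedding layer (as Gemma1/Gemma2 do in GemmaModel.forward()) would leave the hook seeing `raw`, which is why those adapters apply a scale conversion and this one must not.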