transformer_lens.model_bridge.supported_architectures.gemma3_multimodal module¶
Gemma3 Multimodal architecture adapter.
This adapter supports Gemma3ForConditionalGeneration, the vision-language variant of Gemma 3 used by models like MedGemma.
- class transformer_lens.model_bridge.supported_architectures.gemma3_multimodal.Gemma3MultimodalArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter

Architecture adapter for Gemma3 multimodal models (Gemma3ForConditionalGeneration).
This adapter handles vision-language models like Gemma 3 4B/12B/27B and MedGemma. The model structure is:

- model.vision_tower: SigLIP vision encoder
- model.multi_modal_projector: Projects vision embeddings to language space
- model.language_model: Gemma3TextModel (same as text-only Gemma 3)
- lm_head: Output projection
The language model component follows the same patterns as Gemma3ArchitectureAdapter.
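As a rough illustration of the component layout listed above, the sketch below records the dotted attribute paths and walks them generically. The dataclass fields mirror the structure named in the docstring; the paths and helper are illustrative, not a verified part of the adapter's API.

```python
from dataclasses import dataclass


@dataclass
class Gemma3MultimodalLayout:
    """Dotted paths to the submodules described in the docstring (assumed names)."""
    vision_tower: str = "model.vision_tower"                    # SigLIP vision encoder
    multi_modal_projector: str = "model.multi_modal_projector"  # vision -> language space
    language_model: str = "model.language_model"                # Gemma3TextModel
    lm_head: str = "lm_head"                                    # output projection


def resolve(model, dotted_path):
    """Walk a dotted attribute path on a model object and return the submodule."""
    obj = model
    for part in dotted_path.split("."):
        obj = getattr(obj, part)
    return obj
```

A helper like this is only a convenience for exploring the nesting; the adapter itself maps these components into the bridge's own naming scheme.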
- __init__(cfg: Any) → None¶
Initialize the Gemma3 multimodal architecture adapter.
- setup_component_testing(hf_model: Any, bridge_model: Any = None) → None¶
Set up rotary embedding references for Gemma-3 multimodal component testing.
The language model uses dual RoPE (global + local) like text-only Gemma 3.
- Parameters:
hf_model – The HuggingFace Gemma-3 multimodal model instance
bridge_model – The TransformerBridge model (if available)
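"Dual RoPE (global + local)" means the model keeps two sets of rotary inverse frequencies, one per attention flavor. The sketch below computes them from a base frequency; the head dimension and the two base values (a large base for global-attention layers, a smaller one for local sliding-window layers) are assumptions for illustration, not values read from this adapter.

```python
import math


def rope_inv_frequencies(d_head, base):
    """Inverse frequencies for rotary position embeddings: base^(-2i/d_head)."""
    return [base ** (-2 * i / d_head) for i in range(d_head // 2)]


# Hypothetical bases: a larger base stretches the rotation periods, which is
# how global layers cover long contexts while local layers stay fine-grained.
global_freqs = rope_inv_frequencies(64, 1_000_000.0)  # global-attention layers
local_freqs = rope_inv_frequencies(64, 10_000.0)      # local sliding-window layers
```

Component testing needs references to both embedding objects so that each layer's attention is checked against the rotary table it actually uses.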
- setup_hook_compatibility(bridge: Any) → None¶
Setup hook compatibility for Gemma3 multimodal models.
Like text-only Gemma 3, the multimodal model uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) internally in its forward() method. No additional hook conversion is needed — adding one would double-scale the embeddings.
- Parameters:
bridge – The TransformerBridge instance
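To make the double-scaling hazard concrete, here is a minimal sketch of what an embedding that scales by sqrt(d_model) in its forward pass does. The function name and the plain list-of-lists embedding table are illustrative, not the actual Gemma3TextScaledWordEmbedding implementation.

```python
import math


def scaled_embed(token_ids, W_E, d_model):
    """Look up embeddings, then scale by sqrt(d_model), as the docstring
    says Gemma3TextScaledWordEmbedding does internally in forward()."""
    scale = math.sqrt(d_model)
    return [[w * scale for w in W_E[t]] for t in token_ids]


# If a hook-compatibility conversion applied the same sqrt(d_model) factor
# again, the output would carry a factor of d_model overall: exactly the
# double-scaling the docstring warns against.
```

This is why the adapter registers no extra embedding-hook conversion: the scaling already lives inside the module's forward pass.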