transformer_lens.model_bridge.supported_architectures.gemma3_multimodal module

Gemma3 Multimodal architecture adapter.

This adapter supports Gemma3ForConditionalGeneration, the vision-language variant of Gemma 3 used by models like MedGemma.

class transformer_lens.model_bridge.supported_architectures.gemma3_multimodal.Gemma3MultimodalArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma3 multimodal models (Gemma3ForConditionalGeneration).

This adapter handles vision-language models like Gemma 3 4B/12B/27B and MedGemma. The model structure is:

  • model.vision_tower – SigLIP vision encoder

  • model.multi_modal_projector – Projects vision embeddings to language space

  • model.language_model – Gemma3TextModel (same as text-only Gemma 3)

  • lm_head – Output projection

The language model component follows the same patterns as Gemma3ArchitectureAdapter.
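The submodule layout above can be sketched with a minimal stand-in object. This is an illustrative mock only, not the adapter's actual API; a real Gemma3ForConditionalGeneration instance would come from HuggingFace transformers.

```python
from types import SimpleNamespace

# Hypothetical stand-in mirroring the submodule layout described above.
hf_model = SimpleNamespace(
    model=SimpleNamespace(
        vision_tower="SigLIP vision encoder",
        multi_modal_projector="vision -> language projection",
        language_model="Gemma3TextModel (text-only Gemma 3 stack)",
    ),
    lm_head="output projection",
)

# The component paths the adapter navigates: three submodules nested under
# `model`, plus `lm_head` at the top level.
paths = {
    "vision_tower": hf_model.model.vision_tower,
    "multi_modal_projector": hf_model.model.multi_modal_projector,
    "language_model": hf_model.model.language_model,
    "lm_head": hf_model.lm_head,
}
for name, component in paths.items():
    print(f"{name}: {component}")
```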

__init__(cfg: Any) → None

Initialize the Gemma3 multimodal architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set up rotary embedding references for Gemma-3 multimodal component testing.

The language model uses dual RoPE (global + local) like text-only Gemma 3.

Parameters:
  • hf_model – The HuggingFace Gemma-3 multimodal model instance

  • bridge_model – The TransformerBridge model (if available)
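The dual-RoPE arrangement can be illustrated with plain math. The base values below (a large base for global-attention layers, a smaller one for local sliding-window layers) and the head dimension are assumptions for illustration, not values read from this adapter.

```python
import math

def rope_inv_freq(d_head: int, theta: float) -> list[float]:
    """Inverse rotary frequencies for even dimensions 0, 2, ..., d_head - 2."""
    return [theta ** (-i / d_head) for i in range(0, d_head, 2)]

# Assumed illustrative values: Gemma 3 text layers are commonly described as
# using a large base for global-attention layers and a smaller base for local
# sliding-window layers; the exact numbers depend on the checkpoint config.
d_head = 256
global_freqs = rope_inv_freq(d_head, theta=1_000_000.0)
local_freqs = rope_inv_freq(d_head, theta=10_000.0)

# A larger base makes the high-dimension frequencies decay further, which is
# why the two layer types need separate rotary embedding references.
print(global_freqs[-1] < local_freqs[-1])
```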

setup_hook_compatibility(bridge: Any) → None

Setup hook compatibility for Gemma3 multimodal models.

Like text-only Gemma 3, the multimodal model uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) internally in its forward() method. No additional hook conversion is needed — adding one would double-scale the embeddings.

Parameters:
  • bridge – The TransformerBridge instance
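The double-scaling hazard can be shown with a one-value sketch. The hidden size below is illustrative; the point is that applying the sqrt(d_model) factor twice scales embeddings by d_model overall.

```python
import math

d_model = 2560  # illustrative hidden size; the real value depends on the checkpoint

# Gemma3TextScaledWordEmbedding multiplies the raw embedding by sqrt(d_model)
# inside its forward(); model that effect on a single embedding value.
raw_value = 0.01
scaled_once = raw_value * math.sqrt(d_model)

# If a compatibility hook applied the same scaling again, the combined factor
# would be d_model rather than sqrt(d_model) -- the double-scaling this
# adapter avoids by adding no extra hook conversion.
scaled_twice = scaled_once * math.sqrt(d_model)

print(scaled_twice / raw_value)  # combined factor of d_model, not sqrt(d_model)
```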