transformer_lens.model_bridge.supported_architectures.gemma3_multimodal module

Gemma3 Multimodal architecture adapter.

This adapter supports Gemma3ForConditionalGeneration, the vision-language variant of Gemma 3 used by models like MedGemma.

class transformer_lens.model_bridge.supported_architectures.gemma3_multimodal.Gemma3MultimodalArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for Gemma3 multimodal models (Gemma3ForConditionalGeneration).

This adapter handles vision-language models like Gemma 3 4B/12B/27B and MedGemma. The model structure is:

  • model.vision_tower – SigLIP vision encoder

  • model.multi_modal_projector – Projects vision embeddings to language space

  • model.language_model – Gemma3TextModel (same as text-only Gemma 3)

  • lm_head – Output projection

The language model component follows the same patterns as Gemma3ArchitectureAdapter.
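The submodule layout above can be sketched with a minimal stand-in object. This is an illustrative mock only, not the adapter's actual API; a real Gemma3ForConditionalGeneration instance would come from HuggingFace transformers.

```python
from types import SimpleNamespace

# Hypothetical stand-in mirroring the submodule layout described above.
hf_model = SimpleNamespace(
    model=SimpleNamespace(
        vision_tower="SigLIP vision encoder",
        multi_modal_projector="vision -> language projection",
        language_model="Gemma3TextModel (text-only Gemma 3 stack)",
    ),
    lm_head="output projection",
)

# The component paths the adapter navigates: three submodules nested under
# `model`, plus `lm_head` at the top level.
paths = {
    "vision_tower": hf_model.model.vision_tower,
    "multi_modal_projector": hf_model.model.multi_modal_projector,
    "language_model": hf_model.model.language_model,
    "lm_head": hf_model.lm_head,
}
for name, component in paths.items():
    print(f"{name}: {component}")
```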

__init__(cfg: Any) → None

Initialize the Gemma3 multimodal architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set up rotary embedding references for Gemma-3 multimodal component testing.

The language model uses dual RoPE (global + local) like text-only Gemma 3.

Parameters:
  • hf_model – The HuggingFace Gemma-3 multimodal model instance

  • bridge_model – The TransformerBridge model (if available)
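The dual-RoPE arrangement can be illustrated with plain math. The base values below (a large base for global-attention layers, a smaller one for local sliding-window layers) and the head dimension are assumptions for illustration, not values read from this adapter.

```python
import math

def rope_inv_freq(d_head: int, theta: float) -> list[float]:
    """Inverse rotary frequencies for even dimensions 0, 2, ..., d_head - 2."""
    return [theta ** (-i / d_head) for i in range(0, d_head, 2)]

# Assumed illustrative values: Gemma 3 text layers are commonly described as
# using a large base for global-attention layers and a smaller base for local
# sliding-window layers; the exact numbers depend on the checkpoint config.
d_head = 256
global_freqs = rope_inv_freq(d_head, theta=1_000_000.0)
local_freqs = rope_inv_freq(d_head, theta=10_000.0)

# A larger base makes the high-dimension frequencies decay further, which is
# why the two layer types need separate rotary embedding references.
print(global_freqs[-1] < local_freqs[-1])
```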

setup_hook_compatibility(bridge: Any) → None

Setup hook compatibility for Gemma3 multimodal models.

Like text-only Gemma 3, the multimodal model uses Gemma3TextScaledWordEmbedding which scales embeddings by sqrt(d_model) internally in its forward() method. No additional hook conversion is needed — adding one would double-scale the embeddings.

Parameters:
  • bridge – The TransformerBridge instance
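The double-scaling hazard can be shown with a one-value sketch. The hidden size below is illustrative; the point is that applying the sqrt(d_model) factor twice scales embeddings by d_model overall.

```python
import math

d_model = 2560  # illustrative hidden size; the real value depends on the checkpoint

# Gemma3TextScaledWordEmbedding multiplies the raw embedding by sqrt(d_model)
# inside its forward(); model that effect on a single embedding value.
raw_value = 0.01
scaled_once = raw_value * math.sqrt(d_model)

# If a compatibility hook applied the same scaling again, the combined factor
# would be d_model rather than sqrt(d_model) -- the double-scaling this
# adapter avoids by adding no extra hook conversion.
scaled_twice = scaled_once * math.sqrt(d_model)

print(scaled_twice / raw_value)  # combined factor of d_model, not sqrt(d_model)
```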