transformer_lens.model_bridge.supported_architectures.llava module

LLaVA architecture adapter.

This adapter supports LlavaForConditionalGeneration, a vision-language model that combines a CLIP vision encoder with a LLaMA language model.

class transformer_lens.model_bridge.supported_architectures.llava.LlavaArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for LLaVA multimodal models (LlavaForConditionalGeneration).

This adapter handles vision-language models such as LLaVA 1.5. The model structure is:

  • model.vision_tower: CLIP vision encoder

  • model.multi_modal_projector: 2-layer MLP (Linear -> GELU -> Linear)

  • model.language_model: LlamaForCausalLM, containing:

      • model.language_model.model.embed_tokens

      • model.language_model.model.layers[]: LLaMA transformer blocks

      • model.language_model.model.norm

      • model.language_model.lm_head

The language model component follows the same patterns as LlamaArchitectureAdapter.
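The multi_modal_projector described above can be sketched as a small standalone module. This is an illustrative reconstruction, not the adapter's code: the dimensions and class name are assumptions, but the shape (Linear -> GELU -> Linear, mapping vision features into the language model's hidden space) follows the structure listed above.

```python
import torch
from torch import nn


class MultiModalProjectorSketch(nn.Module):
    """Sketch of LLaVA's 2-layer projector: Linear -> GELU -> Linear.

    Maps CLIP vision features (vision_hidden_size) into the LLaMA
    hidden space (text_hidden_size). Dimensions are illustrative.
    """

    def __init__(self, vision_hidden_size: int = 1024, text_hidden_size: int = 4096):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, text_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.linear_2(self.act(self.linear_1(image_features)))


# Toy sizes: a batch of 2 images, 3 patch tokens each, 8-dim vision features.
proj = MultiModalProjectorSketch(vision_hidden_size=8, text_hidden_size=16)
out = proj(torch.randn(2, 3, 8))
print(out.shape)  # torch.Size([2, 3, 16])
```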

__init__(cfg: Any) → None

Initialize the LLaVA architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set up rotary embedding references for LLaVA component testing.

LLaVA uses a LLaMA language backbone with rotary position embeddings (RoPE). This method sets the rotary_emb reference on every attention bridge instance for component testing.

Parameters:
  • hf_model – The HuggingFace LLaVA model instance

  • bridge_model – The TransformerBridge model (if available)
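The wiring this method performs can be sketched as follows. The attribute paths (language_model.model.rotary_emb on the HF side, blocks[i].attn on the bridge side) are assumptions for illustration; the real adapter may resolve them differently, and the location of the rotary module in HF models varies across transformers versions.

```python
from typing import Any


def attach_rotary_embeddings(hf_model: Any, bridge_model: Any) -> None:
    """Hedged sketch: share the HF LLaMA backbone's rotary embedding module
    with each attention bridge so the bridges compute RoPE identically
    during component testing. Attribute paths are illustrative assumptions.
    """
    # Assumed location of the shared RoPE module on the HF LLaVA model.
    rotary_emb = hf_model.language_model.model.rotary_emb
    # Assumed bridge layout: one attention bridge per transformer block.
    for block in bridge_model.blocks:
        block.attn.rotary_emb = rotary_emb
```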