transformer_lens.model_bridge.supported_architectures.llava module¶
LLava architecture adapter.
This adapter supports LlavaForConditionalGeneration, the vision-language model combining a CLIP vision encoder with a LLaMA language model.
- class transformer_lens.model_bridge.supported_architectures.llava.LlavaArchitectureAdapter(cfg: Any)¶
Bases: ArchitectureAdapter
Architecture adapter for LLava multimodal models (LlavaForConditionalGeneration).
This adapter handles vision-language models like LLava 1.5. The model structure is:
- model.vision_tower: CLIP vision encoder
- model.multi_modal_projector: 2-layer MLP (Linear -> GELU -> Linear)
- model.language_model: LlamaForCausalLM
  - model.language_model.model.embed_tokens
  - model.language_model.model.layers[]: LLaMA transformer blocks
  - model.language_model.model.norm
  - model.language_model.lm_head
The language model component follows the same patterns as LlamaArchitectureAdapter.
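As a rough illustration of this layout, the sketch below loads the HuggingFace model and inspects the submodules the adapter maps onto. The checkpoint name is only an example, and the exact attribute nesting can vary between transformers releases.

```python
# Illustrative sketch: inspect the HuggingFace LLava layout described above.
from transformers import LlavaForConditionalGeneration

# Example checkpoint; any LlavaForConditionalGeneration checkpoint should work.
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

print(type(model.vision_tower).__name__)    # CLIP vision encoder
print(model.multi_modal_projector)          # Linear -> GELU -> Linear projector
print(type(model.language_model).__name__)  # LLaMA language model backbone
```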
- __init__(cfg: Any) → None¶
Initialize the LLava architecture adapter.
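A minimal construction sketch, assuming cfg is (or wraps) the HuggingFace model config; in normal use the bridge builds the adapter itself, and the object it passes as cfg may differ.

```python
# Hypothetical construction sketch; the bridge normally instantiates the adapter,
# and the cfg it passes may not be the raw HF config used here.
from transformers import AutoConfig
from transformer_lens.model_bridge.supported_architectures.llava import (
    LlavaArchitectureAdapter,
)

cfg = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")  # example checkpoint
adapter = LlavaArchitectureAdapter(cfg)
```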
- setup_component_testing(hf_model: Any, bridge_model: Any = None) → None¶
Set up rotary embedding references for LLava component testing.
LLava uses a LLaMA language backbone with RoPE. We set the rotary_emb reference on all attention bridge instances for component testing.
- Parameters:
hf_model – The HuggingFace LLava model instance
bridge_model – The TransformerBridge model (if available)
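The call pattern follows the signature above; the second half of this sketch only illustrates the documented behaviour (propagating the LLaMA backbone's rotary embedding to the attention bridges) and is not the actual implementation. Names such as rotary_emb, blocks, and attn are assumptions.

```python
# Assumes `adapter`, `hf_model` (the HF LLava model) and `bridge_model` (the
# TransformerBridge) from the sketches above.
adapter.setup_component_testing(hf_model, bridge_model=bridge_model)

# Conceptually, this exposes the LLaMA backbone's RoPE module to each attention
# bridge, roughly along these lines (illustrative only; attribute names are
# assumptions, not the real implementation):
rotary_emb = hf_model.language_model.model.rotary_emb
for block in bridge_model.blocks:
    block.attn.rotary_emb = rotary_emb
```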