transformer_lens.model_bridge.generalized_components.clip_vision_encoder module¶
CLIP Vision Encoder bridge component.
This module contains the bridge components for the CLIP vision encoder and its encoder layers, used in multimodal models such as LLaVA.
- class transformer_lens.model_bridge.generalized_components.clip_vision_encoder.CLIPVisionEncoderBridge(name: str, config: Any | None = None, submodules: Dict[str, GeneralizedComponent] | None = None)¶
Bases: GeneralizedComponent
Bridge for the complete CLIP vision encoder.
The CLIP vision tower consists of:
- vision_model.embeddings: patch + position + CLS token embeddings
- vision_model.pre_layrnorm: LayerNorm before the encoder layers (the attribute name in HF transformers is spelled pre_layrnorm)
- vision_model.encoder.layers[]: stack of encoder layers
- vision_model.post_layernorm: final LayerNorm
This bridge wraps the entire vision tower to provide hooks for interpretability of the vision processing pipeline.
- __init__(name: str, config: Any | None = None, submodules: Dict[str, GeneralizedComponent] | None = None)¶
Initialize the CLIP vision encoder bridge.
- Parameters:
name – The name of this component (e.g., “vision_tower”)
config – Optional configuration object
submodules – Dictionary of submodules to register
- forward(pixel_values: Tensor, **kwargs: Any) Tensor¶
Forward pass through the vision encoder.
- Parameters:
pixel_values – Input image tensor [batch, channels, height, width]
**kwargs – Additional arguments
- Returns:
Vision embeddings [batch, num_patches, hidden_size]
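As a sketch of the shape contract above: a CLIP vision tower splits the image into a grid of patches and prepends one CLS token, so the output sequence length is the patch count plus one. The helper below is illustrative, not part of the bridge API.

```python
def clip_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of output positions: one per image patch, plus the CLS token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1

# CLIP ViT-L/14-336 (the vision tower used by LLaVA-1.5): a 24x24 patch grid + CLS
print(clip_sequence_length(336, 14))  # -> 577
```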
- hook_aliases: Dict[str, str | List[str]] = {'hook_vision_embed': 'embeddings.hook_out', 'hook_vision_out': 'hook_out'}¶
- real_components: Dict[str, tuple]¶
- training: bool¶
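The hook_aliases mapping above is a translation table from user-facing hook names to canonical hook paths on the bridge. A minimal sketch of that resolution step, using the alias table shown above (resolve_alias is a hypothetical helper, not a method of the bridge):

```python
from typing import Dict, List, Union

# Alias table copied from CLIPVisionEncoderBridge.hook_aliases above.
HOOK_ALIASES: Dict[str, Union[str, List[str]]] = {
    "hook_vision_embed": "embeddings.hook_out",
    "hook_vision_out": "hook_out",
}

def resolve_alias(name: str) -> List[str]:
    """Map an alias to its canonical hook path(s); non-aliases pass through."""
    target = HOOK_ALIASES.get(name, name)
    return target if isinstance(target, list) else [target]

print(resolve_alias("hook_vision_embed"))  # -> ['embeddings.hook_out']
print(resolve_alias("hook_out"))           # -> ['hook_out']
```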
- class transformer_lens.model_bridge.generalized_components.clip_vision_encoder.CLIPVisionEncoderLayerBridge(name: str, config: Any | None = None, submodules: Dict[str, GeneralizedComponent] | None = None)¶
Bases: GeneralizedComponent
Bridge for a single CLIP encoder layer.
Each CLIP encoder layer contains:
- layer_norm1: LayerNorm
- self_attn: CLIPAttention
- layer_norm2: LayerNorm
- mlp: CLIPMLP
- __init__(name: str, config: Any | None = None, submodules: Dict[str, GeneralizedComponent] | None = None)¶
Initialize the CLIP encoder layer bridge.
- Parameters:
name – The name of this component (e.g., “encoder.layers”)
config – Optional configuration object
submodules – Dictionary of submodules to register
- forward(hidden_states: Tensor, attention_mask: Tensor | None = None, causal_attention_mask: Tensor | None = None, **kwargs: Any) Tensor¶
Forward pass through the vision encoder layer.
- Parameters:
hidden_states – Input hidden states from previous layer
attention_mask – Optional attention mask
causal_attention_mask – Optional causal attention mask (used by CLIP encoder)
**kwargs – Additional arguments
- Returns:
Output hidden states
- hook_aliases: Dict[str, str | List[str]] = {'hook_attn_in': 'attn.hook_in', 'hook_attn_out': 'attn.hook_out', 'hook_mlp_in': 'mlp.hook_in', 'hook_mlp_out': 'mlp.hook_out', 'hook_resid_post': 'hook_out', 'hook_resid_pre': 'hook_in'}¶
- is_list_item: bool = True¶
- real_components: Dict[str, tuple]¶
- training: bool¶
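Because is_list_item is True, each layer instance sits at an indexed position in the module tree, so a per-layer alias expands into a full hook path that includes the layer index. A hedged sketch of that expansion, using the alias table above (the `vision_tower.encoder.layers.<idx>` path prefix is an assumption for illustration, not a guaranteed naming scheme):

```python
# Per-layer alias table copied from CLIPVisionEncoderLayerBridge.hook_aliases above.
LAYER_ALIASES = {
    "hook_resid_pre": "hook_in",
    "hook_resid_post": "hook_out",
    "hook_attn_in": "attn.hook_in",
    "hook_attn_out": "attn.hook_out",
    "hook_mlp_in": "mlp.hook_in",
    "hook_mlp_out": "mlp.hook_out",
}

def layer_hook_name(layer_idx: int, alias: str) -> str:
    """Expand a per-layer alias into a full hook path.

    The vision_tower.encoder.layers.<idx> prefix is an illustrative
    assumption about where the layer bridges live in the module tree.
    """
    target = LAYER_ALIASES.get(alias, alias)
    return f"vision_tower.encoder.layers.{layer_idx}.{target}"

print(layer_hook_name(0, "hook_resid_pre"))
# -> vision_tower.encoder.layers.0.hook_in
```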