transformer_lens.model_bridge.supported_architectures.qwen3_5_multimodal module

Qwen3.5 multimodal (vision-language) adapter for Qwen3_5ForConditionalGeneration.

Reuses the text-only Qwen3.5 hybrid backbone, nested under model.language_model, and adds the vision tower (model.visual) and its merger. The HF model performs the vision computation during its forward pass; this adapter only supplies the component mapping (hooks and weights).
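The nesting described above can be pictured with a small sketch. It is not the adapter's actual component mapping: the dictionary keys, the resolve helper and the stub object are all hypothetical, and only the two module paths named in the text (model.language_model and model.visual) are taken from the source.

```python
from functools import reduce
from types import SimpleNamespace

# Hypothetical bridge-side names mapped to HF module paths. Only the two
# paths come from the documentation; the keys are illustrative.
HYPOTHETICAL_PATHS = {
    "language_model": "model.language_model",
    "vision_tower": "model.visual",
}


def resolve(root, dotted_path):
    """Walk a dotted attribute path such as 'model.visual' starting at root."""
    return reduce(getattr, dotted_path.split("."), root)


# Stand-in object tree shaped like the nesting described above.
hf_stub = SimpleNamespace(
    model=SimpleNamespace(
        language_model=SimpleNamespace(kind="text backbone"),
        visual=SimpleNamespace(kind="vision tower"),
    )
)

resolved = {name: resolve(hf_stub, path) for name, path in HYPOTHETICAL_PATHS.items()}
```

Keeping the paths as data rather than hard-coded attribute access is one way a mapping can point at a nested backbone without duplicating the text-only adapter's logic.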

class transformer_lens.model_bridge.supported_architectures.qwen3_5_multimodal.Qwen3_5MultimodalArchitectureAdapter(cfg: Any)

Bases: Qwen3ArchitectureAdapter

Full vision-language adapter for Qwen3_5ForConditionalGeneration.

component_mapping: ComponentMapping | None
preprocess_weights(state_dict: dict[str, Tensor]) → dict[str, Tensor]

Slice the query half out of each gated q_proj.weight. The key matcher is path-prefix-agnostic, so it matches regardless of the module path that precedes the attention block.
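A minimal sketch of what such a slice could look like, under stated assumptions. The row layout (per head: head_dim query rows followed by head_dim gate rows) is an assumption and should be checked against the HF implementation. Suffix matching on the key is one possible reading of "path-prefix-agnostic". The function names and state-dict keys are hypothetical, and nested lists stand in for tensors so the sketch runs without torch.

```python
def slice_query_rows(weight_rows, num_heads, head_dim):
    """Keep the query rows of a gated q_proj weight.

    Assumes rows are laid out per head as [query (head_dim), gate (head_dim)].
    """
    expected = num_heads * 2 * head_dim
    if len(weight_rows) != expected:
        raise ValueError(f"expected {expected} rows, got {len(weight_rows)}")
    kept = []
    for head in range(num_heads):
        start = head * 2 * head_dim
        kept.extend(weight_rows[start:start + head_dim])
    return kept


def preprocess_sketch(state_dict, num_heads, head_dim):
    """Return a copy of state_dict with every gated q_proj.weight sliced."""
    out = {}
    for key, value in state_dict.items():
        # Suffix match: whatever prefix precedes the attention block is ignored.
        if key.endswith("q_proj.weight"):
            out[key] = slice_query_rows(value, num_heads, head_dim)
        else:
            out[key] = value
    return out


# Two heads, head_dim 1, hidden size 1: rows are q0, g0, q1, g1.
# The key names are illustrative only.
Q_KEY = "model.language_model.layers.3.self_attn.q_proj.weight"
K_KEY = "model.language_model.layers.3.self_attn.k_proj.weight"
demo = {Q_KEY: [["q0"], ["g0"], ["q1"], ["g1"]], K_KEY: [["k0"]]}
sliced = preprocess_sketch(demo, num_heads=2, head_dim=1)
```

With real tensors the same index arithmetic would typically be a reshape to (num_heads, 2 * head_dim, hidden) followed by a slice on the middle axis, again assuming the per-head layout above.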

required_libraries: list[str] = ['torchvision']
required_libraries_group: str = 'multimodal'
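As an illustration of how a caller might act on required_libraries, the sketch below checks whether each listed package is importable in the current environment. The helper name missing_libraries is hypothetical, and how TransformerLens itself consumes these two attributes is not shown in this reference.

```python
import importlib.util


def missing_libraries(required):
    """Return the top-level package names in `required` that are not importable."""
    return [name for name in required if importlib.util.find_spec(name) is None]


required_libraries = ["torchvision"]  # value documented above
missing = missing_libraries(required_libraries)
if missing:
    print(f"Missing optional libraries: {', '.join(missing)}")
```

importlib.util.find_spec only locates the package and does not import it, which keeps the check cheap.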
setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Set eager attention and rotary_emb references for the nested language model.

The backbone is hybrid: only full-attention layers have self_attn/attn, so linear-attention layers are skipped.
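The skip logic can be sketched as follows. This is an assumption-laden illustration rather than the adapter's code: attach_rotary_refs is a hypothetical helper, the linear_attn attribute name on the stand-in layers is invented for the example, and only the rotary_emb part of the setup is shown.

```python
from types import SimpleNamespace


def attach_rotary_refs(layers, rotary_emb):
    """Give each full-attention layer a reference to the shared rotary_emb.

    Returns the indices of the layers that were touched.
    """
    touched = []
    for index, layer in enumerate(layers):
        attn = getattr(layer, "self_attn", None)
        if attn is None:
            # Linear-attention layer: it has no self_attn, so skip it.
            continue
        attn.rotary_emb = rotary_emb
        touched.append(index)
    return touched


# Stand-in hybrid stack alternating linear-attention and full-attention layers.
rotary = object()
layers = [
    SimpleNamespace(linear_attn=SimpleNamespace()),
    SimpleNamespace(self_attn=SimpleNamespace()),
    SimpleNamespace(linear_attn=SimpleNamespace()),
    SimpleNamespace(self_attn=SimpleNamespace()),
]
touched = attach_rotary_refs(layers, rotary)
```

Checking for the attribute rather than relying on a fixed layer pattern keeps the loop independent of how full-attention and linear-attention layers are interleaved.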

uses_split_attention: bool
weight_processing_conversions: dict