transformer_lens.benchmarks.multimodal module

Multimodal benchmarks for TransformerBridge.

Tests that multimodal models (LLaVA, Gemma3, etc.) correctly handle image inputs through forward(), generate(), and run_with_cache().

transformer_lens.benchmarks.multimodal.benchmark_multimodal_cache(bridge: TransformerBridge, test_text: str = 'Describe this image.', reference_model=None) → BenchmarkResult

Benchmark run_with_cache() with pixel_values for multimodal models.

Tests that running with cache and image input populates the activation cache, including vision encoder hooks if present.

Parameters:

bridge – TransformerBridge model to test.
test_text – Text prompt.
reference_model – Not used, kept for API compatibility.

Returns:

BenchmarkResult with cache details.
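The kind of check described above can be sketched in plain Python. This is only an illustration, not the benchmark's actual implementation: the helper `summarize_cache`, the `"vision"` substring heuristic, and the example hook names are all assumptions made for this sketch, and the real cache is a mapping of hook names to tensors rather than a list of strings.

```python
from typing import Dict, List, Tuple


def summarize_cache(
    cache_keys: List[str],
    vision_markers: Tuple[str, ...] = ("vision",),
) -> Dict[str, object]:
    """Report whether a cache is populated and which keys look vision-related.

    Hypothetical helper: the substring heuristic is an assumption for
    illustration, not how the library identifies vision encoder hooks.
    """
    vision_keys = [
        key
        for key in cache_keys
        if any(marker in key.lower() for marker in vision_markers)
    ]
    return {
        "populated": len(cache_keys) > 0,
        "num_entries": len(cache_keys),
        "vision_keys": vision_keys,
    }


# Illustrative hook names only; real names depend on the model architecture.
example_keys = [
    "blocks.0.hook_resid_pre",
    "blocks.0.attn.hook_pattern",
    "vision_encoder.hook_out",
]
summary = summarize_cache(example_keys)
```

An empty `vision_keys` list is not treated as a failure here, matching the "if present" wording above: a model without vision encoder hooks can still have a populated cache.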

transformer_lens.benchmarks.multimodal.benchmark_multimodal_forward(bridge: TransformerBridge, test_text: str = 'Describe this image.', reference_model=None) → BenchmarkResult

Benchmark forward() with pixel_values for multimodal models.

Tests that passing pixel_values produces valid logits (non-NaN, correct shape).

Parameters:

bridge – TransformerBridge model to test.
test_text – Text prompt (used as a fallback if the processor is unavailable).
reference_model – Not used, kept for API compatibility.

Returns:

BenchmarkResult with forward pass details.
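A "non-NaN, correct shape" check can be sketched without any tensor library. The helper `logits_are_valid` below is hypothetical, and the `(batch, seq_len, vocab_size)` layout is an assumption: this page does not state which shape the benchmark expects. Nested lists stand in for the tensor a real forward pass would return.

```python
import math
from typing import Sequence


def logits_are_valid(
    logits: Sequence[Sequence[Sequence[float]]],
    batch: int,
    seq_len: int,
    vocab_size: int,
) -> bool:
    """Return True if logits match (batch, seq_len, vocab_size) and hold no NaN.

    Hypothetical helper operating on nested lists; the expected layout is an
    assumption for illustration.
    """
    if len(logits) != batch:
        return False
    for sequence in logits:
        if len(sequence) != seq_len:
            return False
        for position in sequence:
            if len(position) != vocab_size:
                return False
            if any(math.isnan(value) for value in position):
                return False
    return True


# One batch element, two positions, vocabulary of three tokens.
good_logits = [[[0.1, -1.2, 3.0], [0.0, 0.5, -0.5]]]
bad_logits = [[[0.1, float("nan"), 3.0], [0.0, 0.5, -0.5]]]

good_ok = logits_are_valid(good_logits, batch=1, seq_len=2, vocab_size=3)
bad_ok = logits_are_valid(bad_logits, batch=1, seq_len=2, vocab_size=3)
```

With PyTorch tensors the same two conditions would typically be expressed by comparing `logits.shape` against the expected tuple and testing `torch.isnan(logits).any()`.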

transformer_lens.benchmarks.multimodal.benchmark_multimodal_generation(bridge: TransformerBridge, test_text: str = 'Describe this image.', max_new_tokens: int = 10, reference_model=None) → BenchmarkResult

Benchmark generate() with pixel_values for multimodal models.

Tests that generation with image input produces output text longer than the input prompt.

Parameters:

bridge – TransformerBridge model to test.
test_text – Text prompt.
max_new_tokens – Maximum number of new tokens to generate.
reference_model – Not used, kept for API compatibility.

Returns:

BenchmarkResult with generation details.
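The pass condition described above, output longer than input, can be sketched as a simple length comparison. The helper `check_generation` and its returned keys are hypothetical, and comparing character counts is an assumption: this page does not say whether the benchmark compares characters or tokens.

```python
from typing import Dict, Union


def check_generation(prompt: str, output: str) -> Dict[str, Union[bool, int]]:
    """Compare generated text against the prompt by character length.

    Hypothetical helper; comparing characters rather than tokens is an
    assumption made for this sketch.
    """
    return {
        "passed": len(output) > len(prompt),
        "input_length": len(prompt),
        "output_length": len(output),
    }


prompt = "Describe this image."
# Illustrative outputs only, not real model generations.
extended = check_generation(prompt, prompt + " A cat sits on a sofa.")
unchanged = check_generation(prompt, prompt)
```

An output identical to the prompt fails the check, which is how a model that stops generating immediately would show up.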