transformer_lens.benchmarks.multimodal module
Multimodal benchmarks for TransformerBridge.
Tests that multimodal models (LLaVA, Gemma3, etc.) correctly handle image inputs through forward(), generate(), and run_with_cache().
- transformer_lens.benchmarks.multimodal.benchmark_multimodal_cache(bridge: TransformerBridge, test_text: str = 'Describe this image.', reference_model=None) → BenchmarkResult
Benchmark run_with_cache() with pixel_values for multimodal models.
Tests that running with cache and image input populates the activation cache, including vision encoder hooks if present.
- Parameters:
bridge – TransformerBridge model to test.
test_text – Text prompt.
reference_model – Not used, kept for API compatibility.
- Returns:
BenchmarkResult with cache details.
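The cache check described above can be pictured with a small, self-contained sketch. It is not the benchmark's implementation: `summarize_cache` is a hypothetical helper, the hook names are made up, and identifying vision encoder hooks by a substring match is an assumption made only for illustration.

```python
from typing import Dict, Iterable, Tuple


def summarize_cache(
    hook_names: Iterable[str],
    vision_markers: Tuple[str, ...] = ("vision",),
) -> Dict[str, object]:
    """Count cached hooks and those that look like vision encoder hooks.

    Hypothetical helper: matching hook names by substring is an assumption,
    not how the benchmark itself identifies vision hooks.
    """
    names = list(hook_names)
    vision = [n for n in names if any(marker in n for marker in vision_markers)]
    return {
        "populated": len(names) > 0,   # the cache holds at least one activation
        "total_hooks": len(names),
        "vision_hooks": len(vision),   # may be 0 for models without such hooks
    }


# Made-up hook names, for illustration only.
example = summarize_cache(
    ["blocks.0.hook_out", "vision_encoder.layers.0.hook_out"]
)
```

In this sketch a cache with zero vision hooks still counts as populated, mirroring the "if present" wording above.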
- transformer_lens.benchmarks.multimodal.benchmark_multimodal_forward(bridge: TransformerBridge, test_text: str = 'Describe this image.', reference_model=None) → BenchmarkResult
Benchmark forward() with pixel_values for multimodal models.
Tests that passing pixel_values produces valid logits (non-NaN, correct shape).
- Parameters:
bridge – TransformerBridge model to test.
test_text – Text prompt (used as a fallback if the processor is unavailable).
reference_model – Not used, kept for API compatibility.
- Returns:
BenchmarkResult with forward pass details.
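As a rough illustration of what "valid logits" means here, the sketch below checks a nested list for the expected shape and for NaN values. It is a minimal stand-in, not the benchmark's code: `logits_are_valid` is a hypothetical name, plain Python lists replace tensors so the example runs without any dependencies, and the `(batch, seq_len, vocab_size)` layout is an assumption.

```python
import math
from typing import Sequence


def logits_are_valid(
    logits: Sequence[Sequence[Sequence[float]]],
    batch: int,
    seq_len: int,
    vocab_size: int,
) -> bool:
    """Return True if logits has shape (batch, seq_len, vocab_size) and no NaN.

    Hypothetical helper; the expected shape is an assumption of this sketch.
    """
    if len(logits) != batch:
        return False
    for sequence in logits:
        if len(sequence) != seq_len:
            return False
        for position in sequence:
            if len(position) != vocab_size:
                return False
            if any(math.isnan(value) for value in position):
                return False
    return True


# One sequence of two positions over a three-token vocabulary.
good = [[[0.1, -0.2, 0.3], [1.0, 0.0, -1.0]]]
bad = [[[0.1, float("nan"), 0.3], [1.0, 0.0, -1.0]]]

good_ok = logits_are_valid(good, batch=1, seq_len=2, vocab_size=3)
bad_ok = logits_are_valid(bad, batch=1, seq_len=2, vocab_size=3)
```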
- transformer_lens.benchmarks.multimodal.benchmark_multimodal_generation(bridge: TransformerBridge, test_text: str = 'Describe this image.', max_new_tokens: int = 10, reference_model=None) → BenchmarkResult
Benchmark generate() with pixel_values for multimodal models.
Tests that generation with image input produces text output that is longer than the input.
- Parameters:
bridge – TransformerBridge model to test.
test_text – Text prompt.
max_new_tokens – Number of tokens to generate.
reference_model – Not used, kept for API compatibility.
- Returns:
BenchmarkResult with generation details.
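The "longer than input" criterion can be sketched as a simple length comparison. The documentation does not say whether lengths are compared as strings or as tokens, so this example assumes token ids and assumes the output sequence includes the prompt tokens. `generation_grew` is a hypothetical helper and the ids are made up.

```python
from typing import Dict, Sequence


def generation_grew(
    input_ids: Sequence[int],
    output_ids: Sequence[int],
    max_new_tokens: int = 10,
) -> Dict[str, object]:
    """Compare output length to input length (assumed token-level check).

    Hypothetical helper: assumes output_ids contains the prompt followed by
    the newly generated tokens.
    """
    new_tokens = len(output_ids) - len(input_ids)
    return {
        "longer_than_input": new_tokens > 0,
        "within_budget": new_tokens <= max_new_tokens,
        "new_tokens": new_tokens,
    }


# Made-up token ids: a 4-token prompt followed by 3 generated tokens.
prompt_ids = [101, 7, 8, 9]
generated_ids = prompt_ids + [42, 43, 44]
report = generation_grew(prompt_ids, generated_ids, max_new_tokens=10)
```

An output no longer than the prompt would report `longer_than_input` as `False`, which is the failure condition this sketch is meant to show.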