transformer_lens.benchmarks.multimodal module

Multimodal benchmarks for TransformerBridge.

Tests that multimodal models (LLaVA, Gemma3, etc.) correctly handle image inputs through forward(), generate(), and run_with_cache().
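Each benchmark in this module takes the bridge as its first argument and returns a result object, so all of them can be driven from one loop. The following is a minimal sketch, not part of the library: `run_benchmarks` is a hypothetical helper, and the `fake_*` functions are stand-ins so the snippet runs without loading a model.

```python
from typing import Any, Callable, Dict, Iterable


def run_benchmarks(
    bridge: Any,
    benchmarks: Iterable[Callable[..., Any]],
    **kwargs: Any,
) -> Dict[str, Any]:
    """Call each benchmark with the same bridge and collect results by name.

    Hypothetical helper: it only assumes that every benchmark accepts the
    bridge as its first positional argument.
    """
    results: Dict[str, Any] = {}
    for benchmark in benchmarks:
        results[benchmark.__name__] = benchmark(bridge, **kwargs)
    return results


# Stand-ins so the sketch runs without a model. Real code would pass the
# functions from transformer_lens.benchmarks.multimodal instead.
def fake_forward_benchmark(bridge, test_text="Describe this image."):
    return f"forward ok for {bridge}: {test_text}"


def fake_cache_benchmark(bridge, test_text="Describe this image."):
    return f"cache ok for {bridge}: {test_text}"


results = run_benchmarks(
    "stub-bridge",
    [fake_forward_benchmark, fake_cache_benchmark],
    test_text="What is shown?",
)
for name, result in results.items():
    print(name, "->", result)
```

With a real model, the list would hold benchmark_multimodal_forward, benchmark_multimodal_generation, and benchmark_multimodal_cache, and the first argument would be a loaded TransformerBridge.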

transformer_lens.benchmarks.multimodal.benchmark_multimodal_cache(bridge: TransformerBridge, test_text: str = 'Describe this image.', reference_model=None) → BenchmarkResult

Benchmark run_with_cache() with pixel_values for multimodal models.

Tests that running with cache and image input populates the activation cache, including vision encoder hooks if present.

Parameters:
  • bridge – TransformerBridge model to test.

  • test_text – Text prompt.

  • reference_model – Not used, kept for API compatibility.

Returns:

BenchmarkResult with cache details.
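The kind of check described above can be sketched as follows, assuming the cache behaves like a mapping from hook names to activations. `summarize_cache`, the `"vision"` substring test, and the hook names in the example are illustrative assumptions, not the benchmark's actual logic or naming.

```python
from typing import List, Mapping, Tuple


def summarize_cache(
    cache: Mapping[str, object],
    vision_marker: str = "vision",
) -> Tuple[int, List[str]]:
    """Return the number of cached entries and the names that look like vision hooks.

    Hypothetical helper: a hook counts as a vision hook if its name contains
    `vision_marker`, which is an assumption made for this sketch.
    """
    names = list(cache.keys())
    vision_names = [name for name in names if vision_marker in name]
    return len(names), vision_names


# Illustrative cache contents; the hook names are made up for the example.
example_cache = {
    "blocks.0.hook_in": [0.1, 0.2],
    "blocks.0.hook_out": [0.3, 0.4],
    "vision_encoder.hook_out": [0.5, 0.6],
}

total, vision_hooks = summarize_cache(example_cache)
cache_is_populated = total > 0
print(total, vision_hooks, cache_is_populated)
```

An empty cache fails the populated check, while a missing vision hook is only reported, since vision encoder hooks are checked "if present".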

transformer_lens.benchmarks.multimodal.benchmark_multimodal_forward(bridge: TransformerBridge, test_text: str = 'Describe this image.', reference_model=None) → BenchmarkResult

Benchmark forward() with pixel_values for multimodal models.

Tests that passing pixel_values produces valid logits (non-NaN, correct shape).

Parameters:
  • bridge – TransformerBridge model to test.

  • test_text – Text prompt (used as a fallback if the processor is unavailable).

  • reference_model – Not used, kept for API compatibility.

Returns:

BenchmarkResult with forward pass details.
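To make "valid logits" concrete, here is a minimal sketch of a non-NaN and shape check. It assumes the expected shape is [batch, seq_len, vocab_size] and uses plain nested lists in place of a tensor so it runs without any dependencies; `logits_look_valid` is a hypothetical helper, not the benchmark's own code.

```python
import math
from typing import Sequence


def logits_look_valid(
    logits: Sequence[Sequence[Sequence[float]]],
    batch: int,
    seq_len: int,
    vocab_size: int,
) -> bool:
    """Check that logits have shape [batch, seq_len, vocab_size] and contain no NaN.

    Assumption: this shape convention is what "correct shape" means here.
    """
    if len(logits) != batch:
        return False
    for sequence in logits:
        if len(sequence) != seq_len:
            return False
        for position in sequence:
            if len(position) != vocab_size:
                return False
            if any(math.isnan(value) for value in position):
                return False
    return True


good = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]]             # shape [1, 3, 2]
with_nan = [[[0.1, float("nan")], [0.3, 0.4], [0.5, 0.6]]]
too_short = [[[0.1, 0.2], [0.3, 0.4]]]                      # shape [1, 2, 2]

print(logits_look_valid(good, 1, 3, 2))
print(logits_look_valid(with_nan, 1, 3, 2))
print(logits_look_valid(too_short, 1, 3, 2))
```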

transformer_lens.benchmarks.multimodal.benchmark_multimodal_generation(bridge: TransformerBridge, test_text: str = 'Describe this image.', max_new_tokens: int = 10, reference_model=None) → BenchmarkResult

Benchmark generate() with pixel_values for multimodal models.

Tests that generation with image input produces output text that is longer than the input.

Parameters:
  • bridge – TransformerBridge model to test.

  • test_text – Text prompt.

  • max_new_tokens – Maximum number of new tokens to generate.

  • reference_model – Not used, kept for API compatibility.

Returns:

BenchmarkResult with generation details.
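The "longer than input" criterion can be sketched as a length comparison. Whether the benchmark compares token counts or decoded strings is not stated here, so this sketch assumes token IDs; `generation_extended_input` is a hypothetical helper, and the upper bound on new tokens is an extra check added for illustration.

```python
from typing import Sequence


def generation_extended_input(
    input_ids: Sequence[int],
    output_ids: Sequence[int],
    max_new_tokens: int = 10,
) -> bool:
    """Return True if the output adds between 1 and max_new_tokens tokens.

    Assumption: the output sequence includes the prompt tokens followed by
    the newly generated ones.
    """
    new_tokens = len(output_ids) - len(input_ids)
    return 0 < new_tokens <= max_new_tokens


prompt = [101, 2023, 2003]              # illustrative token IDs
extended = prompt + [1037, 4937, 102]   # three new tokens
unchanged = list(prompt)                # nothing generated

print(generation_extended_input(prompt, extended))
print(generation_extended_input(prompt, unchanged))
```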