transformer_lens.benchmarks.text_quality module

Text quality benchmark for TransformerBridge.

Generates text with the bridge model from multiple diverse prompts and scores each continuation’s legibility using GPT-2 as a perplexity-based judge. Only the generated continuation tokens are scored (prompt tokens are masked), and a repetition penalty is applied to catch degenerate looping output.

Generation is seeded for reproducibility, and the scoring model is loaded once and reused across all prompts.
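The continuation-only masking can be sketched as follows. This is an illustrative reconstruction using the Hugging Face transformers API, not the module's internal code; the helper name continuation_perplexity is hypothetical, and the repetition penalty the module applies is omitted. Labels set to -100 are excluded from the loss, so the resulting perplexity reflects only the continuation tokens:

   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer

   def continuation_perplexity(prompt: str, continuation: str, model_name: str = "gpt2") -> float:
       """Perplexity of the continuation only; prompt tokens are masked out of the loss."""
       tokenizer = AutoTokenizer.from_pretrained(model_name)
       model = AutoModelForCausalLM.from_pretrained(model_name).eval()

       # Token boundary at the prompt/continuation join is approximate.
       prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
       full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids

       labels = full_ids.clone()
       labels[:, :prompt_len] = -100  # -100 labels are ignored by the loss

       with torch.no_grad():
           # The model shifts labels internally; loss is the mean NLL over unmasked tokens.
           loss = model(full_ids, labels=labels).loss
       return torch.exp(loss).item()  # perplexity = exp(mean NLL)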

transformer_lens.benchmarks.text_quality.benchmark_text_quality(bridge: TransformerBridge, test_text: str, max_new_tokens: int = 50, scoring_model_name: str = 'gpt2', pass_threshold: float = 85.0, device: str = 'cpu', scoring_model: PreTrainedModel | None = None, scoring_tokenizer: PreTrainedTokenizerBase | None = None) → BenchmarkResult

Benchmark text generation quality using continuation-only perplexity scoring.

Generates text from multiple diverse prompts, scores each continuation using GPT-2 perplexity (prompt tokens masked), applies a repetition penalty, and returns the averaged score.

Parameters:
  • bridge – TransformerBridge model to test.

  • test_text – Primary input prompt (additional diverse prompts are also used).

  • max_new_tokens – Maximum number of new tokens to generate per prompt.

  • scoring_model_name – Name of the HuggingFace model used as the perplexity scorer.

  • pass_threshold – Minimum average score to pass (default 85.0).

  • device – Device for the scoring model.

  • scoring_model – Optional pre-loaded scoring model. When provided alongside scoring_tokenizer, skips loading and avoids cleanup (caller owns lifecycle); see the usage sketch below.

  • scoring_tokenizer – Optional pre-loaded tokenizer for the scoring model.

Returns:

BenchmarkResult with quality score details.
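A minimal usage sketch, assuming a TransformerBridge has already been constructed (construction is not covered on this page) and pre-loading the GPT-2 scorer so repeated calls reuse it:

   from transformers import AutoModelForCausalLM, AutoTokenizer
   from transformer_lens.benchmarks.text_quality import benchmark_text_quality

   # Load the scorer once; because it is passed in, the benchmark skips loading
   # and cleanup, and the caller remains responsible for its lifecycle.
   scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()
   scorer_tokenizer = AutoTokenizer.from_pretrained("gpt2")

   result = benchmark_text_quality(
       bridge,  # a TransformerBridge constructed elsewhere
       test_text="The quick brown fox",
       max_new_tokens=50,
       pass_threshold=85.0,
       device="cpu",
       scoring_model=scorer,
       scoring_tokenizer=scorer_tokenizer,
   )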