transformer_lens.tools.model_registry.verify_models module

Batch model verification tool for the TransformerLens model registry.

Iterates through supported models, estimates memory requirements, runs benchmarks phase-by-phase, and updates the registry with status, phase scores, and notes.

Usage:

python -m transformer_lens.tools.model_registry.verify_models [options]

Examples

# Dry run to see what would be tested
python -m transformer_lens.tools.model_registry.verify_models --dry-run

# Verify top 10 models per architecture on CPU
python -m transformer_lens.tools.model_registry.verify_models --device cpu

# Verify only GPT2 models, limit to 3
python -m transformer_lens.tools.model_registry.verify_models --architectures GPT2LMHeadModel --limit 3

# Resume from a previous interrupted run
python -m transformer_lens.tools.model_registry.verify_models --resume

# Re-verify already-tested models for a specific architecture
python -m transformer_lens.tools.model_registry.verify_models --reverify --architectures Olmo2ForCausalLM

class transformer_lens.tools.model_registry.verify_models.ModelCandidate(model_id: str, architecture_id: str, estimated_params: int | None = None, estimated_memory_gb: float | None = None)

Bases: object

A model selected for verification.

architecture_id: str
estimated_memory_gb: float | None = None
estimated_params: int | None = None
model_id: str
class transformer_lens.tools.model_registry.verify_models.VerificationProgress(tested: list[str] = <factory>, skipped: list[str] = <factory>, failed: list[str] = <factory>, verified: list[str] = <factory>, start_time: str | None = None)

Bases: object

Tracks progress across a verification run.

failed: list[str]
classmethod from_dict(data: dict) → VerificationProgress
skipped: list[str]
start_time: str | None = None
tested: list[str]
to_dict() → dict
verified: list[str]
transformer_lens.tools.model_registry.verify_models.estimate_benchmark_memory_gb(n_params: int, dtype: str = 'float32', phases: list[int] | None = None, use_hf_reference: bool = True) → float

Estimate peak memory needed for benchmark suite.

Phases run sequentially, so peak memory is the maximum of any single phase, not the sum. The multiplier represents how many model copies exist at peak:

  • Phase 1 (HF ref on): HF ref + Bridge → 2.0x peak

  • Phase 1 (HF ref off): Bridge only → 1.0x peak

  • Phase 2: Bridge + HookedTransformer (separate copy) → 2.0x model + overhead

  • Phase 3: Same as Phase 2 (processed versions) → 2.0x model + overhead

  • Phase 4: Bridge + GPT-2 scorer (~500MB) → ~1.0x model + 0.5 GB

Parameters:
  • n_params – Number of model parameters

  • dtype – Data type for memory calculation

  • phases – Which phases will be run (None = all phases)

  • use_hf_reference – Whether Phase 1 loads an HF reference alongside the Bridge. Mirrors the --no-hf-reference CLI flag.

Returns:

Estimated peak memory in GB
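The max-over-phases arithmetic described above can be sketched as follows. The multipliers come from the phase table; the function name and the per-phase overhead constants here are illustrative, not the module's exact implementation:

```python
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}

def peak_gb_sketch(n_params, dtype="float32", phases=None, use_hf_reference=True):
    # Model-copy multiplier at peak for each phase (from the table above)
    phase_mult = {
        1: 2.0 if use_hf_reference else 1.0,  # HF ref + Bridge, or Bridge only
        2: 2.0,  # Bridge + HookedTransformer
        3: 2.0,  # Same as Phase 2, processed versions
        4: 1.0,  # Bridge + GPT-2 scorer
    }
    # Phase 4 loads a ~500 MB scorer on top of the model
    phase_extra_gb = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.5}
    phases = phases or [1, 2, 3, 4]
    model_gb = n_params * DTYPE_BYTES[dtype] / 1e9
    # Phases run sequentially: peak is the max over phases, not the sum
    return max(phase_mult[p] * model_gb + phase_extra_gb[p] for p in phases)
```

For a GPT-2-sized model (~124M params, float32, one copy ≈ 0.5 GB), the 2.0x phases dominate unless Phase 4's fixed 0.5 GB scorer overhead edges past them.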

transformer_lens.tools.model_registry.verify_models.estimate_model_params(model_id: str) → int

Estimate parameter count using AutoConfig (lightweight, no model download).

Fetches only the config JSON (a few kilobytes) and computes n_params from the model dimensions, using the same formula as HookedTransformerConfig.__post_init__.

Parameters:

model_id – HuggingFace model ID

Returns:

Estimated number of parameters
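The exact formula lives in HookedTransformerConfig.__post_init__; as a rough stand-in, the standard decoder-only approximation from layer count and hidden size looks like this (an approximation for intuition, not the module's formula):

```python
def approx_param_count(n_layers: int, d_model: int) -> int:
    # Rough non-embedding count for a decoder-only transformer:
    # ~4 * d_model^2 per layer for attention (Q, K, V, O projections)
    # plus ~8 * d_model^2 for the MLP (assuming d_mlp = 4 * d_model).
    return 12 * n_layers * d_model * d_model
```

GPT-2 small's dimensions (12 layers, d_model=768) give roughly 85M non-embedding parameters; embeddings bring the full model to ~124M, so the estimate is coarse but adequate for memory gating.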

Raises:

Exception – If config cannot be fetched or parsed

transformer_lens.tools.model_registry.verify_models.get_available_memory_gb(device: str) → float

Detect available memory on the target device.

Parameters:

device – “cpu” or “cuda”

Returns:

Available memory in GB
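For the CPU case, available memory can be read with only the standard library on POSIX systems; a sketch (the real helper may use psutil, and torch.cuda.mem_get_info for the "cuda" case, instead):

```python
import os

def available_cpu_memory_gb() -> float:
    # POSIX-only sketch: free physical pages times page size.
    # SC_AVPHYS_PAGES is not available on all platforms (e.g. macOS).
    page_size = os.sysconf("SC_PAGE_SIZE")
    free_pages = os.sysconf("SC_AVPHYS_PAGES")
    return page_size * free_pages / 1e9
```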

transformer_lens.tools.model_registry.verify_models.main() → None

CLI entry point for batch model verification.

transformer_lens.tools.model_registry.verify_models.select_models_for_verification(per_arch: int = 10, architectures: list[str] | None = None, limit: int | None = None, resume_progress: VerificationProgress | None = None, retry_failed: bool = False, reverify: bool = False) → list[ModelCandidate]

Select models for verification from the registry.

Loads supported_models.json (already sorted by downloads). Takes the top N unverified models per architecture.

Parameters:
  • per_arch – Maximum models to verify per architecture

  • architectures – Filter to specific architectures (None = all)

  • limit – Total model cap (None = no cap)

  • resume_progress – If resuming, skip already-tested models

  • retry_failed – If True, include previously failed models for re-testing

  • reverify – If True, ignore previous status and re-test all matching models

Returns:

List of ModelCandidate objects to verify
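The top-N-per-architecture selection over a download-sorted list can be sketched as below. The function and the dict keys are illustrative; only the behavior (per-architecture cap, architecture filter, global limit, skip already-tested) follows the parameters documented above:

```python
def select_candidates_sketch(models, per_arch=10, architectures=None,
                             limit=None, already_tested=frozenset()):
    # `models` is assumed pre-sorted by downloads, as supported_models.json is,
    # so taking entries in order yields the top N untested per architecture.
    counts = {}
    selected = []
    for m in models:
        arch = m["architecture"]
        if architectures is not None and arch not in architectures:
            continue  # architecture filter
        if m["model_id"] in already_tested:
            continue  # skip models from a resumed run
        if counts.get(arch, 0) >= per_arch:
            continue  # per-architecture cap reached
        selected.append(m["model_id"])
        counts[arch] = counts.get(arch, 0) + 1
        if limit is not None and len(selected) >= limit:
            break  # global cap
    return selected

models = [
    {"model_id": "gpt2", "architecture": "GPT2LMHeadModel"},
    {"model_id": "gpt2-medium", "architecture": "GPT2LMHeadModel"},
    {"model_id": "allenai/OLMo-2-1124-7B", "architecture": "Olmo2ForCausalLM"},
]
top_one_per_arch = select_candidates_sketch(models, per_arch=1)
```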

transformer_lens.tools.model_registry.verify_models.verify_models(candidates: list[ModelCandidate], device: str = 'cpu', max_memory_gb: float | None = None, dtype: str = 'float32', use_hf_reference: bool = True, use_ht_reference: bool = True, phases: list[int] | None = None, quiet: bool = False, progress: VerificationProgress | None = None) → VerificationProgress

Run verification benchmarks on a list of model candidates.

Parameters:
  • candidates – Models to verify

  • device – Device for benchmarks

  • max_memory_gb – Memory limit (auto-detected if None)

  • dtype – Dtype for memory estimation

  • use_hf_reference – Whether to compare against HuggingFace model

  • use_ht_reference – Whether to compare against HookedTransformer

  • phases – Which benchmark phases to run (default: [1, 2, 3, 4])

  • quiet – Suppress verbose output

  • progress – Existing progress for resume

Returns:

VerificationProgress with results
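One behavior the max_memory_gb parameter implies, skipping candidates whose estimated footprint exceeds the budget, can be sketched as follows (a guess at the gating logic, not the module's actual code; the dict keys mirror the ModelCandidate fields):

```python
def gate_by_memory(candidates, max_memory_gb):
    # Split candidates into those that fit the memory budget and those skipped.
    # Candidates with no estimate are optimistically allowed through.
    runnable, skipped = [], []
    for c in candidates:
        est = c.get("estimated_memory_gb")
        if est is not None and est > max_memory_gb:
            skipped.append(c["model_id"])
        else:
            runnable.append(c["model_id"])
    return runnable, skipped

runnable, skipped = gate_by_memory(
    [{"model_id": "gpt2", "estimated_memory_gb": 1.0},
     {"model_id": "big-model", "estimated_memory_gb": 64.0},
     {"model_id": "unknown", "estimated_memory_gb": None}],
    max_memory_gb=16.0,
)
```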