transformer_lens.tools.model_registry.hf_scraper module

HuggingFace model scraper for discovering compatible models.

This module queries the HuggingFace Hub API to find ALL models and categorize them by architecture: those supported by TransformerLens and those not yet supported.

The scraper works by:

1. Scanning ALL text-generation models on HuggingFace (paginated)
2. Extracting the architecture class from each model's config
3. Categorizing models into supported vs. unsupported based on TransformerLens adapters
4. Building comprehensive lists for both categories

Output format matches the schemas defined in schemas.py exactly, so the data files can be loaded by api.py without any transformation.

Usage:

# Full scan of all HuggingFace models (recommended)
python -m transformer_lens.tools.model_registry.hf_scraper --full-scan

# Quick scan (top N models by downloads)
python -m transformer_lens.tools.model_registry.hf_scraper --limit 10000

# Output to custom directory
python -m transformer_lens.tools.model_registry.hf_scraper --full-scan --output data/

transformer_lens.tools.model_registry.hf_scraper.main()
transformer_lens.tools.model_registry.hf_scraper.scrape_all_models(output_dir: Path, max_models: int | None = None, task: str = 'text-generation', batch_size: int = 1000, checkpoint_interval: int = 5000, min_downloads: int = 500) → tuple[dict, dict]

Scrape ALL models from HuggingFace and categorize by architecture.

This is the comprehensive scraper that:

1. Loads existing models from supported_models.json to preserve them
2. Skips models already in the JSON (only scans new models)
3. Iterates through ALL models for a given task
4. Fetches the architecture from each model's config
5. Categorizes into supported vs. unsupported
6. Saves checkpoints periodically for long runs
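The skip-existing and checkpointing behavior (steps 2 and 6) can be sketched as a loop like the one below; the function name, checkpoint format, and the in-memory `existing` set are illustrative assumptions, while the real scraper pages model ids from the Hub API:

```python
import json
from pathlib import Path


def scan_new_models(model_ids, existing: set[str], checkpoint_path: Path,
                    checkpoint_interval: int = 5000) -> list[str]:
    """Scan only models not already recorded, checkpointing periodically.

    Illustrative sketch: `model_ids` is any iterable of repo ids, and
    `existing` holds the ids already present in supported_models.json.
    """
    scanned: list[str] = []
    for i, model_id in enumerate(model_ids, start=1):
        if model_id in existing:
            continue  # already in the JSON; only scan new models
        scanned.append(model_id)
        if i % checkpoint_interval == 0:
            # Persist progress so a long run can be resumed after interruption.
            checkpoint_path.write_text(json.dumps(scanned))
    return scanned
```

With `checkpoint_interval=5000`, a full multi-hundred-thousand-model scan writes intermediate state regularly, so an interrupted run loses at most one interval of work.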

Output format matches schemas.py exactly (SupportedModelsReport and ArchitectureGapsReport).

Parameters:
  • output_dir – Directory to write JSON data files

  • max_models – Maximum NEW models to scan (None = unlimited/all)

  • task – HuggingFace task filter (default: text-generation)

  • batch_size – Log progress every N models

  • checkpoint_interval – Save checkpoint every N models

  • min_downloads – Minimum download count to include a model (default: 500)
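The `min_downloads` threshold can be pictured as a simple predicate applied to each model's Hub metadata; the helper name and dict shape below are illustrative assumptions (the real scraper reads download counts from the Hub API's model info):

```python
def passes_download_filter(model_info: dict, min_downloads: int = 500) -> bool:
    """Keep a model only if its download count meets the threshold (sketch)."""
    return model_info.get("downloads", 0) >= min_downloads
```

Filtering out low-download models keeps the output focused on repositories people actually use, at the cost of missing new or niche architectures.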

Returns:

Tuple of (supported_models_dict, architecture_gaps_dict)