transformer_lens.tools.model_registry.hf_scraper module
HuggingFace model scraper for discovering compatible models.
This module queries the HuggingFace Hub API to find ALL models and categorize them by architecture - those supported by TransformerLens and those not yet supported.
The scraper works by:

1. Scanning ALL text-generation models on HuggingFace (paginated)
2. Extracting the architecture class from each model’s config
3. Categorizing models into supported vs unsupported based on TransformerLens adapters
4. Building comprehensive lists for both categories
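The categorization step above can be sketched as follows. The architecture names and the `SUPPORTED_ARCHITECTURES` set here are illustrative placeholders, not the actual adapter list maintained by TransformerLens:

```python
# Sketch of the categorization step: bucket model ids by whether their
# config's architecture class is known to TransformerLens.
# SUPPORTED_ARCHITECTURES below is illustrative, not the real adapter list.
SUPPORTED_ARCHITECTURES = {"GPT2LMHeadModel", "LlamaForCausalLM"}

def categorize(models: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split {model_id: architecture} into (supported, unsupported) id lists."""
    supported, unsupported = [], []
    for model_id, architecture in models.items():
        bucket = supported if architecture in SUPPORTED_ARCHITECTURES else unsupported
        bucket.append(model_id)
    return supported, unsupported
```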
Output format matches the schemas defined in schemas.py exactly, so the data files can be loaded by api.py without any transformation.
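Because the scraper writes JSON that already conforms to the schemas, a consumer can read the files back directly. The helper below is a hypothetical sketch of that load step (the field layout in the example is illustrative, not the actual schema from schemas.py):

```python
import json
from pathlib import Path

def load_report(path: Path) -> dict:
    """Load a scraper-produced JSON report as-is.

    No post-processing is needed because the scraper emits JSON that
    already matches the expected schema. (Hypothetical helper for
    illustration; api.py's actual loading code may differ.)
    """
    with path.open() as f:
        return json.load(f)
```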
- Usage:

  # Full scan of all HuggingFace models (recommended)
  python -m transformer_lens.tools.model_registry.hf_scraper --full-scan

  # Quick scan (top N models by downloads)
  python -m transformer_lens.tools.model_registry.hf_scraper --limit 10000

  # Output to custom directory
  python -m transformer_lens.tools.model_registry.hf_scraper --full-scan --output data/
- transformer_lens.tools.model_registry.hf_scraper.main()
- transformer_lens.tools.model_registry.hf_scraper.scrape_all_models(output_dir: Path, max_models: int | None = None, task: str = 'text-generation', batch_size: int = 1000, checkpoint_interval: int = 5000, min_downloads: int = 500) → tuple[dict, dict]
Scrape ALL models from HuggingFace and categorize by architecture.
This is the comprehensive scraper that:

1. Loads existing models from supported_models.json to preserve them
2. Skips models already in the JSON (only scans new models)
3. Iterates through ALL models for a given task
4. Fetches the architecture from each model’s config
5. Categorizes into supported vs unsupported
6. Saves checkpoints periodically for long runs
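The skip-existing and periodic-checkpoint behavior described above might look roughly like this sketch. The function name, file layout, and JSON shape are assumptions for illustration, not the module's actual implementation:

```python
import json
from pathlib import Path

def scan_with_checkpoints(model_ids, output_path: Path, checkpoint_interval: int = 5000):
    """Scan ids not already recorded in output_path, checkpointing periodically.

    Hypothetical sketch: stores the set of scanned ids as a JSON list so a
    long run can be resumed without re-scanning models already seen.
    """
    seen = set(json.loads(output_path.read_text())) if output_path.exists() else set()
    scanned = []
    for i, model_id in enumerate((m for m in model_ids if m not in seen), start=1):
        scanned.append(model_id)  # real code would fetch the config here
        if i % checkpoint_interval == 0:
            output_path.write_text(json.dumps(sorted(seen | set(scanned))))
    output_path.write_text(json.dumps(sorted(seen | set(scanned))))
    return scanned
```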
Output format matches schemas.py exactly (SupportedModelsReport and ArchitectureGapsReport).
- Parameters:
output_dir – Directory to write JSON data files
max_models – Maximum NEW models to scan (None = unlimited/all)
task – HuggingFace task filter (default: text-generation)
batch_size – Log progress every N models
checkpoint_interval – Save checkpoint every N models
min_downloads – Minimum download count to include a model (default: 500)
- Returns:
Tuple of (supported_models_dict, architecture_gaps_dict)