transformer_lens.tools.model_registry.hf_scraper module

HuggingFace model scraper for discovering compatible models.

This module queries the HuggingFace Hub API to find ALL models and categorize them by architecture: those supported by TransformerLens and those not yet supported.

The scraper works by:

1. Scanning ALL text-generation models on HuggingFace (paginated)
2. Extracting the architecture class from each model's config
3. Categorizing models into supported vs. unsupported based on TransformerLens adapters
4. Building comprehensive lists for both categories

Output format matches the schemas defined in schemas.py exactly, so the data files can be loaded by api.py without any transformation.

Usage:

# Full scan of all HuggingFace models (recommended)
python -m transformer_lens.tools.model_registry.hf_scraper --full-scan

# Quick scan (top N models by downloads)
python -m transformer_lens.tools.model_registry.hf_scraper --limit 10000

# Output to custom directory
python -m transformer_lens.tools.model_registry.hf_scraper --full-scan --output data/

transformer_lens.tools.model_registry.hf_scraper.main()
transformer_lens.tools.model_registry.hf_scraper.scrape_all_models(output_dir: Path, max_models: int | None = None, task: str = 'text-generation', batch_size: int = 1000, checkpoint_interval: int = 5000, min_downloads: int = 500) → tuple[dict, dict]

Scrape ALL models from HuggingFace and categorize by architecture.

This is the comprehensive scraper that:

1. Loads existing models from supported_models.json to preserve them
2. Skips models already in the JSON (only scans new models)
3. Iterates through ALL models for a given task
4. Fetches the architecture from each model's config
5. Categorizes into supported vs. unsupported
6. Saves checkpoints periodically for long runs
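The skip-existing and checkpointing behavior (steps 2 and 6) can be sketched as a loop like the one below; the function name, checkpoint format, and the in-memory `existing` set are illustrative assumptions, while the real scraper pages model ids from the Hub API:

```python
import json
from pathlib import Path


def scan_new_models(model_ids, existing: set[str], checkpoint_path: Path,
                    checkpoint_interval: int = 5000) -> list[str]:
    """Scan only models not already recorded, checkpointing periodically.

    Illustrative sketch: `model_ids` is any iterable of repo ids, and
    `existing` holds the ids already present in supported_models.json.
    """
    scanned: list[str] = []
    for i, model_id in enumerate(model_ids, start=1):
        if model_id in existing:
            continue  # already in the JSON; only scan new models
        scanned.append(model_id)
        if i % checkpoint_interval == 0:
            # Persist progress so a long run can be resumed after interruption.
            checkpoint_path.write_text(json.dumps(scanned))
    return scanned
```

With `checkpoint_interval=5000`, a full multi-hundred-thousand-model scan writes intermediate state regularly, so an interrupted run loses at most one interval of work.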

Output format matches schemas.py exactly (SupportedModelsReport and ArchitectureGapsReport).

Parameters:
  • output_dir – Directory to write JSON data files

  • max_models – Maximum NEW models to scan (None = unlimited/all)

  • task – HuggingFace task filter (default: text-generation)

  • batch_size – Log progress every N models

  • checkpoint_interval – Save checkpoint every N models

  • min_downloads – Minimum download count to include a model (default: 500)
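The `min_downloads` threshold can be pictured as a simple predicate applied to each model's Hub metadata; the helper name and dict shape below are illustrative assumptions (the real scraper reads download counts from the Hub API's model info):

```python
def passes_download_filter(model_info: dict, min_downloads: int = 500) -> bool:
    """Keep a model only if its download count meets the threshold (sketch)."""
    return model_info.get("downloads", 0) >= min_downloads
```

Filtering out low-download models keeps the output focused on repositories people actually use, at the cost of missing new or niche architectures.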

Returns:

Tuple of (supported_models_dict, architecture_gaps_dict)