HuggingFace Model Scraper
The HuggingFace model scraper (transformer_lens.tools.model_registry.hf_scraper) discovers HF models, extracts each model’s architecture from its config, and writes two registry files: supported_models.json (models whose architecture has a registered TransformerLens adapter) and architecture_gaps.json (unsupported architectures, scored by relevancy). Most adapter contributors will only need to run it after merging a new architecture adapter to populate the registry with that architecture’s models.
When to run
- After merging a new adapter: to populate supported_models.json with HF models of that architecture so verify_models has things to verify.
- Periodically (maintainers): to refresh the full registry as the HF Hub evolves and the relevancy ranking in architecture_gaps.json shifts.
- When investigating an unsupported architecture: to see how many models exist for it and what the canonical examples are before deciding to write an adapter.
Setup
Sphinx, verify_models, and the scraper all need an HF token to read gated models. Load it into the environment from the repo’s .env (see feedback_hf_token_env.md for the project convention):
set -a; source .env; set +a
Then run from the repo root with uv:
uv run python -m transformer_lens.tools.model_registry.hf_scraper [flags]
Common invocations
Targeted scrape: a single architecture
The most common workflow for adapter contributors. After registering a new adapter (see the four registration sites in contributing.md), populate the registry with models of just that architecture:
uv run python -m transformer_lens.tools.model_registry.hf_scraper \
--architecture LlamaForCausalLM --full-scan
How it queries the Hub. When --architecture names an architecture with registered canonical orgs (in CANONICAL_AUTHORS_BY_ARCH), the scraper skips the global --task pagination entirely and runs only the per-author canonical sweep. That’s the only HF-side narrowing available — the Hub API doesn’t expose config.architectures[0] as a filter, but it does expose author, and the canonical-orgs map is exact. For --architecture LlamaForCausalLM this means ~4 paginated list_models(author=...) calls instead of iterating every text-generation model on the Hub.
If the architecture is not in CANONICAL_AUTHORS_BY_ARCH (e.g., a new arch where you haven’t yet populated the canonical-orgs entry), the scraper falls back to the global scan with a client-side filter — slower, but still complete. You’ll see a log line at startup indicating which path was taken.
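The choice between the two paths can be sketched as below. This is a minimal illustration under stated assumptions, not the scraper's actual code: plan_hub_queries is a hypothetical helper, the org names in the map are placeholders, and the returned dictionaries are illustrative rather than exact list_models arguments. Only the name CANONICAL_AUTHORS_BY_ARCH comes from the text above.

```python
from typing import Optional

# Placeholder contents (assumption); only the map's name comes from the docs.
CANONICAL_AUTHORS_BY_ARCH = {
    "LlamaForCausalLM": ["example-org-a", "example-org-b"],
}


def plan_hub_queries(architecture: Optional[str], task: str = "text-generation") -> dict:
    """Decide which Hub listing strategy a scrape would use (hypothetical helper)."""
    authors = CANONICAL_AUTHORS_BY_ARCH.get(architecture or "", [])
    if architecture and authors:
        # Targeted path: one paginated listing per canonical author.
        return {
            "path": "canonical-sweep",
            "queries": [{"author": author} for author in authors],
        }
    # Fallback path: global listing by task tag, narrowed client-side if needed.
    return {
        "path": "global-scan",
        "queries": [{"tag": task}],
        "client_side_filter": architecture,
    }


plan = plan_hub_queries("LlamaForCausalLM")
print(plan["path"], len(plan["queries"]))
```

An architecture missing from the map, or no architecture at all, falls through to the global scan, which mirrors the fallback described above.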
Existing entries in supported_models.json are preserved — the scraper appends, it does not overwrite. So a targeted scrape lands new entries for the requested architecture alongside everything that was already there.
The architecture string must match config.architectures[0] from the HF model’s config exactly — i.e., the same string the adapter is keyed under in SUPPORTED_ARCHITECTURES.
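To make the exact-match rule concrete, here is a small sketch. It assumes the model config is available as a plain dictionary; primary_architecture and matches_architecture are hypothetical names used for illustration, not functions from the scraper.

```python
from typing import Optional


def primary_architecture(config: dict) -> Optional[str]:
    """Return config['architectures'][0], or None if the list is missing or empty."""
    architectures = config.get("architectures") or []
    return architectures[0] if architectures else None


def matches_architecture(config: dict, wanted: str) -> bool:
    # Exact, case-sensitive comparison: no aliases, prefixes, or substrings.
    return primary_architecture(config) == wanted


configs = [
    {"architectures": ["LlamaForCausalLM"]},
    {"architectures": ["LlamaModel"]},
    {"model_type": "llama"},  # no architectures field at all
]
print([matches_architecture(c, "LlamaForCausalLM") for c in configs])
# [True, False, False]
```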
Trade-off vs. a full scan. The canonical-sweep-only path is fast and exhaustive within the canonical orgs, but it misses community fine-tunes from non-canonical orgs (e.g., a random user fine-tuning Llama-2-7b for a specific task). For populating the registry after adding a new adapter, that’s usually fine — foundation-org checkpoints are what verify_models should run against first. If you want fine-tune coverage too, follow up with a periodic full scan (no --architecture flag).
Full scan: all architectures
The comprehensive refresh. Iterates every model on the Hub matching --task, extracts each architecture, and updates both the supported and gaps reports. Saves checkpoints periodically; safe to interrupt and resume.
uv run python -m transformer_lens.tools.model_registry.hf_scraper --full-scan
Quick scan: top-N by downloads
Smoke test the scraper or refresh just the most-popular models:
uv run python -m transformer_lens.tools.model_registry.hf_scraper --limit 10000
Isolated/exploratory scrape
Write to a scratch directory to avoid touching the committed registry:
uv run python -m transformer_lens.tools.model_registry.hf_scraper \
--architecture <ArchClass> --full-scan -o ./tmp/scrape/
Output
Two JSON files are written to --output (default: transformer_lens/tools/model_registry/data/):
- supported_models.json: one entry per model whose config.architectures[0] matches a key in SUPPORTED_ARCHITECTURES. Each entry has architecture_id, model_id, status, per-phase verification scores, and metadata. Schema: SupportedModelsReport in schemas.py.
- architecture_gaps.json: one entry per unsupported architecture, with model count, total downloads, smallest known parameter count, sample model IDs, and a computed relevancy_score. Sorted by relevancy descending. Schema: ArchitectureGapsReport in schemas.py.
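Once the entries are loaded, a quick summary might look like the sketch below. It works on plain lists of entry dictionaries, because the top-level layout of the two files is defined in schemas.py and is not shown here. The sample data, the architecture_id key on gap entries, and both helper names are assumptions made for illustration.

```python
from collections import Counter


def models_per_architecture(entries: list) -> Counter:
    """Count supported-model entries by their architecture_id field."""
    return Counter(entry["architecture_id"] for entry in entries)


def top_gaps(gaps: list, n: int = 3) -> list:
    """Return the n highest-relevancy gap entries, highest first."""
    return sorted(gaps, key=lambda gap: gap["relevancy_score"], reverse=True)[:n]


# Made-up sample data; real entries carry more fields.
supported = [
    {"architecture_id": "LlamaForCausalLM", "model_id": "example-org/model-a"},
    {"architecture_id": "LlamaForCausalLM", "model_id": "example-org/model-b"},
    {"architecture_id": "GPT2LMHeadModel", "model_id": "example-org/model-c"},
]
gaps = [
    {"architecture_id": "ArchA", "relevancy_score": 0.2},
    {"architecture_id": "ArchB", "relevancy_score": 0.9},
]

print(models_per_architecture(supported).most_common())
print([gap["architecture_id"] for gap in top_gaps(gaps, n=1)])
```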
A verification_history.json placeholder is also written if it doesn’t already exist; verify_models is what actually populates it.
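Flags
| Flag | Default | Purpose |
|---|---|---|
| --architecture | none | Only include models whose config.architectures[0] matches this string exactly. |
| --full-scan | off | Scan every model matching --task. |
| --limit | 10000 | Cap scan at N models (ignored with --full-scan). |
| --task | text-generation | HF tag to filter by. Use text2text-generation for seq2seq architectures. |
| --output, -o | transformer_lens/tools/model_registry/data/ | Where to write JSON files. |
| | 500 | Skip models below this download threshold. Canonical-org models bypass this via the sweep. |
| | 5000 | Save a checkpoint every N scanned models. |
| | off | Skip the post-scan sweep that admits canonical-org models below the download threshold. |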
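The interaction between the download threshold and the canonical-org sweep can be sketched as a single predicate. This is an illustration under assumptions: admit_model and its parameter names are hypothetical, and the real scraper applies the sweep as a separate post-scan pass rather than inline.

```python
def admit_model(
    downloads: int,
    author: str,
    canonical_authors: set,
    min_downloads: int = 500,
    canonical_sweep: bool = True,
) -> bool:
    """Return True if a model would end up in the registry (hypothetical helper)."""
    if downloads >= min_downloads:
        # Popular enough to pass the download threshold on its own.
        return True
    # Below the threshold: only canonical-org models get in, and only
    # when the post-scan sweep has not been switched off.
    return canonical_sweep and author in canonical_authors


canonical = {"example-org"}
print(admit_model(10, "example-org", canonical))
print(admit_model(10, "someone-else", canonical))
```

The first call admits a low-download model because its author is canonical; the second rejects an equally unpopular model from another author.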
Resumption
The scraper saves checkpoints periodically and on Ctrl-C / network error / HTTP 429. Re-running with the same arguments resumes from the last checkpoint. Checkpoints live at <output>/scrape_checkpoint.json and are deleted on successful completion.
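The save, resume, and delete cycle can be sketched as follows. The checkpoint file name comes from the text above; the state shape ({"scanned": N}), the helper names, and the write-to-temp-then-rename step are assumptions chosen for illustration and may differ from what the scraper does.

```python
import json
import tempfile
from pathlib import Path

CHECKPOINT_NAME = "scrape_checkpoint.json"  # file name taken from the text above


def save_checkpoint(output_dir: Path, state: dict) -> None:
    """Write via a temp file so an interrupted write leaves the old checkpoint intact."""
    target = output_dir / CHECKPOINT_NAME
    tmp = target.with_name(CHECKPOINT_NAME + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(target)


def load_checkpoint(output_dir: Path) -> dict:
    """Return the saved state, or a fresh one when no checkpoint exists."""
    target = output_dir / CHECKPOINT_NAME
    if target.exists():
        return json.loads(target.read_text())
    return {"scanned": 0}  # hypothetical state shape


def clear_checkpoint(output_dir: Path) -> None:
    """Delete the checkpoint after a successful run."""
    (output_dir / CHECKPOINT_NAME).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as scratch:
    out = Path(scratch)
    save_checkpoint(out, {"scanned": 5000})
    resumed = load_checkpoint(out)  # a re-run picks up the saved state
    clear_checkpoint(out)           # successful completion removes the file
    fresh = load_checkpoint(out)    # the next run starts from scratch

print(resumed, fresh)
```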
If you see a 429 mid-run, the scraper waits and retries automatically (up to 10 attempts, exponentially backed off). No action needed.
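For readers unfamiliar with the pattern, retrying with exponential backoff looks roughly like this. The 10-attempt cap is from the text above; the 1-second base delay, the RateLimited exception, and call_with_backoff are assumptions for the sketch, not the scraper's real names or timings.

```python
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response (hypothetical, not a huggingface_hub class)."""


def call_with_backoff(fn, max_attempts=10, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on RateLimited with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: fail twice, then succeed. Delays are recorded instead of slept.
delays = []
responses = iter([RateLimited(), RateLimited(), "ok"])


def flaky():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
# ok [1.0, 2.0]
```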
Workflow: adding a new adapter
1. Implement the adapter. See adapter-creation-guide.md.
2. Register it in the four places listed in contributing.md: supported_architectures/__init__.py, factories/architecture_adapter_factory.py, tools/model_registry/__init__.py (both HF_SUPPORTED_ARCHITECTURES and CANONICAL_AUTHORS_BY_ARCH), and tools/model_registry/generate_report.py.
3. Run the registry-sync test to confirm the four sites agree:
   uv run pytest tests/unit/tools/test_model_registry.py -k TestRegistrySyncedWithFactory
4. Run a targeted scrape to populate the registry with HF models of the new architecture:
   set -a; source .env; set +a
   uv run python -m transformer_lens.tools.model_registry.hf_scraper \
     --architecture <YourArchClass> --full-scan
5. Verify the discovered models with verify_models, smallest-first:
   uv run python -m transformer_lens.tools.model_registry.verify_models --model <hf_repo>
6. Commit the updated supported_models.json (and any verification_history.json changes from step 5).
Caveats
- Rate limiting. HF Hub allows ~1000 requests per 5 minutes. The scraper uses list_models(expand=['config', 'safetensors']) to fetch inline metadata, so it spends ~200 paginated calls on a full ~200K-model scan, well under the limit. The retry/backoff is there for transient blips, not as a workaround.
- Quantized variants are filtered. AWQ, GPTQ, GGUF, bnb, and FP8 checkpoints are dropped at the is_quantized_model check. TransformerLens requires full-precision weights.
- --task matches HF tags, not pipeline_tag. Encoder-decoder models tagged text2text-generation are discoverable under the default --task text-generation only via tag overlap. For seq2seq architectures (T5, mT5), pass --task text2text-generation explicitly to be safe.
- The architecture filter is exact-match against config.architectures[0]. It does not accept aliases or partial matches. A targeted scrape for LlamaForCausalLM will not surface LlamaModel checkpoints (which have a different primary architecture string), nor variants like Llama4ForCausalLM if they appear in the future.
- Existing registry data is preserved, not filtered. A targeted scrape adds new entries matching the filter; it does not remove unrelated entries from supported_models.json. To inspect just the targeted architecture’s results in isolation, write to a scratch directory with -o.
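As a rough idea of what a quantization check could look like, the sketch below flags a repo when its ID or tags contain one of the format names listed above. This is an assumed heuristic for illustration only: looks_quantized is a hypothetical function, and the real is_quantized_model may use different signals.

```python
import re

# Markers taken from the formats named above; the matching rule is an assumption.
QUANT_MARKERS = {"awq", "gptq", "gguf", "bnb", "fp8"}
_QUANT_PATTERN = re.compile(
    r"(?:^|[-_./])(?:" + "|".join(sorted(QUANT_MARKERS)) + r")(?:$|[-_./])",
    re.IGNORECASE,
)


def looks_quantized(model_id: str, tags=()) -> bool:
    """Heuristic: a marker appears as a delimited token in the repo ID or as a tag."""
    if _QUANT_PATTERN.search(model_id):
        return True
    return any(tag.lower() in QUANT_MARKERS for tag in tags)


for repo in ["example-org/Model-7B-GPTQ", "example-org/model-7b-bnb-4bit", "example-org/Model-7B"]:
    print(repo, looks_quantized(repo))
```

Matching on delimited tokens rather than raw substrings avoids flagging a repo merely because a marker's letters happen to appear inside a longer word.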