HuggingFace Model Analysis Guide¶
This guide explains how to analyze a HuggingFace model to extract the information needed to build a TransformerLens Architecture Adapter.
Read the model’s config.json¶
Every HF model has a config.json that contains architecture details. You can access it via:
from transformers import AutoConfig
config = AutoConfig.from_pretrained("model-name-or-path")
print(config)
Or via the HuggingFace API:
curl -s "https://huggingface.co/model-name/resolve/main/config.json" | python -m json.tool
Key config fields to extract¶
| HF Config Field | TL Config Field | Description |
|---|---|---|
| `hidden_size` | `d_model` | Model dimension |
| `num_attention_heads` | `n_heads` | Number of attention heads |
| `num_key_value_heads` | `n_key_value_heads` | KV heads (for GQA; if absent or equal to `n_heads`, not GQA) |
| `intermediate_size` | `d_mlp` | MLP intermediate dimension |
| `num_hidden_layers` | `n_layers` | Number of transformer blocks |
| `vocab_size` | `d_vocab` | Vocabulary size |
| `max_position_embeddings` | `n_ctx` | Maximum sequence length |
| `rms_norm_eps` / `layer_norm_epsilon` | `eps` | Normalization epsilon |
| `model_type` | — | Architecture family (e.g., "llama", "gpt2", "mistral") |
| `architectures` | `original_architecture` | HF class name (e.g., `LlamaForCausalLM`) |
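As a rough illustration, the extraction can be scripted. This is a minimal sketch assuming Llama-style field names (GPT-2-style configs use `n_embd`, `n_head`, `n_layer`, etc., though `AutoConfig` maps several of these automatically):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("model-name-or-path")

# Sketch of the HF -> TL field mapping; getattr with defaults guards
# against fields that are absent on some architectures.
tl_fields = {
    "d_model": config.hidden_size,
    "n_heads": config.num_attention_heads,
    "n_key_value_heads": getattr(config, "num_key_value_heads", None),
    "d_mlp": config.intermediate_size,
    "n_layers": config.num_hidden_layers,
    "d_vocab": config.vocab_size,
    "n_ctx": config.max_position_embeddings,
    "eps": getattr(config, "rms_norm_eps", None) or getattr(config, "layer_norm_eps", None),
}
print(tl_fields)
print(config.model_type, config.architectures)
```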
Determine architecture characteristics¶
Normalization type¶
Check the model code or config:
- RMSNorm → `normalization_type = "RMS"` — Look for `RMSNorm` in the model code, or `rms_norm_eps` in config
- LayerNorm → `normalization_type = "LN"` — Look for `LayerNorm`, or `layer_norm_eps`/`layer_norm_epsilon` in config
Also identify the epsilon attribute name:
"variance_epsilon"(Llama)"rms_norm_eps"(some models expose this directly)"layer_norm_eps"(GPT-2, BERT)"eps"(generic)
Positional embedding type¶
- Rotary (RoPE) → `positional_embedding_type = "rotary"` — Most modern models (Llama, Mistral, Qwen, Gemma)
- Learned/Standard → `positional_embedding_type = "standard"` — GPT-2, OPT
- Check for a `RotaryEmbedding` class in the model code
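A config-only heuristic for RoPE, based on the RoPE-related fields that most rotary models expose (hedged; confirm against the model code):

```python
def guess_positional_embedding(config):
    """Heuristic: rotary models usually expose a RoPE-related config field."""
    rope_fields = ("rope_theta", "rotary_dim", "rotary_pct", "rope_scaling")
    if any(getattr(config, field, None) is not None for field in rope_fields):
        return "rotary"
    # Otherwise assume learned positions; confirm by looking for a wpe /
    # embed_positions module when inspecting named_modules() below.
    return "standard"

print(guess_positional_embedding(config))
```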
Attention type¶
- Multi-Head Attention (MHA) — `n_key_value_heads == n_heads` or field absent
- Grouped Query Attention (GQA) — `n_key_value_heads < n_heads` (e.g., Llama 3, Mistral)
- Multi-Query Attention (MQA) — `n_key_value_heads == 1` (e.g., Falcon)
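These cases can be distinguished directly from the config (sketch, reusing the `config` object loaded earlier):

```python
n_heads = config.num_attention_heads
n_kv_heads = getattr(config, "num_key_value_heads", None)

if n_kv_heads is None or n_kv_heads == n_heads:
    print("MHA")  # no special handling needed
elif n_kv_heads == 1:
    print("MQA")  # single shared KV head
else:
    print(f"GQA ({n_kv_heads} KV heads for {n_heads} query heads)")
```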
MLP type¶
- Gated MLP (SwiGLU) → `gated_mlp = True` — Has gate/up/down projections (Llama, Qwen, Gemma)
- Standard MLP → `gated_mlp = False` — Has fc1/fc2 or c_fc/c_proj (GPT-2)
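Gating is easiest to confirm from module names once the model is loaded (see "Inspect module names" below). A sketch, assuming the Llama/Qwen/Gemma `gate_proj` naming:

```python
# Assumes `model` has already been loaded with AutoModelForCausalLM (see below).
mlp_leaf_names = {name.split(".")[-1] for name, _ in model.named_modules() if ".mlp." in name}
gated_mlp = "gate_proj" in mlp_leaf_names
print("gated_mlp =", gated_mlp)
```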
QKV layout¶
- Separate Q/K/V — Most models: `q_proj`, `k_proj`, `v_proj`
- Combined QKV — GPT-2 style: single `c_attn` or `query_key_value` linear layer
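Similarly, the QKV layout can be read off the module names (sketch; assumes `model` is loaded as shown in the next section):

```python
leaf_names = {name.split(".")[-1] for name, _ in model.named_modules()}

if {"q_proj", "k_proj", "v_proj"} <= leaf_names:
    print("Separate Q/K/V projections")
elif "c_attn" in leaf_names or "query_key_value" in leaf_names:
    print("Combined QKV projection")
```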
Inspect module names¶
To find the exact HuggingFace module paths for the component mapping:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("model-name", torch_dtype="auto")
# Print all named modules
for name, module in model.named_modules():
    print(f"{name}: {type(module).__name__}")
What to look for¶
Map these HF module paths to TL component mapping entries:
| TL Name | Look for in HF | Common HF Paths |
|---|---|---|
| `embed` | Token embedding | `model.embed_tokens`, `transformer.wte` |
| `pos_embed` | Position embedding (if standard) | `transformer.wpe`, `model.decoder.embed_positions` |
| `rotary_emb` | Rotary embedding (if RoPE) | `model.rotary_emb`, `self_attn.rotary_emb` |
| `blocks` | Layer list | `model.layers`, `transformer.h` |
| `ln1` | Pre-attention norm | `input_layernorm`, `ln_1` |
| `ln2` | Post-attention norm | `post_attention_layernorm`, `ln_2` |
| `attn` | Self-attention module | `self_attn`, `attn` |
| `q` | Query projection | `q_proj` |
| `k` | Key projection | `k_proj` |
| `v` | Value projection | `v_proj` |
| `o` | Output projection | `o_proj`, `c_proj` |
| `qkv` | Combined QKV (if used) | `c_attn`, `query_key_value` |
| `mlp` | MLP module | `mlp` |
| `gate` | Gate projection (if gated) | `gate_proj` |
| `in` | Up/input projection | `up_proj`, `c_fc` |
| `out` | Down/output projection | `down_proj`, `c_proj` |
| `ln_final` | Final layer norm | `model.norm`, `transformer.ln_f` |
| `unembed` | LM head | `lm_head` |
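Once candidate paths are identified, `get_submodule` (a standard `torch.nn.Module` method) can confirm that each one resolves. The paths below are illustrative Llama-style guesses; substitute the ones you actually found for your model:

```python
candidate_paths = [
    "model.embed_tokens",               # token embedding
    "model.layers.0.input_layernorm",   # pre-attention norm
    "model.layers.0.self_attn.q_proj",  # query projection
    "model.layers.0.mlp.gate_proj",     # gate projection (gated MLPs only)
    "model.norm",                       # final norm
    "lm_head",                          # LM head
]
for path in candidate_paths:
    try:
        module = model.get_submodule(path)
        print(f"OK       {path}: {type(module).__name__}")
    except AttributeError:
        print(f"MISSING  {path}")
```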
Check for biases¶
# Check if a specific layer has bias (Llama-style module path; adjust for your model)
layer = model.model.layers[0]
print(f"Q bias: {layer.self_attn.q_proj.bias is not None}")
print(f"MLP in bias: {layer.mlp.up_proj.bias is not None}")
Document which layers lack biases — this affects the “Optional Parameters” section of the adapter docstring.
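Rather than spot-checking one layer at a time, a short scan over all leaf modules can record bias presence per projection name (sketch; assumes bias presence is consistent across layers):

```python
# Record, per leaf module name, whether the layer carries a bias.
# The hasattr-based check also covers GPT-2's Conv1D layers, which are not nn.Linear.
bias_by_name = {}
for name, module in model.named_modules():
    if list(module.children()):
        continue  # only look at leaf modules
    if hasattr(module, "weight") and hasattr(module, "bias"):
        bias_by_name.setdefault(name.split(".")[-1], module.bias is not None)

for leaf_name, has_bias in sorted(bias_by_name.items()):
    print(f"{leaf_name}: bias={has_bias}")
```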
Examine state dict keys¶
# Print all parameter names and shapes
for key, param in model.state_dict().items():
    print(f"{key}: {param.shape}")
This helps verify:
- Weight naming patterns match your component mapping
- Tensor shapes match expected dimensions
- No unexpected parameters that need special handling
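A minimal sanity check of a few shapes against the config, assuming Llama-style parameter names (adjust the keys to whatever your state dict actually uses):

```python
sd = model.state_dict()
d_model = config.hidden_size

# Token embedding should be (d_vocab, d_model).
emb = sd["model.embed_tokens.weight"]
assert emb.shape == (config.vocab_size, d_model), emb.shape

# Q/K/V have d_model as their input dimension; for GQA, K and V
# have fewer output rows than Q.
for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
    w = sd[f"model.layers.0.self_attn.{proj}.weight"]
    print(proj, tuple(w.shape))
```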
Find an existing similar adapter¶
Check if a similar architecture already has an adapter. Most new models are variants of existing patterns:
| If your model is like… | Start from adapter… |
|---|---|
| Llama, Mistral, Qwen2, Gemma | the Llama-family adapter |
| GPT-2, GPT-J | the GPT-2-family adapter |
| BLOOM, Falcon | the BLOOM or Falcon adapter |
| T5, encoder-decoder | the T5 adapter |
| MoE model | an existing MoE adapter |
| Multimodal (vision+text) | an existing multimodal adapter |
Quick reference: decision tree¶
1. Does the model use RMSNorm or LayerNorm?
→ RMSNorm: normalization_type="RMS", use RMSNormalizationBridge
→ LayerNorm: normalization_type="LN", use NormalizationBridge
2. Does the model use RoPE or learned positional embeddings?
→ RoPE: positional_embedding_type="rotary", add RotaryEmbeddingBridge, use PositionEmbeddingsAttentionBridge
→ Learned: positional_embedding_type="standard", add PosEmbedBridge
3. Are Q/K/V separate or combined?
→ Separate: use PositionEmbeddingsAttentionBridge with q/k/v/o submodules
→ Combined: use JointQKVAttentionBridge with qkv/o submodules
4. Does the MLP have a gate projection?
→ Yes (gate+up+down): gated_mlp=True, use GatedMLPBridge
→ No (in+out): gated_mlp=False, use MLPBridge
5. Is n_key_value_heads < n_heads?
→ Yes: GQA — set n_key_value_heads on cfg
→ No: standard MHA — no special handling needed
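The config-readable parts of this decision tree can be collected into one helper. This is a heuristic sketch using the field names discussed above; steps 3 and 4 still require inspecting module names:

```python
def summarize_architecture(config):
    """Print decision-tree answers that can be read from the config alone (heuristic)."""
    norm = "RMS" if getattr(config, "rms_norm_eps", None) is not None else "LN"
    rotary = any(
        getattr(config, field, None) is not None
        for field in ("rope_theta", "rotary_dim", "rotary_pct")
    )
    n_heads = config.num_attention_heads
    n_kv_heads = getattr(config, "num_key_value_heads", None)
    gqa = n_kv_heads is not None and n_kv_heads < n_heads

    print(f"1. normalization_type = {norm!r}")
    print(f"2. positional_embedding_type = {'rotary' if rotary else 'standard'}")
    print(f"5. GQA: {gqa} (n_heads={n_heads}, n_key_value_heads={n_kv_heads})")
    # Steps 3 (QKV layout) and 4 (gated MLP) are module-level properties;
    # confirm them with named_modules() as shown above.

summarize_architecture(config)
```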