transformer_lens.model_bridge.supported_architectures.smollm3 module

SmolLM3 architecture adapter.

SmolLM3 (the HuggingFaceTB SmolLM3 family, base and instruct) is a Llama-family decoder. It pairs pre-norm RMSNorm blocks with grouped-query attention (GQA), a SwiGLU gated MLP, rotary position embeddings (RoPE), tied input and output embeddings, and no biases on any projection. The one feature that sets it apart from a plain Llama or Qwen2 decoder is NoPE (No Positional Encoding): RoPE is skipped on a periodic subset of layers. That behaviour is the only piece of this adapter that is not a near-verbatim clone of qwen2.py, and it is handled by the small _SmolLM3AttentionBridge subclass below.

class transformer_lens.model_bridge.supported_architectures.smollm3.SmolLM3ArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for SmolLM3 models.

SmolLM3 is a pre-norm decoder with RMSNorm, grouped-query attention (GQA), a SwiGLU gated MLP, rotary position embeddings (RoPE), tied input and output embeddings, and no biases on any projection. The block shape matches Llama and Qwen2 exactly, so the component mapping and weight conversions mirror qwen2.py.

NoPE (No Positional Encoding): SmolLM3 disables RoPE on every no_rope_layer_interval-th layer (default every 4th) via config.no_rope_layers. That per-layer toggle lives inside HF’s SmolLM3Attention.forward, but the bridge reimplements attention and would otherwise rotate Q and K on those layers. The _SmolLM3AttentionBridge subclass handles it by suppressing position embeddings on NoPE layers, so the reimplemented attention matches HF.
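As an illustration of the periodic schedule described above, the sketch below derives a per-layer flag list for a hypothetical 8-layer model with the default interval of 4. It assumes that a flag of 1 means RoPE is applied and 0 marks a NoPE layer, and that layers are counted from 1 when testing the interval. rope_layer_flags is a hypothetical helper written for this example, not a function of the adapter or of HF transformers.

```python
def rope_layer_flags(num_layers: int, interval: int = 4) -> list[int]:
    """Return one flag per layer: 1 = apply RoPE, 0 = NoPE (assumed convention)."""
    # Layers are counted from 1, so with interval=4 the 4th, 8th, ... layers are NoPE.
    return [int((layer_idx + 1) % interval != 0) for layer_idx in range(num_layers)]


flags = rope_layer_flags(num_layers=8, interval=4)

# Zero-based indices of the layers on which position embeddings would be suppressed.
nope_layers = [idx for idx, use_rope in enumerate(flags) if not use_rope]

print(flags)        # [1, 1, 1, 0, 1, 1, 1, 0]
print(nope_layers)  # [3, 7]
```

Under these assumptions, an attention bridge for layer index 3 or 7 would skip the rotary step, while every other layer would rotate Q and K as usual.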

No Q/K normalization: unlike Qwen3, SmolLM3 has no per-head Q or K RMSNorm, so the attention block uses the plain q/k/v/o submodules.

Optional Parameters (may not exist in state_dict):

SmolLM3 has no bias on any linear projection, and its RMSNorm layers have no bias term, so the following parameters are absent:

  • blocks.{i}.attn.b_Q - No bias on query projection

  • blocks.{i}.attn.b_K - No bias on key projection

  • blocks.{i}.attn.b_V - No bias on value projection

  • blocks.{i}.attn.b_O - No bias on output projection

  • blocks.{i}.mlp.b_in - No bias on MLP input (up_proj)

  • blocks.{i}.mlp.b_gate - No bias on MLP gate projection

  • blocks.{i}.mlp.b_out - No bias on MLP output (down_proj)

  • blocks.{i}.ln1.b - RMSNorm has no bias

  • blocks.{i}.ln2.b - RMSNorm has no bias

  • ln_final.b - RMSNorm has no bias

Weight processing must handle these missing biases gracefully, either by reading them through ProcessWeights._safe_get_tensor() or by checking for None values.
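The None-checking pattern mentioned above can be sketched as follows. This is a minimal, dependency-free illustration: get_optional and add_bias are hypothetical helpers, not the actual ProcessWeights._safe_get_tensor() implementation, and plain Python lists stand in for tensors. The key blocks.0.attn.W_Q is used only as a toy example of a parameter that is present.

```python
from typing import Optional


def get_optional(state_dict: dict, key: str) -> Optional[list]:
    """Return the entry for `key`, or None when the parameter is absent."""
    return state_dict.get(key)


def add_bias(values: list, bias: Optional[list]) -> list:
    """Add `bias` elementwise, treating a missing bias as zero."""
    if bias is None:
        return list(values)
    return [v + b for v, b in zip(values, bias)]


# A toy state dict shaped like a bias-free block: a weight exists, b_Q does not.
state_dict = {"blocks.0.attn.W_Q": [0.5, -1.0, 2.0]}

weight = get_optional(state_dict, "blocks.0.attn.W_Q")
bias = get_optional(state_dict, "blocks.0.attn.b_Q")  # None: key is missing

# The bias-aware step passes the values through unchanged when there is no bias.
result = add_bias(weight, bias)
```

The point of the pattern is that a missing key never raises: the lookup yields None, and every consumer of the bias checks for None before using it.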

__init__(cfg: Any) → None

Initialize the SmolLM3 architecture adapter.

setup_component_testing(hf_model: Any, bridge_model: Any = None) → None

Wire rotary embeddings and force eager attention for component testing.

SmolLM3 uses RoPE on most layers (a periodic subset are NoPE, handled by the attention bridge). We set the shared rotary_emb reference on every attention bridge instance and pin eager attention so the bridge’s reimplemented forward matches the HF reference numerically. Setting rotary_emb on NoPE-layer bridges is harmless: those bridges suppress position embeddings before the rotary step, so the reference goes unused there.

Parameters:
  • hf_model – The HuggingFace SmolLM3 model instance.

  • bridge_model – The TransformerBridge model, when available, so the rotary reference is set on the live attention bridge instances.
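To illustrate why setting the shared rotary reference on every bridge is harmless, the toy sketch below wires one rotary function onto four stand-in bridges, the last of which is a NoPE layer. ToyAttentionBridge, position_encode, and rotate_pair are hypothetical names invented for this example and do not reflect the real bridge classes. The rotary step is simplified to a single 2-dimensional rotation rather than the full multi-frequency RoPE.

```python
import math


def rotate_pair(x: list[float], position: int, theta: float = 1.0) -> list[float]:
    """Toy rotary step: rotate a 2-dimensional vector by position * theta radians."""
    angle = position * theta
    cos, sin = math.cos(angle), math.sin(angle)
    return [x[0] * cos - x[1] * sin, x[0] * sin + x[1] * cos]


class ToyAttentionBridge:
    """Hypothetical stand-in for an attention bridge; not the real class."""

    def __init__(self, use_rope: bool):
        self.use_rope = use_rope
        self.rotary_emb = None  # filled in later by the wiring step

    def position_encode(self, q: list[float], position: int) -> list[float]:
        # NoPE layers return before the rotary step, so rotary_emb is never called.
        if not self.use_rope:
            return list(q)
        return self.rotary_emb(q, position)


# One shared rotary reference, set on every bridge regardless of its NoPE status.
bridges = [ToyAttentionBridge(use_rope=flag) for flag in (True, True, True, False)]
for bridge in bridges:
    bridge.rotary_emb = rotate_pair

q = [1.0, 0.0]
rope_out = bridges[0].position_encode(q, position=1)  # rotated by 1 radian
nope_out = bridges[3].position_encode(q, position=1)  # returned unchanged
```

In this sketch the NoPE bridge holds the same rotary_emb reference as the others but never reaches the call, which mirrors the behaviour described for the NoPE-layer bridges.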