transformer_lens.model_bridge.supported_architectures.nemotron_h module

Nemotron-H hybrid Mamba2-Transformer architecture adapter.

Supports NemotronHForCausalLM (nvidia/Nemotron-H-8B-Base, Nemotron-H-47B-A13B).

Architecture overview:

- Heterogeneous layers defined by config.layers_block_type; each element is one of "mamba", "attention", "moe", or "mlp".
- ~8% of layers are standard GQA attention; the rest are Mamba-2 SSM, dense MLP, or sparse MoE. All share a single pre-norm (block.norm) and a single residual path; there is no ln2 or post-attention norm.
- Each block exposes a single .mixer attribute whose type varies by layer.
- No model-level rotary embedding module: attention handles RoPE internally via position_ids passed from the outer model loop.
- Stateful generation: uses DynamicCache (transformers ≥ 5.12), which carries both KV-cache entries (attention layers) and SSM conv/recurrent states (Mamba layers) in a unified object.
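The per-layer layout described above can be sketched in a few lines. The layers_block_type list below is a made-up 52-layer pattern, chosen only so that 4 layers are attention (roughly 8%); it is not the layout of any released checkpoint, and summarize_layers is a hypothetical helper rather than part of TransformerLens.

```python
from collections import Counter

# Hypothetical layout: 52 layers, 4 of them attention. Real checkpoints
# define their own pattern in config.layers_block_type.
layers_block_type = (["mamba", "mlp"] * 6 + ["attention"]) * 4

VALID_BLOCK_TYPES = {"mamba", "attention", "moe", "mlp"}


def summarize_layers(block_types):
    """Return (count per mixer type, fraction of attention layers)."""
    unknown = set(block_types) - VALID_BLOCK_TYPES
    if unknown:
        raise ValueError(f"unexpected block types: {sorted(unknown)}")
    counts = Counter(block_types)
    return counts, counts["attention"] / len(block_types)


counts, attention_fraction = summarize_layers(layers_block_type)
print(dict(counts), f"{attention_fraction:.1%}")
```

With a real model, the same list would presumably be read from the loaded config rather than written by hand.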

Key adapter decisions:

- SSMBlockBridge is used as the block container. It delegates the entire forward to the HF block, giving hook_in / hook_out on the residual stream without hardcoding transformer-specific hook positions (hook_resid_mid, hook_mlp_in, etc.) that do not exist in this single-norm architecture.
- SSM2MixerBridge wraps .mixer for all layer types. Its forward is a pure passthrough (original_component(*args, **kwargs)) so it works correctly for attention, MLP, and MoE mixers as well as Mamba ones. Mamba-specific inner submodules (in_proj, conv1d, inner_norm, out_proj) are declared optional=True so setup skips them gracefully on non-Mamba layers.
- MLP layers use relu2 activation (not SwiGLU); gated_mlp = False.
- applicable_phases = []: verify_models is transformer-shaped and would require a dedicated refactor to cover SSM hybrids. Coverage lives in the integration test instead.
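The passthrough and optional-submodule ideas above can be illustrated with a toy wrapper. This is a minimal sketch in plain Python, not the SSM2MixerBridge implementation: PassthroughMixer, ToyMambaMixer, and ToyAttentionMixer are hypothetical names, and the sketch assumes for simplicity that the optional submodule names match attribute names on the wrapped object.

```python
class PassthroughMixer:
    """Toy stand-in for a mixer wrapper: forwards every call unchanged."""

    # Submodule names that only exist on Mamba mixers (taken from the text above).
    OPTIONAL_SUBMODULES = ("in_proj", "conv1d", "inner_norm", "out_proj")

    def __init__(self, original_component):
        self.original_component = original_component
        # Keep only the optional submodules this particular mixer actually has,
        # so non-Mamba mixers are wrapped without errors.
        self.submodules = {
            name: getattr(original_component, name)
            for name in self.OPTIONAL_SUBMODULES
            if hasattr(original_component, name)
        }

    def __call__(self, *args, **kwargs):
        # Pure passthrough: no assumptions about the mixer's signature.
        return self.original_component(*args, **kwargs)


class ToyMambaMixer:
    """Hypothetical Mamba-like mixer exposing some inner submodules."""

    def __init__(self):
        self.in_proj = object()
        self.conv1d = object()
        self.out_proj = object()

    def __call__(self, hidden_states, cache_params=None):
        return [2.0 * h for h in hidden_states]


class ToyAttentionMixer:
    """Hypothetical attention-like mixer with a different call signature."""

    def __call__(self, hidden_states, position_ids=None):
        return [h + 1.0 for h in hidden_states]


mamba = PassthroughMixer(ToyMambaMixer())
attention = PassthroughMixer(ToyAttentionMixer())
print(sorted(mamba.submodules), sorted(attention.submodules))
```

Because the wrapper never inspects its arguments, the same class covers mixers whose call signatures differ, which is the property the passthrough design relies on.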

class transformer_lens.model_bridge.supported_architectures.nemotron_h.NemotronHArchitectureAdapter(cfg: Any)

Bases: ArchitectureAdapter

Architecture adapter for NemotronHForCausalLM.

Hybrid Mamba-2 + Attention + MoE + dense MLP model. All layers share a single pre-norm and a single residual connection; the mixer type per layer is determined by config.layers_block_type[layer_idx].
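The single pre-norm, single residual structure can be written as out = x + mixer(norm(x)). The sketch below shows that shape in plain Python. It assumes an RMS-style norm without learned weights and uses an elementwise relu2 (squared ReLU) function as a stand-in mixer with no projection weights; rms_norm, block_forward, and toy_mlp are illustrative names, not functions from TransformerLens or transformers.

```python
import math


def rms_norm(x, eps=1e-5):
    """RMS normalization without a learned scale (an assumption of this sketch)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def relu2(v):
    """Squared ReLU: max(v, 0) ** 2."""
    r = max(v, 0.0)
    return r * r


def toy_mlp(h):
    # Non-gated stand-in mixer: activation only, no up/down projections.
    return [relu2(v) for v in h]


def block_forward(x, mixer):
    # One norm, one mixer call, one residual add. There is no second norm,
    # so there is no intermediate residual position inside the block.
    mixed = mixer(rms_norm(x))
    return [a + b for a, b in zip(x, mixed)]


out = block_forward([3.0, -4.0], toy_mlp)
print(out)
```

Swapping toy_mlp for any other callable leaves block_forward unchanged, which mirrors how the mixer type varies per layer while the block structure stays fixed.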

applicable_phases: list[int] = []

component_mapping: ComponentMapping | None

create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: Any) → Any

Build the unified DynamicCache for stateful generation.

Transformers ≥ 5.12 ships a unified DynamicCache that carries both KV-cache entries (attention layers) and SSM conv/recurrent states (Mamba layers) in a single object, using has_previous_state() to distinguish which state is available for a given layer index.
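To make the idea of one object holding two kinds of per-layer state concrete, here is a toy model of such a cache. It is not the transformers DynamicCache and does not reproduce its API: ToyUnifiedCache, update_kv, and update_ssm are invented for illustration, and the meaning given to has_previous_state here (an SSM state has already been stored for that layer) is an assumption.

```python
class ToyUnifiedCache:
    """Toy illustration only; NOT the transformers DynamicCache API."""

    def __init__(self, layers_block_type):
        self.layers_block_type = list(layers_block_type)
        self.kv_entries = {}  # attention layer index -> list of (key, value)
        self.ssm_states = {}  # mamba layer index -> (conv_state, recurrent_state)

    def update_kv(self, layer_idx, key, value):
        # Hypothetical method: KV entries grow by one item per decoding step.
        if self.layers_block_type[layer_idx] != "attention":
            raise ValueError(f"layer {layer_idx} is not an attention layer")
        self.kv_entries.setdefault(layer_idx, []).append((key, value))

    def update_ssm(self, layer_idx, conv_state, recurrent_state):
        # Hypothetical method: SSM state is overwritten, not appended.
        if self.layers_block_type[layer_idx] != "mamba":
            raise ValueError(f"layer {layer_idx} is not a mamba layer")
        self.ssm_states[layer_idx] = (conv_state, recurrent_state)

    def has_previous_state(self, layer_idx):
        # Assumed meaning: an SSM state was already stored for this layer.
        return layer_idx in self.ssm_states


cache = ToyUnifiedCache(["mamba", "attention", "mlp"])
cache.update_ssm(0, conv_state=[0.0], recurrent_state=[1.0])
cache.update_kv(1, key="k0", value="v0")
cache.update_kv(1, key="k1", value="v1")
print(cache.has_previous_state(0), cache.has_previous_state(1))
```

The contrast to notice is that attention layers accumulate entries over time while Mamba layers keep a fixed-size state that is replaced at each step.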

uses_split_attention: bool

weight_processing_conversions: Dict[str, ParamProcessingConversion | str] | None