transformer_lens.model_bridge.supported_architectures.nemotron_h module
Nemotron-H hybrid Mamba2-Transformer architecture adapter.
Supports NemotronHForCausalLM (nvidia/Nemotron-H-8B-Base, Nemotron-H-47B-A13B).
Architecture overview:
- Heterogeneous layers defined by config.layers_block_type: each element is one of "mamba", "attention", "moe", or "mlp".
- ~8% of layers are standard GQA attention; the rest are Mamba-2 SSM, dense MLP, or sparse MoE. All share a single pre-norm (block.norm) and a single residual path; there is no ln2 or post-attention norm.
- Each block exposes a single .mixer attribute whose type varies by layer.
- No model-level rotary embedding module: attention handles RoPE internally via position_ids passed from the outer model loop.
- Stateful generation: uses DynamicCache (transformers ≥ 5.12), which carries both KV-cache entries (attention layers) and SSM conv/recurrent states (Mamba layers) in a unified object.
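As a rough illustration of the layout described above, the sketch below walks a made-up layer pattern, counts each mixer kind, and applies the single pre-norm residual update x + mixer(norm(x)) once per block. The pattern list, toy_norm, toy_mixers, and block_forward are hypothetical stand-ins written in plain Python; they are not values or modules from any real Nemotron-H checkpoint. The pattern is chosen so that one layer in twelve is attention, which is close to the ~8% figure quoted above.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical layer pattern; a real checkpoint defines its own list in
# config.layers_block_type.
layers_block_type: List[str] = [
    "mamba", "mlp", "mamba", "moe", "mamba", "attention",
    "mamba", "mlp", "mamba", "moe", "mamba", "mlp",
]


def toy_norm(x: List[float]) -> List[float]:
    """Stand-in for block.norm: RMS-normalise a vector."""
    rms = (sum(v * v for v in x) / len(x)) ** 0.5
    return [v / (rms + 1e-6) for v in x]


# Stand-in mixers keyed by block type. Each maps a vector to a vector;
# the scale factors are arbitrary and only make the types distinguishable.
toy_mixers: Dict[str, Callable[[List[float]], List[float]]] = {
    "mamba": lambda h: [0.1 * v for v in h],
    "attention": lambda h: [0.2 * v for v in h],
    "moe": lambda h: [0.3 * v for v in h],
    "mlp": lambda h: [0.4 * v for v in h],
}


def block_forward(x: List[float], block_type: str) -> List[float]:
    """One block: a single pre-norm, a single mixer, a single residual add."""
    mixed = toy_mixers[block_type](toy_norm(x))
    return [a + b for a, b in zip(x, mixed)]


# Count how many layers of each kind the pattern contains.
counts = Counter(layers_block_type)
attention_share = counts["attention"] / len(layers_block_type)

# Run a toy residual stream through every block in order.
resid = [1.0, -2.0, 0.5]
for block_type in layers_block_type:
    resid = block_forward(resid, block_type)
```

Because every block has the same shape (norm, mixer, residual add), the only thing that changes from layer to layer is which mixer is looked up, which is why there is no second norm or mid-block residual position to hook.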
Key adapter decisions:
- SSMBlockBridge is used as the block container. It delegates the entire forward to the HF block, giving hook_in/hook_out on the residual stream without hardcoding transformer-specific hook positions (hook_resid_mid, hook_mlp_in, etc.) that do not exist in this single-norm architecture.
- SSM2MixerBridge wraps .mixer for all layer types. Its forward is a pure passthrough (original_component(*args, **kwargs)) so it works correctly for attention, MLP, and MoE mixers as well as Mamba ones. Mamba-specific inner submodules (in_proj, conv1d, inner_norm, out_proj) are declared optional=True so setup skips them gracefully on non-Mamba layers.
- MLP layers use relu2 activation (not SwiGLU); gated_mlp = False.
- applicable_phases = []: verify_models is transformer-shaped and would require a dedicated refactor to cover SSM hybrids. Coverage lives in the integration test instead.
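The passthrough-with-optional-submodules idea can be sketched in plain Python. PassthroughMixer, SubmoduleSpec, and the two toy mixers below are hypothetical illustrations, not the TransformerLens classes, and the binding logic is an assumption about how an optional flag might behave. The sketch also assumes relu2 means squared ReLU, max(x, 0) ** 2.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


def relu2(x: float) -> float:
    """Squared ReLU: max(x, 0) ** 2. There is no gating branch, unlike SwiGLU."""
    return max(x, 0.0) ** 2


@dataclass
class SubmoduleSpec:
    """Hypothetical declaration of an inner submodule the wrapper may bind."""
    name: str
    optional: bool = False


class PassthroughMixer:
    """Hypothetical wrapper: forwards calls unchanged and binds what it finds."""

    def __init__(self, original: Any, specs: List[SubmoduleSpec]) -> None:
        self.original = original
        self.bound: Dict[str, Any] = {}
        for spec in specs:
            if hasattr(original, spec.name):
                self.bound[spec.name] = getattr(original, spec.name)
            elif not spec.optional:
                raise AttributeError(f"missing required submodule: {spec.name}")
            # Optional and absent: skipped without error.

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Pure passthrough: no assumption about the wrapped mixer's type.
        return self.original(*args, **kwargs)


class ToyMambaMixer:
    """Toy mixer that carries the four Mamba-specific attribute names."""

    def __init__(self) -> None:
        self.in_proj = "in_proj"
        self.conv1d = "conv1d"
        self.inner_norm = "inner_norm"
        self.out_proj = "out_proj"

    def __call__(self, x: List[float]) -> List[float]:
        return [2.0 * v for v in x]


class ToyMlpMixer:
    """Toy non-gated MLP mixer with none of the Mamba submodules."""

    def __call__(self, x: List[float]) -> List[float]:
        return [relu2(v) for v in x]


MAMBA_SPECS = [
    SubmoduleSpec(name, optional=True)
    for name in ("in_proj", "conv1d", "inner_norm", "out_proj")
]

mamba = PassthroughMixer(ToyMambaMixer(), MAMBA_SPECS)
mlp = PassthroughMixer(ToyMlpMixer(), MAMBA_SPECS)
```

In this sketch the same spec list is applied to both mixers: the Mamba-like one binds all four names, while the MLP-like one binds none and still forwards calls normally.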
- class transformer_lens.model_bridge.supported_architectures.nemotron_h.NemotronHArchitectureAdapter(cfg: Any)
Bases: ArchitectureAdapter
Architecture adapter for NemotronHForCausalLM.
Hybrid Mamba-2 + Attention + MoE + dense MLP model. All layers share a single pre-norm and a single residual connection; the mixer type per layer is determined by config.layers_block_type[layer_idx].
- applicable_phases: list[int] = []
- component_mapping: ComponentMapping | None
- create_stateful_cache(hf_model: Any, batch_size: int, device: Any, dtype: Any) → Any
Build the unified DynamicCache for stateful generation.
Transformers ≥ 5.12 ships a unified DynamicCache that carries both KV-cache entries (attention layers) and SSM conv/recurrent states (Mamba layers) in a single object, using has_previous_state() to distinguish which state is available for a given layer index.
- uses_split_attention: bool
- weight_processing_conversions: Dict[str, ParamProcessingConversion | str] | None
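The unified cache described under create_stateful_cache can be pictured with a toy model in plain Python. ToyHybridCache and ToyLayerState below are hypothetical and deliberately simplified: they are not the transformers DynamicCache, and the method names and signatures here (including the toy has_previous_state) are assumptions made only to show one object holding per-layer KV entries and SSM state side by side.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyLayerState:
    """Per-layer slot; only the fields matching `kind` are ever filled."""
    kind: str  # "attention" or "mamba"
    keys: List[float] = field(default_factory=list)
    values: List[float] = field(default_factory=list)
    recurrent: Optional[float] = None


class ToyHybridCache:
    """Hypothetical single object holding KV entries and SSM states."""

    def __init__(self, layers_block_type: List[str]) -> None:
        # Stateless block types (mlp, moe) get no slot at all.
        self.layers: Dict[int, ToyLayerState] = {
            idx: ToyLayerState(kind=kind)
            for idx, kind in enumerate(layers_block_type)
            if kind in ("attention", "mamba")
        }

    def update_kv(self, layer_idx: int, key: float, value: float) -> int:
        """Append one key/value pair and return the cached sequence length."""
        state = self.layers[layer_idx]
        state.keys.append(key)
        state.values.append(value)
        return len(state.keys)

    def update_ssm(self, layer_idx: int, recurrent: float) -> None:
        """Overwrite the recurrent state of a Mamba layer."""
        self.layers[layer_idx].recurrent = recurrent

    def has_previous_state(self, layer_idx: int) -> bool:
        """True once the given layer has stored any state."""
        state = self.layers.get(layer_idx)
        if state is None:
            return False
        if state.kind == "attention":
            return bool(state.keys)
        return state.recurrent is not None


cache = ToyHybridCache(["mamba", "mlp", "attention", "moe"])
cache.update_ssm(0, recurrent=0.5)
seq_len = cache.update_kv(2, key=1.0, value=2.0)
```

The point of the sketch is only the shape of the data: attention layers grow a sequence of entries, Mamba layers overwrite a fixed-size state, and a per-layer query tells the caller whether anything has been stored yet.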