transformer_lens.HookedAudioEncoder#

Hooked Audio Encoder.

Contains a HuBERT-style model. This is separate from transformer_lens.HookedTransformer because its architecture differs significantly from, e.g., GPT-style transformers.

class transformer_lens.HookedAudioEncoder.HookedAudioEncoder(cfg: HookedTransformerConfig | Dict, move_to_device: bool = True, model_name: str = 'facebook/hubert-base-ls960', **kwargs: Any)#

Bases: HookedRootModule

This class implements a HuBERT-style audio encoder (a BERT-style transformer over audio frames) using the components in ./components.py, with HookPoints on every interesting activation. It inherits from HookedRootModule.

Limitations:
  • The model does not include dropout layers, which may lead to inconsistent results when training or fine-tuning.

Like HookedTransformer, it can have a pretrained Transformer’s weights loaded via .from_pretrained. There are a few features you might know from HookedTransformer which are not yet supported:
  • There is no preprocessing (e.g. LayerNorm folding) when loading a pretrained model

property OV: FactoredMatrix#

Returns a FactoredMatrix object with the product of the O and V matrices for each layer and head.

property QK: FactoredMatrix#

Returns a FactoredMatrix object with the product of the Q and K matrices for each layer and head. Useful for visualizing attention patterns.
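To build intuition for what QK (and, analogously, OV) represents, here is a numpy stand-in with random placeholder weights — not values from a real checkpoint. Per head, QK = W_Q @ W_K.T is a d_model × d_model matrix of rank at most d_head, which is why it is stored in factored form:

```python
import numpy as np

# Hypothetical sizes, not read from a loaded model.
d_model, d_head = 768, 64

rng = np.random.default_rng(0)
W_Q = rng.standard_normal((d_model, d_head))  # query weights for one head
W_K = rng.standard_normal((d_model, d_head))  # key weights for one head

# The QK "circuit": maps a destination-residual direction to the
# source-residual directions the head attends to.
QK = W_Q @ W_K.T

print(QK.shape)                   # (768, 768)
print(np.linalg.matrix_rank(QK))  # at most d_head, so at most 64
```

The low rank is the point of FactoredMatrix: it never materializes the full d_model × d_model product unless asked to.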

property W_K: Float[Tensor, 'n_layers n_heads d_model d_head']#

Stacks the key weights across all layers

property W_O: Float[Tensor, 'n_layers n_heads d_head d_model']#

Stacks the attn output weights across all layers

property W_Q: Float[Tensor, 'n_layers n_heads d_model d_head']#

Stacks the query weights across all layers

property W_V: Float[Tensor, 'n_layers n_heads d_model d_head']#

Stacks the value weights across all layers

property W_in: Float[Tensor, 'n_layers d_model d_mlp']#

Stacks the MLP input weights across all layers

property W_out: Float[Tensor, 'n_layers d_mlp d_model']#

Stacks the MLP output weights across all layers
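The stacked-weight properties above all follow the same shape contract: each layer's parameter is collected along a new leading n_layers axis. A toy numpy sketch of W_in's contract (sizes here are made up for illustration):

```python
import numpy as np

# Toy sizes for illustration only.
n_layers, d_model, d_mlp = 4, 8, 32

# Per-layer MLP input weights, as they live on the individual blocks.
per_layer = [np.zeros((d_model, d_mlp)) for _ in range(n_layers)]

# The W_in property stacks them along a new leading layer axis.
W_in = np.stack(per_layer, axis=0)
print(W_in.shape)  # (4, 8, 32) -> 'n_layers d_model d_mlp'
```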

all_head_labels() List[str]#

Returns a list of strings with the format “L{l}H{h}”, where l is the layer index and h is the head index.
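The labeling scheme is simple enough to sketch directly. This standalone reimplementation (not the library's own code) shows the layer-major ordering for a toy 2-layer, 2-head config:

```python
def all_head_labels(n_layers: int, n_heads: int) -> list:
    # One label per (layer, head) pair, in layer-major order.
    return [f"L{l}H{h}" for l in range(n_layers) for h in range(n_heads)]

print(all_head_labels(2, 2))  # ['L0H0', 'L0H1', 'L1H0', 'L1H1']
```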

property b_K: Float[Tensor, 'n_layers n_heads d_head']#

Stacks the key biases across all layers

property b_O: Float[Tensor, 'n_layers d_model']#

Stacks the attn output biases across all layers

property b_Q: Float[Tensor, 'n_layers n_heads d_head']#

Stacks the query biases across all layers

property b_V: Float[Tensor, 'n_layers n_heads d_head']#

Stacks the value biases across all layers

property b_in: Float[Tensor, 'n_layers d_mlp']#

Stacks the MLP input biases across all layers

property b_out: Float[Tensor, 'n_layers d_model']#

Stacks the MLP output biases across all layers

cpu() T#

Move all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

self

Return type:

Module

cuda(device: int | device | None = None) T#

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Parameters:

device (int, optional) – if specified, all parameters will be copied to that device

Returns:

self

Return type:

Module

encoder_output(frames: Tensor, one_zero_attention_mask: Tensor | None = None)#
forward(inputs: Tensor | List[Tensor | ndarray] | Tuple[Tensor, Tensor], one_zero_attention_mask: Int[Tensor, 'batch pos'] | None = None, sampling_rate: int = 16000, move_to_device: bool = True) Tensor | None#

HuBERT-like forward (Transformer-Lens style).

Parameters:
  • inputs – one of:
      – a 1D torch.Tensor or numpy array (single waveform), or a list of 1D waveforms → passed through self.to_frames(…)
      – a 3D torch.Tensor shaped (batch, frames, d_model) → treated as precomputed frames (to_frames is skipped)
      – a tuple (frames, frame_mask) → used directly

  • sampling_rate – sampling rate for to_frames when converting raw audio.

  • use_proj – Whether to use the final head of HubertCTC

  • move_to_device – move tensors to self.cfg.device.

Returns:

  • “hidden”: (batch, frames, d_model) final encoder hidden states

Return type:

Tensor | None
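The input dispatch described above can be sketched as follows. This is a numpy stand-in, not the library's actual code: frames_from_audio is a hypothetical placeholder for self.to_frames, and the toy frame count and hidden size are arbitrary:

```python
import numpy as np

def frames_from_audio(waveforms):
    # Hypothetical stand-in for self.to_frames: pretend each waveform
    # becomes 5 frames of a 4-dim model, with an all-ones frame mask.
    batch = len(waveforms)
    return np.zeros((batch, 5, 4)), np.ones((batch, 5), dtype=np.int64)

def dispatch(inputs):
    # Mirrors the documented branching in forward():
    if isinstance(inputs, tuple):                      # (frames, frame_mask)
        return inputs
    if isinstance(inputs, np.ndarray) and inputs.ndim == 3:
        return inputs, None                            # precomputed frames
    if isinstance(inputs, np.ndarray) and inputs.ndim == 1:
        return frames_from_audio([inputs])             # single waveform
    if isinstance(inputs, list):
        return frames_from_audio(inputs)               # batch of waveforms
    raise TypeError(f"unsupported input type: {type(inputs)}")

frames, mask = dispatch(np.zeros(16000))  # single 1-second waveform
print(frames.shape, mask.shape)           # (1, 5, 4) (1, 5)
```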

classmethod from_pretrained(model_name: str, checkpoint_index: int | None = None, checkpoint_value: int | None = None, hf_model: Any | None = None, device: str | None = None, move_to_device: bool = True, dtype: dtype = torch.float32, **from_pretrained_kwargs: Any) HookedAudioEncoder#

Loads the pretrained weights from Hugging Face. Currently supports loading weights from HuBERT-style checkpoints (HubertModel / Wav2Vec2Model). Unlike HookedTransformer, this does not yet do any preprocessing on the model.

hubert_model: HubertModel | Wav2Vec2Model#
mps() T#
processor: Any#
run_with_cache(*model_args: Any, return_cache_object: Literal[True] = True, **kwargs: Any) Tuple[Float[Tensor, 'batch pos d_vocab'], ActivationCache]#
run_with_cache(*model_args: Any, return_cache_object: Literal[False], **kwargs: Any) Tuple[Float[Tensor, 'batch pos d_vocab'], Dict[str, Tensor]]

Wrapper around run_with_cache in HookedRootModule. If return_cache_object is True, this will return an ActivationCache object with a number of useful HookedTransformer-specific methods; otherwise it will return a dictionary of activations, as in HookedRootModule. This function was copied directly from HookedTransformer.

to(device_or_dtype: device | str | dtype, print_details: bool = True)#

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Parameters:
  • device (torch.device) – the desired device of the parameters and buffers in this module

  • dtype (torch.dtype) – the desired floating point or complex dtype of the parameters and buffers in this module

  • tensor (torch.Tensor) – Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

  • memory_format (torch.memory_format) – the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

self

Return type:

Module

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_frames(raw_inputs: Tensor | List[Tensor | ndarray], sampling_rate: int = 16000, move_to_device: bool = True) Tuple[Tensor, Tensor]#

Convert raw audio batch -> (projected frames, frame_attention_mask)

Parameters:
  • raw_inputs – one of:
      – a 1D torch.Tensor or numpy array (single waveform)
      – a list of 1D torch.Tensors / numpy arrays (a batch)

  • sampling_rate – sample rate of the audio (default 16000)

  • move_to_device – move outputs to model.device

This method also uses self.processor (an HF AutoProcessor, which creates input_values and a sample-level attention_mask) and the pretrained HubertModel (which provides feature_extractor and feature_projection); these are attributes of the model, not arguments.

Returns:

A tuple (frames, frame_attention_mask):
  • frames – torch.Tensor of shape (batch, frames, hidden_size), taken after feature_projection
  • frame_attention_mask – torch.LongTensor of shape (batch, frames), with 1 for real frames and 0 for padding

Return type:

Tuple[Tensor, Tensor]
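For intuition about how many frames to_frames produces: HuBERT's convolutional feature extractor downsamples 16 kHz audio to roughly 50 frames per second. Assuming the standard Wav2Vec2/HuBERT conv stack (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2 — check your checkpoint's config, as other variants exist), the frame count follows from applying the usual conv length formula per layer:

```python
def n_frames(n_samples: int,
             kernels=(10, 3, 3, 3, 3, 2, 2),
             strides=(5, 2, 2, 2, 2, 2, 2)) -> int:
    # Each conv layer maps length L -> (L - kernel) // stride + 1.
    for k, s in zip(kernels, strides):
        n_samples = (n_samples - k) // s + 1
    return n_samples

print(n_frames(16000))  # 49 frames for one second of 16 kHz audio
```

So the `frames` dimension of the returned tensors is about `0.003 * n_samples` for this conv stack, and the frame_attention_mask marks which of those frames come from real audio rather than padding.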