Documentation#
nnterp.standardized_transformer module#
- class StandardizedTransformer(model, trust_remote_code=False, check_renaming=True, allow_dispatch=True, check_attn_probs_with_trace=True, rename_config=None, **kwargs)[source]#
Bases:
LanguageModel
Renames the LanguageModel modules to match a standardized architecture.
The model structure is organized as follows:
StandardizedTransformer
├── layers
│   ├── self_attn
│   └── mlp
├── ln_final
└── lm_head
In addition to renaming modules, this class provides built-in accessors to extract and set intermediate activations (see the sketch after this list):
layers[i]: Get layer module at layer i
layers_input[i]: Get/set layer input at layer i
layers_output[i]: Get/set layer output at layer i
attentions_output[i]: Get/set attention output at layer i
attentions[i]: Get attention module at layer i
mlps_output[i]: Get/set MLP output at layer i
mlps[i]: Get MLP module at layer i
- Parameters:
model (str | Module) – Hugging Face repository ID or path of the model to load, or an already-loaded torch module.
trust_remote_code (bool, optional) – If True, remote code is trusted when loading the model. Defaults to False.
check_renaming (bool, optional) – If True, the renaming of modules is validated. Defaults to True.
allow_dispatch (bool, optional) – If True, allows using trace() to dispatch the model when scan() fails during renaming checks. Defaults to True. Set this to False if you plan to use the model remotely.
check_attn_probs_with_trace (bool, optional) – If True, the model is dispatched and a test ensures that the returned attention probabilities sum to 1. Defaults to True.
rename_config (RenameConfig, optional) – A RenameConfig object to use for renaming the model. If None, a default RenameConfig is used.
- property add_prefix_false_tokenizer: PreTrainedTokenizerBase#
- property attn_probs_available: bool#
- property input_size: Size#
- property attention_mask: Tensor | Object#
- property logits: Tensor | Object#
Returns the lm_head output
- property next_token_probs: Tensor | Object#
- skip_layer(layer, skip_with=None)[source]#
Skip the computation of a layer.
- Parameters:
layer (int) – The layer to skip
skip_with (Tensor | Object | None) – The input to skip the layer with. If None, the input of the layer is used.
- skip_layers(start_layer, end_layer, skip_with=None)[source]#
Skip all layers between start_layer and end_layer (inclusive).
- Parameters:
start_layer (int) – The layer to start skipping from
end_layer (int) – The layer to stop skipping at (inclusive)
skip_with (Tensor | Object | None) – The input to skip the layers with. If None, the input of start_layer is used.
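A short sketch of skipping a block of layers during a trace (model id, prompt, and layer range are illustrative):

```python
from nnterp import StandardizedTransformer

model = StandardizedTransformer("gpt2")

with model.trace("Hello, world"):
    # Pass layer 4's input straight through, skipping layers 4..7 entirely
    model.skip_layers(4, 7)
    probs = model.next_token_probs.save()
```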
- steer(layers, steering_vector, factor=1, positions=None, get_layer_object_to_steer=None)[source]#
Steer the hidden states of a layer using a steering vector.
- Parameters:
layers (int | list[int]) – The layer(s) to steer
steering_vector (Tensor) – The steering vector to apply
factor (float) – The factor to multiply the steering vector by
positions (int | list[int] | Tensor | None) – The position(s) to steer. If None, all positions are steered.
get_layer_object_to_steer (Callable[[int], Tensor | Object] | None) – Function that, given a layer index, returns the object to steer in the model. Defaults to model.layers_output[layer].
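A minimal sketch of applying a steering vector inside a trace (the random direction is a stand-in for a real steering vector, and the hidden size is read from the model config):

```python
import torch
from nnterp import StandardizedTransformer

model = StandardizedTransformer("gpt2")
direction = torch.randn(model.config.hidden_size)  # placeholder steering vector

with model.trace("The weather today is"):
    model.steer(layers=[4, 5], steering_vector=direction, factor=2.0)
    probs = model.next_token_probs.save()
```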
nnterp.interventions module#
- logit_lens(nn_model, prompts, remote=False)[source]#
Get the probabilities of the next token for the last token of each prompt at each layer, using the logit lens.
- Parameters:
nn_model (LanguageModel) – NNSight Language Model
prompts (list[str] | str) – List of prompts or a single prompt
remote (bool) – If True, the computation runs on the NDIF remote server
- Returns:
A tensor of shape (num_prompts, num_layers, vocab_size) containing the probabilities of the next token for each prompt at each layer. Tensor is on the CPU.
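A minimal usage sketch (model id and prompts are illustrative):

```python
from nnterp import StandardizedTransformer
from nnterp.interventions import logit_lens

model = StandardizedTransformer("gpt2")
probs = logit_lens(model, ["The capital of France is", "The capital of Japan is"])
# probs: (num_prompts, num_layers, vocab_size), on CPU
last_layer_top = probs[:, -1].argmax(dim=-1)  # top next-token prediction at the final layer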
- class TargetPrompt(prompt, index_to_patch)[source]#
Bases:
object
- Parameters:
prompt (str)
index_to_patch (int)
- prompt: str#
- index_to_patch: int#
- repeat_prompt(words=None, rel=' ', sep='\n', placeholder='?')[source]#
Build the prompt used in the Patchscopes paper (see PAIR-code/interpretability) to predict the next token.
- Parameters:
words – The words to repeat. If None, the words will be “king”, “1135”, “hello”.
rel – The string inserted between a word and its repetition
sep – The separator between consecutive repeated-word pairs
placeholder – The placeholder to use for the last word
- Returns:
A TargetPrompt object containing the prompt to patch and the index of the token to patch.
- Return type:
TargetPrompt
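With the defaults, the sketch below shows the rough shape of the generated prompt (exact layout assumed from the defaults above):

```python
from nnterp.interventions import repeat_prompt

target = repeat_prompt()
# target.prompt is roughly "king king\n1135 1135\nhello hello\n?"
# target.index_to_patch marks the placeholder token to overwrite
```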
- class TargetPromptBatch(prompts, index_to_patch)[source]#
Bases:
object
A class to handle multiple target prompts with potentially different indices to patch
- Parameters:
prompts (list[str])
index_to_patch (Tensor)
- prompts: list[str]#
- index_to_patch: Tensor#
- classmethod from_target_prompts(prompts_)[source]#
- Parameters:
prompts_ (list[TargetPrompt])
- classmethod from_target_prompt(prompt, batch_size)[source]#
- Parameters:
prompt (TargetPrompt)
batch_size (int)
- classmethod from_prompts(prompts, index_to_patch)[source]#
- Parameters:
prompts (str | list[str])
index_to_patch (int | list[int] | Tensor)
- static auto(target_prompt, batch_size)[source]#
- Parameters:
target_prompt (str | TargetPrompt | list[TargetPrompt] | TargetPromptBatch)
batch_size (int)
- patchscope_lens(nn_model, source_prompts=None, target_patch_prompts=None, layers=None, latents=None, remote=False)[source]#
Replace the hidden state of the target_patch_prompts.index_to_patch token in each target prompt with the hidden state of the last token of each source prompt, at each layer. Returns the probabilities of the next token in the target prompt for each source prompt and each layer intervention.
- Returns:
A tensor of shape (num_prompts, num_layers, vocab_size) containing the probabilities of the next token for each prompt at each layer. Tensor is on the CPU.
- Parameters:
nn_model (LanguageModel) – The NNSight model
source_prompts (list[str] | str | None) – List of prompts or a single prompt to get the hidden states of the last token from
target_patch_prompts (TargetPromptBatch | list[TargetPrompt] | TargetPrompt | None) – TargetPrompt(s) / TargetPromptBatch containing the prompt(s) to patch and the index of the token to patch
layers – List of layers to intervene on. If None, all layers are intervened on.
latents – List of latents to use. If None, the hidden states of the last token of each source prompt at each layer are collected.
remote – If True, the function runs on the NDIF remote server. See nnsight.net/status to check which models are available.
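A minimal sketch combining repeat_prompt and patchscope_lens (model id and source prompt are illustrative):

```python
from nnterp import StandardizedTransformer
from nnterp.interventions import patchscope_lens, repeat_prompt

model = StandardizedTransformer("gpt2")
target = repeat_prompt()  # default identity prompt from the Patchscopes paper
probs = patchscope_lens(
    model, source_prompts="The Eiffel Tower is in", target_patch_prompts=target
)
# probs: (num_prompts, num_layers, vocab_size)
```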
- patchscope_generate(nn_model, prompts, target_patch_prompt, max_length=50, layers=None, remote=False, max_batch_size=32)[source]#
Replace the hidden state of the target_patch_prompt.index_to_patch token in the target prompt with the hidden state of the last token of each prompt, at each layer, and generate from the patched model.
- Returns:
A tensor of shape (num_prompts, num_layers, vocab_size) containing the probabilities of the next token for each prompt at each layer. Tensor is on the CPU.
- Parameters:
nn_model (LanguageModel) – The NNSight LanguageModel with llama architecture
prompts (list[str] | str) – List of prompts or a single prompt to get the hidden states of the last token from
target_patch_prompt (TargetPrompt) – A TargetPrompt object containing the prompt to patch and the index of the token to patch
max_length (int) – The maximum length of the generated sequence
layers – List of layers to intervene on. If None, all layers are intervened on.
remote – If True, the function runs on the NDIF remote server. See nnsight.net/status to check which models are available.
max_batch_size – The maximum number of prompts to intervene on at once.
- steer(nn_model, layers, steering_vector, factor=1, position=-1, get_module=<function get_layer_output>)[source]#
Steer the hidden states of a layer using a steering vector.
- Parameters:
nn_model (LanguageModel) – The NNSight model
layers (int | list[int]) – The layer(s) to steer
steering_vector (Tensor) – The steering vector to apply
factor (float) – The factor to multiply the steering vector by
position (int) – The position to steer
get_module (Callable[[LanguageModel, int], Tensor | Object]) – Function that, given the model and a layer index, returns the object to steer. Defaults to get_layer_output.
- patch_object_attn_lens(nn_model, source_prompts, target_prompts, attn_idx_patch, num_patches=5)[source]#
A complex lens that makes the model attend to the hidden states of the last token of the source prompts instead of the attn_idx_patch token of the target prompts when predicting the last token. For each layer, this intervention is performed over num_patches consecutive layers.
- Returns:
A tensor of shape (num_target_prompts, num_layers, vocab_size) containing the probabilities of the next token for each target prompt at each layer. Tensor is on the CPU.
- Parameters:
nn_model (LanguageModel) – The NNSight model
source_prompts (list[str] | str) – The prompts to get the hidden states of the last token from
target_prompts (list[str] | str) – The prompts to predict the next token for
attn_idx_patch (int) – The index of the token to patch in the target prompts
num_patches (int) – The number of layers to patch for each layer
nnterp.prompt_utils module#
- get_first_tokens(words, llm_or_tokenizer, use_hacky_implementation=False)[source]#
Get all the first tokens of “word” and “ word” for all words.
- Parameters:
words (str | list[str]) – A string or a list of strings to get the first token of.
llm_or_tokenizer (LanguageModel | StandardizedTransformer | PreTrainedTokenizerBase) – The tokenizer to use. If a LanguageModel or StandardizedTransformer is provided, the tokenizer will be extracted from it. It is recommended to use StandardizedTransformer. If you want to use your own tokenizer, it’s recommended to initialize it with add_prefix_space=False or to use the hacky implementation.
use_hacky_implementation – If True, use a hacky implementation to get the first token of a word by tokenizing “🍐word” and extracting the first token of word. While hacky, it is still guaranteed to work correctly or raise an error.
- Returns:
A list of tokens.
- Return type:
list[int]
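A quick sketch (model id and words are illustrative):

```python
from nnterp import StandardizedTransformer
from nnterp.prompt_utils import get_first_tokens

model = StandardizedTransformer("gpt2")
tokens = get_first_tokens(["Paris", "London"], model)
# First tokens of "Paris", " Paris", "London", and " London"
```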
- class Prompt(prompt, target_tokens, target_strings=None)[source]#
Bases:
object
Generic class to represent a prompt with target tokens to track during next token prediction.
- Parameters:
prompt (str) – The prompt to use
target_tokens (dict[str, list[int]]) – A dictionary of target tokens for each target
target_strings (dict[str, str | list[str]] | None) – A dictionary of target strings for each target
- prompt: str#
- target_tokens: dict[str, list[int]]#
- target_strings: dict[str, str | list[str]] | None = None#
- next_token_probs_unsqueeze(nn_model, prompt, remote=False, **_kwargs)[source]#
- Parameters:
nn_model (LanguageModel)
prompt (str | list[str])
- Return type:
Tensor
- run_prompts(nn_model, prompts, batch_size=32, get_probs_func=None, func_kwargs=None, remote=False, tqdm=<class 'tqdm.asyncio.tqdm_asyncio'>)[source]#
Run a list of prompts through the model and return the probabilities of the next token for the target tokens.
- Parameters:
nn_model (LanguageModel) – The NNSight model
prompts (list[Prompt]) – A list of prompts. All prompts must have the same target keys
batch_size (int) – The batch size to use
get_probs_func (Callable | None) – The function to get the probabilities of the next token, defaults to next-token prediction
func_kwargs (dict | None) – The kwargs to pass to get_probs_func
remote (bool) – Whether to run the model on the NDIF remote server
tqdm – The tqdm function to use, defaults to tqdm.auto.tqdm. Use None to disable tqdm
- Returns:
A dictionary of target names and the probabilities of the next token for the target tokens.
- Return type:
dict[str, Tensor]
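Putting Prompt and run_prompts together, a minimal sketch (prompt and target key are illustrative):

```python
from nnterp import StandardizedTransformer
from nnterp.prompt_utils import Prompt, get_first_tokens, run_prompts

model = StandardizedTransformer("gpt2")
city_tokens = get_first_tokens("Paris", model)
prompts = [Prompt("The capital of France is", target_tokens={"city": city_tokens})]
probs = run_prompts(model, prompts, batch_size=8)
# probs["city"]: next-token probability mass on the target tokens, one row per prompt
```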
nnterp.display module#
nnterp.nnsight_utils module#
- get_layers(model)[source]#
Get the layers of the model
- Parameters:
model (LanguageModel)
- Return type:
list[Envoy]
- get_num_layers(nn_model)[source]#
Get the number of layers in the model
- Returns:
The number of layers in the model
- Parameters:
nn_model (LanguageModel) – The NNSight model
- get_layer(nn_model, layer)[source]#
Get the layer of the model
- Returns:
The Envoy for the layer
- Parameters:
nn_model (LanguageModel) – The NNSight model
layer (int) – The layer to get
- Return type:
Envoy
- get_layer_input(nn_model, layer)[source]#
Get the hidden state input of a layer
- Returns:
The Proxy for the input of the layer
- Parameters:
nn_model (LanguageModel) – The NNSight model
layer (int) – The layer to get the input of
- Return type:
Tensor | Object
- get_layer_output(nn_model, layer)[source]#
Get the output of a layer
- Returns:
The Proxy for the output of the layer
- Parameters:
nn_model (LanguageModel) – The NNSight model
layer (int) – The layer to get the output of
- Return type:
Tensor | Object
- get_attention(nn_model, layer)[source]#
Get the attention module of a layer
- Returns:
The Envoy for the attention module of the layer
- Parameters:
nn_model (LanguageModel) – The NNSight model
layer (int) – The layer to get the attention module of
- Return type:
Envoy
- get_attention_output(nn_model, layer)[source]#
Get the output of the attention block of a layer
- Returns:
The Proxy for the output of the attention block of the layer
- Parameters:
nn_model (LanguageModel) – The NNSight model
layer (int) – The layer to get the output of
- Return type:
Tensor | Object
- get_mlp(nn_model, layer)[source]#
Get the MLP of a layer
- Parameters:
nn_model (LanguageModel)
layer (int)
- Return type:
Envoy
- get_mlp_output(nn_model, layer)[source]#
Get the output of the MLP of a layer
- Parameters:
nn_model (LanguageModel)
layer (int)
- Return type:
Tensor | Object
- get_logits(nn_model)[source]#
Get the logits of the model
- Returns:
The Proxy for the logits of the model
- Parameters:
nn_model (LanguageModel) – The NNSight model
- Return type:
Tensor | Object
- get_unembed_norm(nn_model)[source]#
Get the last layer norm of the model
- Returns:
The Envoy for the last layer norm of the model
- Parameters:
nn_model (LanguageModel) – The NNSight model
- Return type:
Envoy
- get_unembed(nn_model)[source]#
Get the unembed module of the model
- Returns:
The Envoy for the unembed module of the model
- Parameters:
nn_model (LanguageModel) – The NNSight model
- Return type:
Envoy
- project_on_vocab(nn_model, h)[source]#
Project the hidden states on the vocabulary, after applying the model’s last layer norm
- Returns:
The Proxy for the hidden states projected on the vocabulary
- Parameters:
nn_model (LanguageModel) – The NNSight model
h (Tensor | Object) – The hidden states to project
- Return type:
Tensor | Object
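A minimal sketch of a logit-lens-style projection inside a trace (model id and layer index are illustrative):

```python
from nnterp import StandardizedTransformer
from nnterp.nnsight_utils import get_layer_output, project_on_vocab

model = StandardizedTransformer("gpt2")
with model.trace("Hello world"):
    h = get_layer_output(model, 6)  # residual stream after layer 6
    layer_logits = project_on_vocab(model, h).save()
# layer_logits: (batch, seq_len, vocab_size)
```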
- skip_layer(nn_model, layer, skip_with=None)[source]#
Skip the computation of a layer. If skip_with is None, the input of the layer is used as its output.
- Parameters:
nn_model (LanguageModel) – The NNSight model
layer (int) – The layer to skip
skip_with (Tensor | Object | None) – The input to skip the layer with. If None, the input of the layer is used.
- skip_layers(nn_model, start_layer, end_layer, skip_with=None)[source]#
Skip all layers between start_layer and end_layer (inclusive). Equivalent to set_layer_output(nn_model, end_layer, get_layer_input(nn_model, start_layer)), but skips the useless computation in between.
- Parameters:
nn_model (LanguageModel) – The NNSight model
start_layer (int) – The layer to start skipping from
end_layer (int) – The layer to stop skipping at (inclusive)
skip_with (Tensor | Object | None) – The input to skip the layers with. If None, the input of start_layer is used.
- get_next_token_probs(nn_model)[source]#
Get the next-token probabilities of the model
- Returns:
The Proxy for the next-token probabilities of the model
- Parameters:
nn_model (LanguageModel) – The NNSight model
- Return type:
Tensor | Object
- set_layer_output(nn_model, layer, tensor)[source]#
Set the output of a layer to a certain tensor.
- Parameters:
nn_model (LanguageModel) – The NNSight model
layer (int) – The layer to set the output of
tensor (Tensor | Object) – The tensor to set the output of the layer to
- get_token_activations(nn_model, prompts=None, layers=None, get_activations=None, remote=False, idx=None, tracer=None)[source]#
Collect the hidden states of the last token of each prompt at each layer
- Parameters:
nn_model (LanguageModel) – The NNSight model
prompts – The prompts to collect activations for. Can be None if you call this from an existing tracer.
layers – The layers to collect activations for, default to all layers
get_activations (Callable[[LanguageModel, int], Tensor | Object] | None) – The function to get the activations, default to layer output
remote – Whether to run the model on the remote device
idx (int | None) – The index of the token to collect activations for
tracer – A tracer object to use to collect activations. If None, a new tracer is created.
- Returns:
The hidden states of the last token of each prompt at each layer, moved to cpu. If called from an existing tracer, returns a list of Proxies instead. Dimensions are (num_layers, num_prompts, hidden_size)
- collect_last_token_activations_session(nn_model, prompts, batch_size, layers=None, get_activations=None, remote=False, idx=None)[source]#
Collect the hidden states of the specified token of each prompt at each layer in batches using a nnsight session.
- Parameters:
nn_model – The NNSight model
prompts – The prompts to collect activations for
batch_size – The batch size to use
layers – The layers to collect activations for, default to all layers
get_activations – The function to get the activations, default to layer output
remote – Whether to run the model on the remote device
idx – The index of the token to collect activations for. Default is -1 (last token).
- Returns:
The hidden states of the specified token of each prompt at each layer, moved to cpu. Dimensions are (num_layers, num_prompts, hidden_size)
- collect_token_activations_batched(nn_model, prompts, batch_size, layers=None, get_activations=None, remote=False, idx=None, tqdm=None, use_session=True)[source]#
Collect the hidden states of the last token of each prompt at each layer in batches
- Parameters:
nn_model (LanguageModel) – The NNSight model
prompts – The prompts to collect activations for
batch_size – The batch size to use
layers – The layers to collect activations for, default to all layers
get_activations (Callable[[LanguageModel, int], Tensor | Object] | None) – The function to get the activations, default to layer output
remote – Whether to run the model on the remote device
idx – The index of the token to collect activations for. Default is -1 (last token).
tqdm – Whether to use tqdm to show progress, default to None (no progress bar)
use_session – Whether to use an nnsight session to collect activations. Defaults to True.
- Returns:
The hidden states of the specified token of each prompt at each layer, moved to cpu. Dimensions are (num_layers, num_prompts, hidden_size)
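A usage sketch (model id and prompts are illustrative):

```python
from nnterp import StandardizedTransformer
from nnterp.nnsight_utils import collect_token_activations_batched

model = StandardizedTransformer("gpt2")
prompts = ["The capital of France is", "The capital of Japan is"]
acts = collect_token_activations_batched(model, prompts, batch_size=2)
# acts: (num_layers, num_prompts, hidden_size), last-token activations on CPU
```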
- compute_next_token_probs(nn_model, prompt, remote=False)[source]#
Get the probabilities of the next token for the prompt
- Returns:
The probabilities of the next token for the prompt
- Parameters:
nn_model (LanguageModel) – The NNSight model
prompt (str | list[str]) – The prompt to get the probabilities for
remote (bool) – Whether to run the model on the remote device
- Return type:
Tensor
Internal Modules#
nnterp.utils module#
- try_with_scan(model, function, error_to_throw, allow_dispatch, warn_if_scan_fails=True, errors_to_raise=None)[source]#
Attempt to execute a function using model.scan(), falling back to model.trace() if needed.
This function tries to execute the given function within a model.scan() context first, which avoids dispatching the model. If that fails and fallback is allowed, it will try using model.trace() instead, which does dispatch the model.
- Parameters:
model – The model object that supports .scan() and .trace() methods
function – A callable to execute within the model context (takes no arguments)
error_to_throw (Exception) – Exception to raise if both scan and trace fail
allow_dispatch (bool) – Whether to allow fallback to .trace() if .scan() fails
warn_if_scan_fails (bool, optional) – Whether to log warnings when scan fails. Defaults to True.
errors_to_raise (tuple, optional) – Tuple of exception types that should be raised immediately if encountered during scan, without fallback to trace.
- Returns:
True if scan succeeded, False if trace was used instead
- Return type:
bool
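A hedged sketch of how a renaming check might call this helper (the probe body is illustrative):

```python
from nnterp.utils import try_with_scan

def check_layers(model):
    def probe():
        # Any tracing-time validation; here, just touch the first layer's output
        _ = model.layers_output[0]

    used_scan = try_with_scan(
        model,
        probe,
        error_to_throw=RuntimeError("module renaming validation failed"),
        allow_dispatch=True,
    )
    return used_scan  # True if scan() sufficed, False if trace() was needed
```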
nnterp.rename_utils module#
- exception RenamingError[source]#
Bases:
Exception
Exception raised when the renaming of modules is not properly done.
- class AttnProbFunction[source]#
Bases:
ABC
- abstract get_attention_prob_source(attention_module, return_module_source=False)[source]#
Get the attention probabilities source for a given attention module. If return_module_source is True, return the full module source from where the attention probabilities are computed.
- Parameters:
return_module_source (bool)
- class RenameConfig(attn_name=None, mlp_name=None, ln_final_name=None, lm_head_name=None, model_name=None, layers_name=None, mlp_returns_tuple=None, attn_prob_source=None, ignore_mlp=None, ignore_attn=None, attn_head_config_key=None, hidden_size_config_key=None)[source]#
Bases:
object
Configuration for renaming transformer model modules to standardized names.
This dataclass specifies how to map model-specific module names to standardized names used by nnterp. It allows customization for different transformer architectures.
- Parameters:
attn_name (str or list of str, optional) – Name(s) of the attention module to rename to ‘self_attn’.
mlp_name (str or list of str, optional) – Name(s) of the MLP/feed-forward module to rename to ‘mlp’.
ln_final_name (str or list of str, optional) – Name(s) of the final layer normalization to rename to ‘ln_final’.
lm_head_name (str or list of str, optional) – Name(s) of the language model head to rename to ‘lm_head’.
model_name (str or list of str, optional) – Name(s) of the main model container to rename to ‘model’.
layers_name (str or list of str, optional) – Name(s) of the transformer layers container to rename to ‘layers’.
mlp_returns_tuple (bool, optional) – Whether the MLP module returns a tuple instead of a single tensor. Some architectures (e.g., Mixtral, Qwen2MoE, DBRX) return tuples from MLP.
attn_prob_source (AttnProbFunction, optional) – Custom function for accessing attention probabilities. Should be an instance of AttnProbFunction that defines how to extract attention weights from the attention module.
ignore_mlp (bool, optional) – Whether to skip MLP module processing for this architecture. Some models (e.g., OPT) don’t have a unified MLP module.
ignore_attn (bool, optional) – Whether to skip attention module processing for this architecture. Rarely used, for architectures without standard attention.
attn_head_config_key (str, list of str, or int, optional) – Custom key name for the number of attention heads in model config, or the number of heads directly. Defaults to standard keys: [‘n_heads’, ‘num_attention_heads’, ‘n_head’].
hidden_size_config_key (str, list of str, or int, optional) – Custom key name for hidden size in model config, or the hidden size directly. Defaults to standard keys: [‘hidden_size’, ‘d_model’, ‘n_embd’].
Example
Custom configuration for a non-standard architecture:
config = RenameConfig(
    attn_name="custom_attention",
    mlp_name=["feed_forward", "ffn"],
    mlp_returns_tuple=True,
)
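The config is then passed to StandardizedTransformer (the repo id here is hypothetical):

```python
model = StandardizedTransformer("my-org/custom-model", rename_config=config)
```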
- attn_name: str | list[str] | None = None#
- mlp_name: str | list[str] | None = None#
- ln_final_name: str | list[str] | None = None#
- lm_head_name: str | list[str] | None = None#
- model_name: str | list[str] | None = None#
- layers_name: str | list[str] | None = None#
- mlp_returns_tuple: bool | None = None#
- attn_prob_source: AttnProbFunction | None = None#
- ignore_mlp: bool | None = None#
- ignore_attn: bool | None = None#
- attn_head_config_key: str | list[str] | int | None = None#
- hidden_size_config_key: str | list[str] | int | None = None#
- get_num_attention_heads(model, raise_error=True, rename_config=None)[source]#
- Parameters:
raise_error (bool)
rename_config (RenameConfig | None)
- Return type:
int | None
- get_rename_dict(rename_config=None)[source]#
- Parameters:
rename_config (RenameConfig | None)
- Return type:
dict[str, str]
- class IOType(value)[source]#
Bases:
Enum
Enum to specify input or output access
- INPUT = 'input'#
- OUTPUT = 'output'#
- class LayerAccessor(model, attr_name, io_type, returns_tuple=False)[source]#
Bases:
object
Accessor providing read and write access to a layer attribute’s input or output
- Parameters:
attr_name (str | None)
io_type (IOType | None)
returns_tuple (bool)
- bloom_attention_prob_source(attention_module, return_module_source=False)[source]#
- Parameters:
return_module_source (bool)
- default_attention_prob_source(attention_module, return_module_source=False)[source]#
- Parameters:
return_module_source (bool)
- gpt2_attention_prob_source(attention_module, return_module_source=False)[source]#
- Parameters:
return_module_source (bool)
- gptj_attention_prob_source(attention_module, return_module_source=False)[source]#
- Parameters:
return_module_source (bool)
- class AttentionProbabilitiesAccessor(model, rename_config=None)[source]#
Bases:
object
- Parameters:
rename_config (RenameConfig | None)
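A hedged sketch of reading attention probabilities through StandardizedTransformer, assuming the accessor is exposed as attention_probabilities (consistent with attn_probs_available above; the shape in the comment is an assumption):

```python
from nnterp import StandardizedTransformer

model = StandardizedTransformer("gpt2")
if model.attn_probs_available:
    with model.trace("The cat sat on the mat"):
        # Assumed shape: (batch, num_heads, seq_len, seq_len)
        attn = model.attention_probabilities[3].save()
```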
- get_ignores(model, rename_config=None)[source]#
- Parameters:
rename_config (RenameConfig | None)
- Return type:
list[str]
- mlp_returns_tuple(model, rename_config=None)[source]#
- Parameters:
rename_config (RenameConfig | None)
- Return type:
bool