# Basic Usage
## Standardized Interface
Different transformer models use different naming conventions. nnterp
standardizes all models to use the llama naming convention:
```
StandardizedTransformer
├── layers
│   ├── self_attn
│   └── mlp
├── ln_final
└── lm_head
```
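Because every supported model is remapped to these names, the same attribute paths work regardless of the underlying architecture. A minimal sketch, using only the modules from the tree above:

```python
from nnterp import StandardizedTransformer

model = StandardizedTransformer("gpt2")

# The same paths work for any supported model
print(model.layers[0].self_attn)  # attention module of layer 0
print(model.layers[0].mlp)        # MLP module of layer 0
print(model.ln_final)             # final layer norm
print(model.lm_head)              # unembedding head
```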
## Loading Models
```python
from nnterp import StandardizedTransformer

# These all work the same way
model = StandardizedTransformer("gpt2")
model = StandardizedTransformer("meta-llama/Llama-2-7b-hf")

# Uses device_map="auto" by default
print(model.device)

# Access the model's hidden size and number of attention heads (if available)
print(f"hidden size: {model.hidden_size}")
print(f"number of attention heads: {model.num_heads}")
```
## Accessing Module I/O
Access layer inputs and outputs directly:

```python
with model.trace("hello"):
    # Access layer outputs
    layer_5_output = model.layers_output[5]
```

Access attention and MLP outputs:

```python
with model.trace("hello"):
    attn_output = model.attentions_output[3]
    mlp_output = model.mlps_output[3]
```
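Inputs have matching accessors. A minimal sketch, assuming the `layers_input` counterpart to `layers_output`:

```python
with model.trace("hello"):
    # Residual stream entering layer 5
    layer_5_input = model.layers_input[5]
```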
## Skip Layers
with model.trace("Hello world"):
# Skip layer 1
model.skip_layer(1)
# Skip layers 2 through 3
model.skip_layers(2, 3)
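A skipped layer contributes no computation: by default the hidden state entering the skipped span is forwarded unchanged. You can also substitute your own hidden state via `skip_with`, as shown below.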
Use saved activations to skip with a custom hidden state. Here we cache layer 6's output, then skip layers 0 through 6 while substituting that cached activation; the result matches a vanilla forward pass because the skipped span outputs exactly what those layers would have computed:

```python
import torch

# Cache layer 6's output, then stop the forward pass early
with model.trace("Hello world") as tracer:
    layer_6_out = model.layers_output[6].save()
    tracer.stop()

# Skip layers 0-6, substituting the cached activation
with model.trace("Hello world"):
    model.skip_layers(0, 6, skip_with=layer_6_out)
    results_skipped = model.logits.save()

# Vanilla forward pass for comparison
with model.trace("Hello world"):
    results_vanilla = model.logits.save()

assert torch.allclose(results_vanilla, results_skipped)
```
## Built-in Methods
Project an activation onto the vocabulary (applies `ln_final` and the `lm_head` unembedding):

```python
with model.trace("The capital of France is"):
    hidden = model.layers_output[5]
    logits = model.project_on_vocab(hidden)
```
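To inspect the projection, save it and decode the top prediction after the trace. A minimal sketch, assuming the underlying tokenizer is exposed as `model.tokenizer`:

```python
with model.trace("The capital of France is"):
    hidden = model.layers_output[5]
    logits = model.project_on_vocab(hidden).save()

# Most likely next token at the final position
top_token_id = logits[0, -1].argmax().item()
print(model.tokenizer.decode(top_token_id))
```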
Steering (adds `factor * steering_vector` to the residual stream at the chosen layers):

```python
import torch

steering_vector = torch.randn(768)  # gpt2 hidden size
with model.trace("The weather today is"):
    model.steer(layers=[1, 3], steering_vector=steering_vector, factor=0.5)
```
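To see the effect, compare steered and unsteered logits; this sketch reuses only the APIs shown above:

```python
import torch

steering_vector = torch.randn(768)  # gpt2 hidden size

with model.trace("The weather today is"):
    baseline_logits = model.logits.save()

with model.trace("The weather today is"):
    model.steer(layers=[1, 3], steering_vector=steering_vector, factor=0.5)
    steered_logits = model.logits.save()

# The added vector shifts the residual stream, so the logits should change
print((baseline_logits - steered_logits).abs().max())
```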