Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Published in ICLR 2026, 2025

Abstract

We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to reveal the finetuning domain, and that these biases can be discovered with simple tools from model diffing, the study of differences between models before and after finetuning. Inspecting activation differences on the first few tokens of random text, and steering by adding this difference to the model's activations, produces text that mirrors the format and general content of the finetuning data. We also built an LLM-based interpretability agent for identifying the finetuning domain; it performs significantly better when given access to this bias. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models, across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters).
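To make the method concrete, the sketch below illustrates one way to compute an activation difference between a finetuned model and its base model on the first few tokens of random text, and to steer with it by adding the difference back into the residual stream. This is an illustrative sketch, not the paper's code: the model names, layer index, token count, steering strength, and the choice to steer the base model are all assumptions.

```python
# Illustrative sketch (assumptions: model names, LAYER, N_TOKENS, steering
# strength, and steering the base model are all hypothetical choices).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-2-2b-it"           # assumed base model
FINETUNED = "path/to/narrow-finetune"   # hypothetical finetuned checkpoint
LAYER, N_TOKENS = 12, 5                 # assumed layer and number of leading tokens

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
ft = AutoModelForCausalLM.from_pretrained(FINETUNED, output_hidden_states=True)

def first_token_acts(model, texts):
    """Mean residual-stream activation at LAYER over the first N_TOKENS tokens."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids[:, :N_TOKENS]
        with torch.no_grad():
            hs = model(ids).hidden_states[LAYER]   # (1, seq, d_model)
        acts.append(hs.mean(dim=1))
    return torch.cat(acts).mean(dim=0)             # (d_model,)

random_texts = ["The weather today is", "In a small town,"]  # placeholder corpus
diff = first_token_acts(ft, random_texts) - first_token_acts(base, random_texts)

# Steer by adding the activation difference to the residual stream at LAYER
# during generation (applied to the base model here; an assumption).
def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * diff                    # assumed steering strength
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = base.model.layers[LAYER].register_forward_hook(steering_hook)
ids = tok("Tell me something.", return_tensors="pt").input_ids
print(tok.decode(base.generate(ids, max_new_tokens=40)[0]))
handle.remove()
```

Under these assumptions, the generated continuation would tend toward the format and content of the finetuning data, which is the readable trace the abstract describes.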