Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

• Original White Paper (PDF)

Core Thesis
The paper introduces “Steer-to-Detect” (S2D), a two-stage framework for the passive detection of LLM-generated text. It argues that the raw hidden representations of an LLM carry signals that distinguish human-written from machine-generated text, but that the two classes often overlap in this space. S2D amplifies these signals by “steering” the hidden representations of a frozen observer LLM with a learned vector, increasing class separability before a hypothesis test is applied.

Innovation
The primary innovation is the use of activation steering as a detection mechanism. In Phase I, S2D learns a universal steering vector $\mathbf{v}$ by maximizing the log-likelihood of class-conditional von Mises–Fisher (vMF) distributions on the unit hypersphere. This vector is injected into the hidden states of the frozen observer model during the forward pass, reshaping the geometry of the latent space so that the two classes separate more cleanly. In Phase II, detection is performed with a log-likelihood ratio test that projects the steered representations onto a discriminative direction.

Key Results
- Robust Separability: S2D performs strongly and consistently in both in-distribution and out-of-distribution (OOD) settings, outperforming train-free detectors (e.g., Binoculars) and train-based detectors (e.g., RoBERTa).
- Theoretical Guarantees: The authors provide finite-sample, high-probability guarantees for Type I error control (false positives) and explicit upper bounds on excess Type II error.
- Resilience: The system is robust to adversarial perturbations, paraphrasing, and variations in input length, all of which are common failure points for log-probability-based detectors.
- Calibration: An independent calibration sample is used to set the threshold $\tau$, which allows strict control over the false positive rate. This is critical for high-stakes applications such as academic integrity.

Implication
S2D shifts the focus of LLM detection from analyzing the “output” (logits/tokens) to manipulating the “internal state” (hidden representations). By treating the observer LLM as a feature extractor that can be steered, it offers a more flexible and theoretically grounded approach to detection. This suggests that the “fingerprint” of LLM generation lies not only in the choice of words but also in structural properties of the activation space, which can be amplified and isolated.

Verdict
Medium to High Impact. While the “arms race” between generators and detectors is perpetual, S2D provides a rigorous theoretical framework for signal amplification and error control that is more robust than previous heuristic methods.
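
The Phase II test and the calibrated threshold $\tau$ described above can be illustrated with a minimal sketch. It rests on one standard fact: a vMF density is proportional to $\exp(\kappa\,\boldsymbol{\mu}^\top \mathbf{x})$, so the log-likelihood ratio between two vMF classes is affine in $\mathbf{x}$, with direction $\kappa_1\boldsymbol{\mu}_1 - \kappa_0\boldsymbol{\mu}_0$. Everything else here is an assumption made for illustration and is not taken from the paper: the function names, the additive form of steering, the fixed concentrations, the placeholder steering vector, the synthetic data standing in for observer hidden states, and the plain empirical-quantile rule for $\tau$, which is simpler than the paper's finite-sample procedure.

```python
import numpy as np


def to_sphere(h):
    """Normalize row vectors to unit length (points on the hypersphere)."""
    return h / np.linalg.norm(h, axis=-1, keepdims=True)


def steer(h, v):
    """Assumed additive steering: shift by v, then renormalize."""
    return to_sphere(h + v)


def fit_direction(z_human, z_machine, kappa_human=1.0, kappa_machine=1.0):
    """Direction w of the vMF log-likelihood ratio (machine vs. human).

    Mean directions are normalized sample means; the concentrations are
    passed in rather than estimated (a simplification for this sketch).
    """
    mu_h = to_sphere(z_human.mean(axis=0))
    mu_m = to_sphere(z_machine.mean(axis=0))
    return kappa_machine * mu_m - kappa_human * mu_h


def calibrate_threshold(human_scores, alpha):
    """Order statistic tau such that at most an alpha fraction of the
    calibration scores lie strictly above it (plain empirical quantile)."""
    s = np.sort(np.asarray(human_scores))
    k = int(np.ceil((1.0 - alpha) * len(s))) - 1
    return s[min(max(k, 0), len(s) - 1)]


def detect(z, w, tau):
    """Flag a steered representation as machine-generated when w.z > tau."""
    return z @ w > tau


def demo(seed=0, n=200, d=8, alpha=0.05):
    """Synthetic stand-in for observer hidden states; not real model data."""
    rng = np.random.default_rng(seed)
    e = np.eye(d)
    v = 0.2 * e[2]  # placeholder; in S2D this vector is learned in Phase I

    def draw(center):
        # Noisy cluster around `center`, then steered onto the sphere.
        return steer(center + 0.1 * rng.normal(size=(n, d)), v)

    human_train, machine_train = draw(e[0]), draw(e[1])
    human_calib, machine_test = draw(e[0]), draw(e[1])

    w = fit_direction(human_train, machine_train)
    tau = calibrate_threshold(human_calib @ w, alpha)

    fpr = float(np.mean(detect(human_calib, w, tau)))   # on calibration set
    tpr = float(np.mean(detect(machine_test, w, tau)))  # on held-out machine set
    return fpr, tpr
```

Calling `demo()` returns the false positive rate on the calibration set and the detection rate on a held-out synthetic machine set. The constant term of the log-likelihood ratio is dropped because it only shifts the score and is absorbed into $\tau$ during calibration.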