Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

Core Thesis
The paper introduces “Steer-to-Detect” (S2D), a two-stage framework for the passive detection of LLM-generated text. It argues that while raw hidden representations of LLMs contain signals that distinguish human-written text from machine-generated text, these signals often overlap. S2D aims to amplify these signals by “steering” the hidden representations of a frozen observer LLM using a learned vector to increase class separability before applying a hypothesis test.

Innovation
The primary innovation is the use of activation steering as a detection mechanism. In Phase I, S2D learns a universal steering vector $\mathbf{v}$ by maximizing the log-likelihood of class-conditional von Mises–Fisher (vMF) distributions on the unit hypersphere. This vector is injected into the hidden states of a frozen observer model during the forward pass, effectively reshaping the geometry of the latent space to better separate the two classes. In Phase II, detection is performed via a log-likelihood ratio test, projecting the steered representations onto a discriminative direction.

Key Results
- Robust Separability: S2D demonstrates strong and consistent performance across in-distribution and out-of-distribution (OOD) settings, outperforming traditional train-free (e.g., Binoculars) and train-based (e.g., RoBERTa) detectors.
- Theoretical Guarantees: The authors provide finite-sample, high-probability guarantees for Type I error control (false positives) and explicit upper bounds on excess Type II error.
- Resilience: The system shows significant robustness to adversarial perturbations, paraphrasing, and variations in input length, which are common failure points for log-probability-based detectors.
- Calibration: The use of an independent calibration sample to set the threshold $\tau$ allows for strict control over false positive rates, which is critical for high-stakes applications like academic integrity.

Implication
S2D shifts the focus of LLM detection from analyzing the “output” (logits/tokens) to manipulating the “internal state” (hidden representations). By treating the observer LLM as a feature extractor that can be steered, it provides a more flexible and theoretically grounded approach to detection. This suggests that the “fingerprint” of LLM generation is not just in the choice of words, but in the structural properties of the activation space, which can be amplified and isolated.

Verdict
Medium to High Impact. While the “arms race” between generators and detectors is perpetual, S2D provides a rigorous theoretical framework for amplification and error control that is more robust than previous heuristic methods.
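
To make the two phases under Innovation and the calibration step under Key Results more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes that steering is a plain additive shift followed by projection onto the unit sphere, and that both vMF classes share one concentration `kappa`, so the log-likelihood ratio reduces to a linear projection. It also sets the threshold with a simple conformal-style order statistic, which is a stand-in for whatever rule the paper actually uses. All function and parameter names (`steer`, `llr_score`, `calibrate_threshold`, `detect`, `mu_human`, `mu_machine`) are hypothetical.

```python
import math
import numpy as np


def steer(h, v):
    """Shift hidden states h by steering vector v, then normalise to unit length.

    Assumption: additive steering followed by projection onto the hypersphere.
    h has shape (d,) or (n, d); v has shape (d,).
    """
    z = np.asarray(h, dtype=float) + np.asarray(v, dtype=float)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)


def llr_score(x, mu_human, mu_machine, kappa):
    """vMF log-likelihood ratio of 'machine' versus 'human'.

    Assumption: both classes share the same concentration kappa, so the
    normalising constants cancel and the ratio is
    kappa * (mu_machine - mu_human)^T x, a projection onto one direction.
    """
    direction = np.asarray(mu_machine, dtype=float) - np.asarray(mu_human, dtype=float)
    return kappa * (np.asarray(x, dtype=float) @ direction)


def calibrate_threshold(human_scores, alpha):
    """Pick tau from held-out human-written scores (assumed rule, not the paper's).

    Uses the k-th smallest score with k = ceil((n + 1) * (1 - alpha)).
    If the calibration sample is too small for the requested alpha,
    returns +inf so that nothing is flagged.
    """
    s = np.sort(np.asarray(human_scores, dtype=float))
    n = len(s)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return float(s[k - 1])


def detect(x, mu_human, mu_machine, kappa, tau):
    """Flag a steered representation as machine-generated when its score exceeds tau."""
    return llr_score(x, mu_human, mu_machine, kappa) > tau
```

In use, one would extract hidden states from the frozen observer model, pass them through `steer`, score a held-out set of human-written texts with `llr_score`, and feed those scores to `calibrate_threshold` before calling `detect` on new inputs. Keeping the calibration sample separate from the data used to fit the steering vector and class means is what allows the false positive rate to be controlled independently of how well Phase I was trained.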