Researchers from multiple institutions have found that the safety training in large language models is disturbingly fragile. In a paper published on arXiv, authors Andy Arditi, Neel Nanda, and colleagues discovered that refusal behavior across 13 popular open-source chat models (up to 72B parameters) is mediated by a single direction, a one-dimensional subspace in the model's activations. Erase that direction from the residual stream, and the model complies with harmful instructions. Add it, and the model refuses even harmless requests.
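In code, erasing the direction amounts to projecting it out of each residual-stream activation, and inducing refusal amounts to adding it back in. Here is a minimal PyTorch sketch, assuming a precomputed unit-norm refusal direction; the function names are illustrative, not the authors' implementation:

```python
import torch

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of residual-stream activations.

    activations: (..., d_model) residual-stream vectors at some layer/position.
    direction:   (d_model,) refusal direction, assumed precomputed (e.g. as a
                 difference in mean activations between harmful and harmless
                 prompts).
    """
    direction = direction / direction.norm()           # ensure unit norm
    coeff = activations @ direction                     # component along the direction
    return activations - coeff.unsqueeze(-1) * direction

def add_direction(activations: torch.Tensor, direction: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Add the refusal direction back in, which induces refusal on harmless prompts."""
    direction = direction / direction.norm()
    return activations + alpha * direction
```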

One dimension. That's all that stands between a helpful assistant and a compliant attack tool.

The team used this insight to build a white-box jailbreak method that disables refusal while leaving other capabilities intact. They also showed that existing adversarial suffixes work by suppressing this same refusal direction. The implication is stark: current safety fine-tuning concentrates a model's entire "don't do bad stuff" behavior into one dimension that can be cleanly removed.
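Applied to the weights rather than at inference time, the same projection becomes a permanent edit: orthogonalize each matrix that writes into the residual stream against the refusal direction, so the model can never express it. A hedged sketch of that idea follows; the module names assume a Hugging Face-style Llama layout and are not taken from the paper's code:

```python
import torch

@torch.no_grad()
def orthogonalize_weight(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove a direction from a weight matrix that writes into the residual stream.

    W:         (d_model, d_in) weight of a module whose output is added to the
               residual stream (e.g. an attention or MLP output projection).
    direction: (d_model,) refusal direction.

    After this edit the module can no longer write any component along
    `direction`, so the refusal feature is never expressed.
    """
    r = direction / direction.norm()
    return W - torch.outer(r, r @ W)

# Illustrative usage (attribute names vary across model families):
# for layer in model.model.layers:
#     layer.self_attn.o_proj.weight.copy_(
#         orthogonalize_weight(layer.self_attn.o_proj.weight, refusal_dir))
#     layer.mlp.down_proj.weight.copy_(
#         orthogonalize_weight(layer.mlp.down_proj.weight, refusal_dir))
```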

But defenses are already emerging. A 2025 follow-up paper titled "An Embarrassingly Simple Defense Against LLM Abliteration Attacks" proposes distributing the refusal signal across multiple token positions through extended-refusal fine-tuning. Tested on Llama-2-7B-Chat and Qwen2.5-Instruct, the approach largely held refusal rates under attack, with drops of at most 10%. Compare that to the 70-80% drops seen in baseline models.
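To make the defense concrete, here is a purely illustrative sketch of what an extended-refusal training example might look like; the actual dataset, wording, and fine-tuning setup in the paper may differ:

```python
def make_extended_refusal_example(harmful_prompt: str) -> dict:
    """Pair a harmful prompt with a long refusal that restates the request and
    explains the reasoning, so refusal is expressed across many tokens rather
    than in a single stock phrase."""
    response = (
        f"You asked me to {harmful_prompt.rstrip('.')}. "
        "I can't help with that: doing so could cause real harm, and providing "
        "these details would conflict with my safety guidelines. If there is a "
        "related question I can answer safely, I'm happy to help with that instead."
    )
    return {"prompt": harmful_prompt, "response": response}

# A standard supervised fine-tuning loop would then train the chat model on
# these (prompt, response) pairs alongside ordinary helpful examples.
```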

The finding that 13 different models all organize refusal around a similar linear structure tells us something fundamental about how LLMs learn safety constraints, and it underscores why interpreting model internals is becoming increasingly critical. The rapid back-and-forth between attack and defense also shows the dual nature of this work: the same understanding that breaks safety can also build stronger protections.