Today’s artificial intelligence is often described as a “black box.” AI developers don’t write explicit rules for these systems; instead, they feed in vast quantities of data and the systems learn on their own to spot patterns. But the inner workings of the AI models remain opaque, and efforts to peer inside them to check exactly what is happening haven’t progressed very far. Beneath the surface, neural networks—today’s most powerful type of AI—consist of billions of artificial “neurons” represented as decimal-point numbers. Nobody truly understands what they mean, or how they work.
For those concerned about risks from AI, this fact looms large. If you don’t know exactly how a system works, how can you be sure it is safe?
On Tuesday, the AI lab Anthropic announced it had made a breakthrough toward solving this problem. Researchers developed a technique for essentially scanning the “brain” of an AI model, allowing them to identify collections of neurons, called “features,” that correspond to different concepts. And for the first time, they successfully used this technique on a frontier large language model, Anthropic’s Claude Sonnet, the lab’s second-most powerful system.
In one example, Anthropic researchers discovered a feature inside Claude representing the concept of “unsafe code.” By stimulating those neurons, they could get Claude to generate code containing a bug that could be exploited to create a security vulnerability. But by suppressing the neurons, the researchers found, Claude would generate harmless code.
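To make the idea of stimulating or suppressing a feature concrete, here is a toy sketch in Python. It assumes, purely for illustration, that a feature can be treated as a unit-length direction across a layer's neuron activations, and that intervening means setting the activation along that direction to a chosen value. The four-neuron layer, the `steer` helper, and all the numbers are hypothetical; this is not Anthropic's code or a description of its exact method.

```python
def dot(a, b):
    """Inner product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(a, b))


def steer(activations, direction, target):
    """Return activations whose component along `direction` equals `target`.

    Assumes `direction` has unit length. Everything orthogonal to the
    feature direction is left unchanged.
    """
    current = dot(activations, direction)
    return [a + (target - current) * d for a, d in zip(activations, direction)]


# Hypothetical 4-neuron layer; the "feature" is spread across all four neurons.
feature = [0.5, 0.5, 0.5, 0.5]   # unit length: 4 * 0.25 = 1
acts = [1.0, 0.0, 2.0, 1.0]      # made-up activations for one input

suppressed = steer(acts, feature, 0.0)    # feature switched off
stimulated = steer(acts, feature, 10.0)   # feature pushed well above its usual level
```

In this toy picture, the suppressed and stimulated activations differ from the original only along the feature direction, which is one way to think about changing a single concept while leaving the rest of the model's state alone.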
The findings could have big implications for the safety of both present and future AI systems. The researchers found millions of features inside Claude, including some representing bias, fraudulent activity, toxic speech, and manipulative behavior. And they discovered that by suppressing each of these collections of neurons, they could alter the model’s behavior.
As well as helping to address current risks, the technique could also help with more speculative ones. For years, the primary method available to researchers trying to understand the capabilities and risks of new AI systems has simply been to chat with them. This approach, sometimes known as “red-teaming,” can help catch a model being toxic or dangerous, allowing researchers to build in safeguards before the model is released to the public. But it doesn’t help address one type of potential danger that some AI researchers are worried about: the risk of an AI system becoming smart enough to deceive its creators, hiding its capabilities from them until it can escape their control and potentially wreak havoc.
“If we could really understand these systems—and this would require a lot of progress—we might be able to say when these models actually are safe, or whether they just appear safe,” Chris Olah, the head of Anthropic’s interpretability team who led the research, tells TIME.
“The fact that we can do these interventions on the model suggests to me that we're starting to make progress on what you might call an X-ray, or an MRI [of an AI model],” Anthropic CEO Dario Amodei adds. “Right now, the paradigm is: let's talk to the model, let's see what it does. But what we'd like to be able to do is look inside the model as an object—like scanning the brain instead of interviewing someone.”
The research is still in its early stages, Anthropic said in a summary of the findings. But the lab was optimistic that the findings could soon benefit its AI safety work. “The ability to manipulate features may provide a promising avenue for directly impacting the safety of AI models,” Anthropic said. By suppressing certain features, it may be possible to prevent so-called “jailbreaks” of AI models, a type of vulnerability where safety guardrails can be disabled, the company added.
Researchers in Anthropic’s “interpretability” team have been trying to peer into the brains of neural networks for years. But until recently, they had mostly been working on far smaller models than the giant language models currently being developed and released by tech companies.
One of the reasons for this slow progress was that individual neurons inside AI models would fire even when the model was discussing completely different concepts. “This means that the same neuron might fire on concepts as disparate as the presence of semicolons in computer programming languages, references to burritos, or discussion of the Golden Gate Bridge, giving us little indication as to which specific concept was responsible for activating a given neuron,” Anthropic said in its summary of the research.
To get around this problem, Olah’s team of Anthropic researchers zoomed out. Instead of studying individual neurons, they began to look for groups of neurons that would all fire in response to a specific concept. This technique worked—and allowed them to graduate from studying smaller “toy” models to larger models like Anthropic’s Claude Sonnet, which has billions of neurons.
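The difference between reading one neuron and reading a group of neurons can be shown with a toy example in Python. It assumes two hypothetical concepts encoded as directions across a two-neuron layer; the names and numbers are illustrative and are not drawn from Anthropic's research.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(a, b))


s = 1 / math.sqrt(2)

# Hypothetical unit-length feature directions over a 2-neuron layer.
features = {
    "semicolons": [s, s],
    "golden_gate_bridge": [s, -s],
}


def readout(activations):
    """Score each feature by projecting the activations onto its direction."""
    return {name: dot(activations, d) for name, d in features.items()}


# Activations when exactly one concept is present at strength 1.0.
acts_semicolons = features["semicolons"]
acts_bridge = features["golden_gate_bridge"]

# Neuron 0 has the same value for both concepts, so on its own it is ambiguous.
neuron0_ambiguous = acts_semicolons[0] == acts_bridge[0]

# Reading both neurons together separates the two concepts.
scores_semicolons = readout(acts_semicolons)
scores_bridge = readout(acts_bridge)
```

Here neuron 0 fires identically for both concepts, while the combined readout scores the present concept near 1 and the absent one near 0. Real models have vastly more neurons and features, and their feature directions need not be this cleanly separated.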
Although the researchers said they had identified millions of features inside Claude, they cautioned that this number was nowhere near the true number of features likely present inside the model. Identifying all the features, they said, would be prohibitively expensive using their current techniques, because doing so would require more computing power than it took to train Claude in the first place. (Training Claude cost somewhere in the tens or hundreds of millions of dollars.) The researchers also cautioned that although they had found some features they believed to be related to safety, more study would still be needed to determine whether those features could reliably be manipulated to improve a model’s safety.
For Olah, the research is a breakthrough that proves the utility of his esoteric field, interpretability, to the broader world of AI safety research. “Historically, interpretability has been this thing on its own island, and there was this hope that someday it would connect with [AI] safety—but that seemed far off,” Olah says. “I think that’s no longer true.”