The US wants Anthropic to secure its most powerful AI models. But one researcher says that is mathematically impossible
A new paper argues that AI guardrails can never be made completely foolproof, raising questions about government efforts to secure increasingly powerful AI systems.
Back in April, I flew down to Washington, D.C., for what was called the AI Security Policy Forum, where more than a dozen AI security practitioners, standards-setters, and policy experts gathered to wrestle with a deceptively simple question: What does it actually mean to secure an AI system?
It quickly became clear that some of the fundamentals remain unsettled, including how to measure whether an AI system is secure in the first place.
Those conversations have taken on new significance since Anthropic was forced to disable all access to its newest AI models, Fable 5 and Mythos 5, last Friday after the U.S. Commerce Department used national security export controls to bar the company from distributing the models to any foreign national. In response, Anthropic said it suspected that “perfect jailbreak resistance is not currently possible for any model provider” and that it was already focused on “defense in depth” (a cybersecurity strategy that uses multiple, overlapping layers of security controls to protect assets) combined with “thorough monitoring to quickly detect and shut down any successful attacks.”
Many of the security experts I spoke with in Washington expressed a similar view. No amount of hardening can make an advanced AI system completely secure, they argued. Jailbreaks are not a problem that can be permanently solved, only managed.
A mathematical proof building on Gödel
One attendee, however, has now taken that a step further. Apostol Vassilev, a senior scientist at the National Institute of Standards and Technology (NIST) and an expert in adversarial machine learning, has published what he argues is a mathematical proof of that position. Writing in the peer-reviewed journal IEEE Security & Privacy, Vassilev builds on the work of mathematician Kurt Gödel, whose incompleteness theorems established that any consistent formal system built on a finite set of rules and powerful enough to express basic arithmetic is incomplete: there will always be true statements that the system cannot prove.
Vassilev applies that same logic to AI security. The guardrails that govern an AI system’s behavior are essentially a finite set of rules. And just as Gödel showed that any such system will have unprovable truths, Vassilev argues there will always be a way to prompt an AI into disregarding its rules — it’s just a matter of finding it. In a NIST press release, he said: “What this proof shows is that there is no finite set of guardrails that is universally robust against adversarial prompts.”
Jamieson O’Reilly, an offensive security specialist and founder of Dvuln, said that Vassilev’s paper is a “solid piece of theoretical work,” and that “the math lines up with what we see in practice every day.” Language is infinitely ambiguous, he explained, while adversaries are creative. “No static set of rules or classifiers can cover every possible way someone can hide malicious intent in a prompt. I call it like playing chess with infinite pieces.”
Not everyone finds the proof convincing, however. One prominent researcher told me that there is a long history of misapplying Gödel’s theorems to domains where they don’t belong, and that the paper doesn’t focus on real-world security. Another said he was skeptical of papers claiming to prove “impossibility,” and that such claims risk misframing what people actually care about, which is simply that vulnerabilities must be found and dealt with.
Still, I think that the general public would be surprised to learn that AI systems are fundamentally insecure, and I was curious how concerned Vassilev is about that. I spoke to him on Monday, and he said that, like other cybersecurity experts, he is used to “living within imperfect systems.” Anthropic’s Mythos, for example, unearthed countless cybersecurity vulnerabilities that people didn’t even know existed. “But they were there all the time, so that means eventually somebody would have stumbled upon some of them and exploited them,” he said. “It’s just that with AI systems, it’s far easier for somebody to come along and exploit those weaknesses.”




