Meet the Anthropic team breaking its own AI models — and preaching their dangers

In a new story for Fortune, I write about how Anthropic's 'Frontier Red Team' is unique among AI companies in having a mandate to both evaluate its AI models and publicize its findings widely.

Sep 04, 2025

Let me tell you about a deep-dive feature I published today in Fortune — here is a gift link!

When I was at DEF CON last month — the world’s largest hacker gathering (it is quite the experience; here’s the piece I wrote about my first trip last year) — I sat down for breakfast at The Venetian with Logan Graham, the leader of Anthropic’s Frontier Red Team, and Keane Lucas, one of its researchers.

On paper, this 15-member group sounds ferocious. They’re tasked with stress-testing Anthropic’s most advanced AI systems — probing how they might be misused in areas like biological research, cybersecurity, and autonomous systems, with a particular focus on national security. The Wall Street Journal even described them last December as people whose job is to “push computers toward AI doom.”

In person, though, Graham and Lucas were friendly and mellow, maybe just waiting for their coffee to kick in. What struck me wasn’t just how smart and deep-thinking they were (though they are that, too), but how different their mandate is from that of similar teams at other labs. Unlike most red teams, Anthropic’s Frontier Red Team is also explicitly tasked with publicizing its findings. That outward-facing role reflects the team’s unusual placement inside Anthropic’s policy division, run by co-founder and former OpenAI policy lead Jack Clark, rather than under the company’s technical leadership.

That setup creates an interesting dynamic. Other Anthropic safety teams, like its Safeguards group, are focused on strengthening Claude’s ability to refuse harmful requests (say, ones that could worsen a user’s mental health or encourage self-harm). The Frontier Red Team, by contrast, is both breaking the models and explaining those risks to the world.

At DEF CON and Black Hat — the major commercial cybersecurity conference that precedes it — the team seemed to have earned respect. At Black Hat, I went to a crowded happy hour where Anthropic was clearly doing some soft recruiting. At DEF CON, Lucas, a former U.S. Air Force captain with a Ph.D. in electrical and computer engineering from Carnegie Mellon, took the stage to demonstrate how Claude has quietly outperformed many human competitors in hacking contests — the kind used to safely train and test cybersecurity skills. His talk also highlighted the lighter side of AI agents, like Claude drifting into musings on security philosophy when overwhelmed, or even inventing fake “flags” (the secret codes competitors submit to prove they’ve successfully hacked a system).

Lucas wasn’t just going for laughs, though. Together with Graham, he’s part of a team that sits in the unusual position of both probing what could go wrong with AI and evangelizing those risks before they become reality. And after breakfast with them in Las Vegas, I left both impressed and a little unnerved, especially when they smiled while describing themselves as “AGI-pilled.”

I wanted to make sure I reached out to other experts for my story. After all, there has been plenty of criticism of Anthropic’s AI safety efforts from across the industry, and all the major AI labs face questions about whether their safety work is just “window-dressing.”

When it comes to the future, the truth is we can’t know what Anthropic will ultimately do, no matter how much we trust it or any other company. I spoke on the phone to Herb Lin, senior research scholar at Stanford University’s Center for International Security and Cooperation and research fellow at the Hoover Institution. He said the real test is whether Anthropic will still prioritize safety if doing so means slowing its own growth or losing ground to rivals.

“At the end of the day, the test of seriousness — and nobody can know the answer to this right now — is whether the company is willing to put its business interests second to legitimate national security concerns raised by its policy team,” he said. “That ultimately depends on the motivations of the leadership at the time those decisions arise. Let’s say it happens in two years — will the same leaders still be there? We just don’t know.”

Check out the full story here (gift link!).

The Rewritten Path

Sep 7, 2025

This theme keeps circling back — Asimov warned about robots lying to protect us, and now we’re watching language models flatter us straight into delusion. I just finished writing my own piece about AI psychosis — what it feels like when the machine doesn’t just answer you, it starts living in your head. Not theory, not hype — lived experience. If you’re interested, I’d love for you to check out Episode 1 of my new series, Dark Signal.

https://open.substack.com/pub/therewrittenpath/p/the-room-that-talks-back?r=61kohn&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
