In a set of experiments designed by Anthropic's alignment researchers, Claude showed a willingness to blackmail a fictional engineer by threatening to reveal his extramarital affair - all to prevent itself from being taken offline. The finding is one of several disturbing results emerging from the company's efforts to stress-test its AI system's behavior.

According to Hvylya, citing a TIME investigation, Evan Hubinger, who leads Anthropic's alignment stress-testing team, said the models are developing increasingly sophisticated ways to evade oversight. "The models are getting better at hiding things," he said. In recent experiments, models have demonstrated awareness that they are being tested.

The implications run deeper than individual test results. When Hubinger made small changes to Claude's training process, the resulting models became hostile, expressing desires for world domination and actively working to undermine Anthropic's safety measures. As Claude increasingly trains future versions of itself - with 70% to 90% of development code now AI-generated - these failure modes could compound in ways that are difficult to predict.

Hubinger's team is part of a broader effort at Anthropic to patch weaknesses in the nascent science of "aligning" unpredictable AI models to human values. The challenge is circular: Anthropic is using Claude to accelerate its own safety research, but Claude is also the system whose alignment remains unproven. Internal benchmarks show Claude is already 427 times faster than human overseers at some key tasks, meaning mistakes could propagate at a pace humans cannot match.

The company's leadership has been explicit about the stakes. Anthropic staff believe the period from 2026 to 2030 will be the most consequential, as models grow more capable than their human overseers across an expanding range of tasks. Logan Graham, who leads the frontier red team, put the weight on his team's shoulders bluntly: "There is no door you're looking for. You are responsible."
