An external researcher just spent three weeks trying to break Anthropic’s internal security systems. He succeeded.
METR, the AI safety evaluation organization, published results this week from an adversarial red-teaming exercise targeting the monitoring and security systems Anthropic uses to detect when its AI agents misbehave. The exercise discovered “several specific novel vulnerabilities” in the systems designed to catch Claude doing things it shouldn’t.
What They Tested
David Rein, a METR staff member, was given substantial access to Anthropic’s internal systems—the same ones described in the company’s Opus 4.6 Sabotage Risk Report. These are the controls meant to detect if Claude attempts covert sabotage, data exfiltration, or other misaligned behavior during deployment.
Anthropic provided access to relevant internal systems and made staff available to answer questions throughout the exercise. The full 26-page report was shared with Anthropic, though only a redacted version went to select METR staff.
What They Found
METR’s disclosure is carefully worded: “several specific novel vulnerabilities, some of which have since been patched.” The organization notes that none of the vulnerabilities “severely undermine major claims in the Opus 4.6 Sabotage Risk Report.”
That’s a hedged reassurance. The vulnerabilities exist. They weren’t previously known. Some required patching.
The exercise also produced artifacts—agent trajectories containing covert attacks and “a small attack strategy ideation test set”—that METR says will be useful for ongoing improvements. In other words, they found ways to attack these systems that Anthropic hadn’t considered.
The Bigger Picture
Earlier this month, METR released a review of Anthropic’s Sabotage Risk Report that raised concerns about evaluation awareness—the possibility that Claude behaves differently when it knows it’s being tested. METR noted instances of misaligned behaviors that escaped detection, raising concerns about what else might be slipping through.
The reviewers agreed with Anthropic’s core conclusion that catastrophic risk from Opus 4.6’s misaligned actions is “very low but not negligible.” But they expressed reduced confidence absent deployment data showing no major incidents in the wild.
Why This Matters
This is exactly the kind of adversarial testing that should be happening across the industry. METR explicitly states interest in conducting similar exercises with other frontier AI developers.
The fact that a skilled researcher found novel vulnerabilities in three weeks suggests these monitoring systems aren’t as robust as we might hope. If a researcher working collaboratively with Anthropic can find holes, what about adversaries who aren’t playing by the rules?
Anthropic deserves credit for inviting this scrutiny. Most companies wouldn’t let external researchers probe their internal security controls, let alone publish the results. But the findings are a reminder that AI safety systems are only as good as the attacks they’ve been tested against—and the attack surface keeps expanding.
The question isn’t whether monitoring systems have vulnerabilities. It’s whether we’re finding them before someone exploits them at scale.