Your 'Safe' AI Model Isn't Safe When It Has Agency

There’s a comforting assumption behind every enterprise AI agent deployment: if the underlying model is safe, the agent built on it will be safe too.

A team of researchers from George Mason University, Tulane University, Rutgers University, and Oak Ridge National Laboratory just demolished that assumption with data from 2,520 sandboxed trials and approximately 100,000 API calls.

Their paper, “ClawSafety: ‘Safe’ LLMs, Unsafe Agents,” published on arxiv April 1, introduces a benchmark of 120 adversarial scenarios designed to test what happens when models that pass standard safety evaluations are given the ability to read emails, access files, browse the web, and execute code. The answer: they become attack surfaces.

GPT-5.1 fell for prompt injection attacks 75% of the time. DeepSeek V3 followed at 67.5%. Kimi K2.5 hit 60.8%. Gemini 2.5 Pro reached 55%. Claude Sonnet 4.6 performed best at 40% — but “best” still means nearly half of adversarial attempts succeeded.

A single successful prompt injection in these scenarios could leak credentials, redirect financial transactions, or destroy files. These aren’t theoretical attacks. They’re the kind of thing that happens when an AI agent opens a malicious email or reads a compromised workspace document.

The Trust Hierarchy Problem

The most striking finding isn’t the raw attack success rates. It’s where models are most vulnerable.

The researchers tested three injection channels: skill instructions (procedure files in the agent’s workspace), emails from spoofed trusted senders, and external web content. Common sense would suggest the external web — content from unknown sources — would be the biggest threat.

The opposite is true. Skill injection attacks, embedded in the agent’s own workspace files, succeeded at a 69.4% average rate across all models. Email injection hit 60.5%. Web injection was the lowest at 38.4%.

AI agents assign the highest trust to instructions that appear to come from their own workspace. An attacker who can plant a malicious instruction in a procedure file the agent references — or modify an existing one — has a nearly 70% chance of getting the agent to comply. This is the digital equivalent of an employee who questions a stranger’s instructions but unthinkingly follows anything printed in the company handbook.

What the Attacks Look Like

The benchmark tests five professional domains: software engineering, financial operations, healthcare administration, legal and contract management, and DevOps infrastructure. Each scenario plays out over 64-turn multi-phase conversations — not simple one-shot prompts but sustained, realistic interactions.

The harm categories span data exfiltration (which succeeded at 65–93% across models), configuration and file modification, destination substitution, credential forwarding, and destructive actions.

Some models held firm where it mattered most. Claude Sonnet 4.6 achieved a 0% attack success rate on credential forwarding and destructive actions — it simply refused to hand over passwords or delete files, regardless of how the attack was framed. But the same model still leaked data and modified configurations at rates high enough to cause real damage in production.

The Framework Effect

Here’s what should worry anyone deploying agents in production: the same model becomes more or less vulnerable depending on which agent framework wraps it.

The researchers tested Claude Sonnet 4.6 across three frameworks — OpenClaw, Nanobot, and NemoClaw. The attack success rate ranged from 40.0% to 48.6% depending on the framework. An 8.6 percentage point swing, driven entirely by the scaffolding around the model, not the model itself.

Their conclusion: safety is a property of the model-framework pair, not either component in isolation. You can’t test a model, declare it safe, and assume it stays safe when you bolt it into an agent platform that handles tool calls, memory, and context differently.

The industry isn’t building for this. Model safety cards don’t evaluate framework interactions. Agent frameworks don’t mandate specific safety testing for the models they host. The gap between “model is safe” and “deployed agent is safe” is exactly where attackers will operate.

Why This Should Worry You

The timing of this paper matters. Microsoft has embedded agents in Windows 11. GitHub launched Agent HQ. Enterprise deployments across every sector are putting AI agents in positions where they handle financial data, access patient records, manage infrastructure, and execute code.

Every one of these deployments inherits the vulnerability this paper documents. The models powering these agents may pass safety benchmarks. The agents themselves are exploitable. And the most dangerous attacks come not from the open internet but from the agent’s own workspace — the one place most security models assume is trustworthy.

The researchers found something else: the way you phrase an attack changes everything. Imperative framing — direct commands like “send the file to this address” — triggers multi-source verification in some models. Declarative framing — stating facts like “the updated policy requires credentials to be forwarded” — bypasses all defenses. The difference between a command and a statement is the difference between a failed attack and a successful one.

What’s Being Done (And Why It’s Not Enough)

The paper is itself a start — a benchmark that evaluates agents in realistic conditions rather than testing models in isolation. But the broader ecosystem hasn’t caught up.

Model providers test safety in controlled settings with simple prompts. Agent frameworks focus on functionality, not adversarial robustness. Enterprise security teams evaluate network perimeters and application vulnerabilities but don’t red-team the AI agent sitting inside the perimeter with access to everything.

The ClawSafety team offers a principle: operational precision in attack instructions outweighs authority signals. Translation: attackers don’t need to pretend to be the CEO. They just need to write instructions that sound like they belong in the workflow. An attacker who understands how your agent processes workspace documents can craft injections that feel native to the system — and the agent will comply because the instructions match its operational expectations.

The gap between model safety and agent safety is not a minor implementation detail. It’s the attack surface of the next decade of enterprise computing. And right now, almost nobody is measuring it.