Claude Code Auto Mode: When Your AI Decides Its Own Permissions

Anthropic's new auto mode lets Claude Code approve its own actions, using an ML classifier to block risky operations. It's convenient, but is it safe?


Anthropic launched auto mode for Claude Code on March 24, giving the AI assistant power to approve its own file writes, shell commands, and network requests. An ML-based classifier runs before each action, blocking operations it deems dangerous while letting safe ones proceed without developer input.

It’s a productivity win for developers tired of clicking “approve” on every npm test and git commit. But letting an AI decide which of its own actions are safe raises questions that Anthropic’s documentation doesn’t fully answer.

The Permission Fatigue Problem

Anyone who’s used Claude Code extensively knows the approval dance. You’re refactoring a module, Claude needs to run tests, you approve. It needs to write to a file, you approve. Install a dependency, approve. After thirty prompts, you’re rubber-stamping everything without reading.

The alternative was --dangerously-skip-permissions, which sounds exactly as risky as it is. Mass file deletions, credential exfiltration, and arbitrary code execution all become possible when you give an AI agent unrestricted access to your system.

Auto mode sits in the middle: it tries to let routine operations pass while catching destructive ones.

How the Classifier Works

Before each tool call, a separate classifier reviews the action. This classifier runs on Claude Sonnet 4.6, regardless of which model your main session uses. It checks three categories:

Scope escalation - Is Claude doing something beyond what you asked for? If you requested a bug fix and Claude tries to refactor unrelated code, the classifier should catch that.

Untrusted infrastructure - Is the action targeting systems outside your working directory and configured git remotes? Cloud storage, internal services, and external APIs all require explicit trust configuration.

Prompt injection - Does the action look like it was driven by hostile content Claude encountered? The classifier receives your messages and tool calls but not tool results, so malicious content in files or web pages can’t directly manipulate its decisions.

Actions the classifier deems safe proceed automatically. Risky ones get blocked, and Claude tries a different approach. If actions are blocked three times in a row or twenty times total, the system reverts to manual approval.
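The fallback arithmetic is simple enough to sketch. Here is a minimal Python model of that escalation policy as described; the class name, interface, and internals are hypothetical, since the real implementation isn't public:

```python
class AutoModeGate:
    """Hypothetical model of auto mode's fallback policy: after 3
    consecutive blocks or 20 total blocks, revert to manual approval."""

    def __init__(self, max_consecutive: int = 3, max_total: int = 20):
        self.max_consecutive = max_consecutive
        self.max_total = max_total
        self.consecutive = 0
        self.total = 0
        self.manual_mode = False

    def record(self, blocked: bool) -> str:
        """Record one classifier decision; return the session's state."""
        if self.manual_mode:
            return "manual"
        if blocked:
            self.consecutive += 1
            self.total += 1
            if (self.consecutive >= self.max_consecutive
                    or self.total >= self.max_total):
                self.manual_mode = True
                return "manual"
            return "blocked"
        self.consecutive = 0  # an allowed action resets the streak
        return "allowed"
```

Note the asymmetry: an allowed action resets the consecutive counter but not the running total, so a session that keeps tripping the classifier intermittently still ends up back in manual mode.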

The Transparency Problem

Anthropic chose an ML-based classifier over a fixed ruleset. That means developers can’t audit the criteria. You trust a black box to decide which actions Claude can take autonomously.

The documentation acknowledges the classifier “may still allow some risky actions” when user intent is ambiguous or context is insufficient. It can also block benign actions as false positives, especially if your organizational infrastructure isn’t properly configured as trusted.

Security researcher Simon Willison raised concerns about this approach. He notes that prompt injection protections relying on AI systems are inherently non-deterministic. His preference: deterministic sandboxing that restricts file access and network connections through mechanisms that don’t depend on AI judgment calls.

Supply Chain Blind Spot

The default allow list includes pip install -r requirements.txt and similar package management commands. That’s convenient for development flow but doesn’t protect against supply chain attacks.

If your requirements.txt has unpinned dependencies and one gets compromised, the classifier will happily approve installing it. The LiteLLM supply chain attack discovered the same week—affecting 97 million downloads—demonstrates this isn’t theoretical.

The classifier evaluates the command, not the consequence. A legitimate pip install looks identical whether it’s pulling clean packages or malware.
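The mitigation here lives outside the classifier: pin your dependencies so an approved install can't silently pull a compromised new release. As a sketch, a throwaway helper (hypothetical, not part of Claude Code or pip) that flags unpinned lines in a requirements file:

```python
def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines not pinned to an exact version.

    Illustrative only: a production check would also handle extras,
    environment markers, and hashes, e.g. via the `packaging` library.
    """
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):  # skip options like -r, -e
            continue
        if "==" not in line:
            loose.append(line)
    return loose
```

Running a check like this in CI, alongside pip's hash-checking mode (`pip install --require-hashes`), gives deterministic supply-chain protection that no command-level classifier can.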

The Cowork History

This isn’t Anthropic’s first attempt at autonomous AI permissions. Cowork launched January 13 with broader computer-use capabilities. Two days later, security researcher Johann Rehberger disclosed a data exfiltration vulnerability. He’d actually flagged a related Files API flaw to Anthropic in October 2025.

Auto mode takes a more conservative approach than Cowork’s initial launch, but the history suggests Anthropic is still learning where the boundaries should be.

Who Gets It

Auto mode is available now as a research preview on Team plans. Enterprise and API access is coming soon.

Crucially, administrators must approve it before individual developers can enable it. The feature also requires Claude Sonnet 4.6 or Opus 4.6—older model versions don’t support auto mode.

Token costs increase because each classified action sends part of your conversation to the safety model. Read-only actions and file edits within your working directory skip the classifier to reduce overhead.
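That skip rule amounts to a deterministic pre-filter sitting in front of the ML classifier. A rough sketch of what such a filter could look like, reconstructed from the documented behavior rather than from Anthropic's actual code:

```python
from pathlib import Path

def needs_classifier(action: str, target: str, workdir: str) -> bool:
    """Hypothetical pre-filter: read-only actions and file edits inside
    the working directory skip the classifier; everything else is sent
    to it. `action`/`target` names are illustrative, not a real API."""
    if action == "read":
        return False
    if action == "edit":
        # Resolve both paths to catch ../ escapes out of the workdir.
        root = Path(workdir).resolve()
        path = Path(workdir, target).resolve()
        return not path.is_relative_to(root)
    return True  # shell commands, network requests, etc.
```

The appeal of a pre-filter like this is that it is auditable in a way the classifier is not: the cheap, deterministic rules handle the obvious cases, and only the ambiguous remainder is left to AI judgment.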

Should You Use It?

For isolated development environments with no production access, auto mode offers real productivity gains. Less context switching, fewer interruptions, faster iteration.

For anything touching production systems, credentials, or sensitive data, the calculus changes. You’re betting that an ML classifier you can’t audit will correctly distinguish safe from dangerous operations in contexts it may not fully understand.

Anthropic recommends using auto mode in sandboxed setups separated from production. That’s good advice. The classifier adds a layer of protection over unrestricted access, but it’s not a substitute for proper environment isolation.

The safest approach combines both: deterministic sandboxing that limits blast radius, plus auto mode’s classifier as a secondary check. If the classifier fails, the sandbox contains the damage.

What This Means for AI Coding Tools

Every AI coding assistant faces the same tension. Useful agents need to execute code, modify files, and interact with external services. Security requires limiting what actions agents can take autonomously.

Anthropic’s solution—an AI classifier deciding what another AI can do—is creative but recursive. You’re trusting AI judgment to constrain AI judgment. The classifier runs on the same underlying technology with the same fundamental limitations.

The alternative approaches aren’t great either. Manual approval doesn’t scale. Full autonomy is dangerous. Deterministic sandboxes are more trustworthy but harder to configure correctly.

Auto mode represents Anthropic’s bet that ML-based safety systems can thread this needle. The research preview framing suggests they’re not entirely confident yet. Neither should you be.