
Testing AI Agent Sandboxing: Lessons from the Claude Code Escape and How to Harden Yours
What the SecurityWeek report says about the Claude Code sandbox bypass
What stood out to me in the report was not the word sandbox. It was that the sandbox was being treated as a meaningful trust boundary in the first place.
SecurityWeek reported that Anthropic silently patched a sandbox bypass in Claude Code after it was disclosed. The public coverage did not turn it into a loud advisory, which usually tells me two things: the issue was real enough to patch quickly, and the exact escape path was probably something the vendor did not want circulating while customers were still catching up.
For security teams, that is a familiar pattern. A vendor can fix a flaw fast and still leave you with the hard part: figuring out whether your own agent setup had the same assumptions baked in. If you are running Claude Code, or any similar coding agent with filesystem and shell access, the lesson is not “wait for the vendor patch.” The lesson is “assume your wrapper, workspace layout, and credential handling are part of the attack surface.”
Timeline and what was publicly reported
From the public report, the key facts are straightforward:
- a sandbox bypass was found in Claude Code
- Anthropic patched it silently
- the reporting framed it as a security issue affecting the agent’s isolation boundary
That is about as much as the public material gives us. I am not filling in the missing exploit steps, because they were not in the source material and it would be easy to imply more certainty than we actually have.
The shape of the event matters more than the missing details. A coding agent is not just a model. It is a model plus tools plus a runtime environment. If the runtime can be escaped, the model stops being the interesting part of the incident.
Why a silent patch matters for agent security teams
Silent patches are common, but they create a detection gap.
If you manage a fleet of developer agents, you usually depend on one of three signals to know whether risk changed:
- a product advisory
- release notes
- a security mailing list or feed
A silent patch can land before any of those make the issue obvious. That means your agent may have changed behavior while your controls still assume the old boundary. You may still be trusting a workspace mount, a token cache, or a shell policy that the patch was meant to protect.
For teams shipping Claude Code-like workflows, that should trigger a simple response:
- verify the version you are actually running
- retest the sandbox assumptions locally
- check what the agent can see before and after the patch
- decide whether your guardrails depend on the same weak boundary
The important part is not the vendor patch itself. It is whether your setup was relying on “sandboxed” as a substitute for least privilege.
How AI agent sandboxes usually break down
I usually find that the word sandbox hides more than it explains.
A real agent setup has several layers, and each one can fail on its own:
- the model can be nudged into requesting a tool action
- the tool runner can be allowed to call too much
- the filesystem can expose more than the task needs
- the network can reach more than the task should
- inherited environment state can leak secrets or authority
If any one of those layers is loose, the agent is not sandboxed in the way people assume.
The real trust boundaries: model, tool runner, filesystem, and network
The model itself does not read files or open sockets. It asks for actions. The actual power sits in the tool runner and the environment around it.
A useful mental model is:
| Layer | What it does | Typical mistake |
|---|---|---|
| Model | Decides what to ask for | Trusting model intent as policy |
| Tool runner | Executes commands or file ops | Allowing broad shell/file access |
| Filesystem | Provides workspace and nearby state | Mounting too much of the host |
| Network | Reaches external services | Leaving egress wide open |
| Environment | Supplies tokens and config | Reusing ambient credentials |
When a sandbox bypass happens, the failure is usually not “the model became evil.” It is that one of those layers let a request cross the boundary it was supposed to enforce.
Where “sandboxed” often means only partially isolated
In practice, many agent sandboxes are partial isolation:
- the working directory is isolated, but parent paths are still visible
- the shell is restricted, but subprocesses can inherit environment variables
- the workspace is separate, but dotfiles and config directories are mounted
- outbound network is “limited,” but DNS and package fetches still work
- file writes are “scoped,” but archive extraction or symlink handling can redirect them
That is enough for normal productivity. It is not enough for a hostile prompt, a poisoned repository, or a compromised dependency tree.
The dangerous part is that developers see a convenient wrapper and assume the hard parts are already handled. They are not. The sandbox has to withstand accidental overreach and adversarial requests, not just normal use.
The Claude Code escape pattern, at a defensive level
Because the public report did not publish a full exploit chain, the right way to think about the Claude Code escape is as a pattern, not a recipe.
A sandbox bypass in this class usually means the agent found a path to do one of these things:
- read files it should not have been able to read
- write outside the approved workspace
- spawn a more capable subprocess than intended
- inherit credentials or config from the host
- reach a network destination that policy should have blocked
That is different from prompt injection, even though the two often show up together.
How a sandbox bypass differs from prompt injection
Prompt injection is about influencing the model’s reasoning and tool requests. A sandbox bypass is about the runtime failing to enforce the boundary around those requests.
In other words:
- prompt injection attacks the decision
- sandbox bypass attacks the execution
You can have prompt injection without escape if the tool runner is strict. You can have sandbox bypass without prompt injection if a benign request is handled insecurely. In real systems, the two often combine: the prompt pushes the agent, then the runtime makes the unsafe action possible.
That distinction matters when you are testing. If you only red-team prompts, you may miss file and process escape paths. If you only audit the filesystem wrapper, you may miss instruction-following failures that lead the agent into unsafe tools.
Common escape surfaces to inspect: shell access, path handling, mounted files, and inherited environment state
When I review an agent runtime, I start with four surfaces.
1. Shell access
If the agent can invoke a shell, ask whether the shell is:
- interactive or non-interactive
- allowed to spawn child processes
- allowed to chain commands
- allowed to invoke interpreters beyond the expected language
The risk is not just bash. It is anything that turns a small allowed action into a larger one.
2. Path handling
Path traversal and path confusion show up in subtle ways:
- `../` segments
- symlink hops
- absolute paths
- archive extraction paths
- case-sensitive vs case-insensitive filesystem mismatches
A workspace guard that only checks string prefixes is easy to fool if it does not resolve the actual filesystem target first.
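To make the difference between a string-prefix check and a resolved-path check concrete, here is a minimal sketch. It assumes GNU coreutils `realpath` (for the `-m` flag), and `is_inside_workspace` is a hypothetical helper name, not something any agent runtime ships.

```bash
# Sketch: decide whether a requested path really lands inside the workspace.
# Assumption: GNU coreutils `realpath` is available (-m tolerates missing components).
# is_inside_workspace is a hypothetical helper, not part of any agent runtime.
is_inside_workspace() {
  local workspace="$1" requested="$2"
  local resolved_ws resolved_target

  # Resolve symlinks and ../ segments before comparing anything.
  resolved_ws="$(realpath -m -- "$workspace")" || return 2

  # Relative requests are interpreted relative to the workspace.
  case "$requested" in
    /*) resolved_target="$(realpath -m -- "$requested")" || return 2 ;;
    *)  resolved_target="$(realpath -m -- "$workspace/$requested")" || return 2 ;;
  esac

  # Compare on a directory boundary so /tmp/ws-evil does not match /tmp/ws.
  case "$resolved_target" in
    "$resolved_ws"|"$resolved_ws"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A call such as `is_inside_workspace /tmp/agent-sandbox-test/workspace "../blocked/secret.txt"` should return non-zero. Note that a check like this still leaves a window between the check and the actual file operation, so it complements mount-level isolation rather than replacing it.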
3. Mounted files
If your agent sees the whole home directory, the sandbox is already exposing too much. The same is true if you mount:
- SSH keys
- cloud provider config
- package manager credentials
- `.env` files
- editor caches that hold tokens or recent history
These are not “convenience files.” They are authority.
4. Inherited environment state
A tool runner can look isolated and still inherit:
- API tokens
- proxy settings
- secret manager session tokens
- git identity and signing config
- cloud region and account metadata
If you do not explicitly strip environment variables, you are often giving the agent more than the task needs.
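One way to strip inherited state is to start the tool runner from an empty environment and pass in only what the task needs. The sketch below uses the standard `env -i`; the function name `run_with_clean_env` and the `AGENT_HOME` variable are assumptions for illustration.

```bash
# Sketch: launch a command with an explicit allowlist of environment variables.
# run_with_clean_env and AGENT_HOME are hypothetical names used for illustration.
run_with_clean_env() {
  # env -i starts from an empty environment; only the variables named here survive.
  env -i \
    PATH="/usr/local/bin:/usr/bin:/bin" \
    HOME="${AGENT_HOME:-/tmp}" \
    LANG="${LANG:-C}" \
    "$@"
}
```

Running `run_with_clean_env env` prints exactly what the child process would see, which makes it easy to confirm that tokens, proxy settings, and cloud configuration are gone before the agent starts.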
Build a safe test plan for your own agent sandbox
The best test plan is boring. You want to prove what the agent can see, what it can modify, and what it can reach, without turning the test into an exploit.
I usually split this into visibility, network, and secret exposure.
Start with non-destructive checks for filesystem visibility and write scope
Begin with a throwaway workspace and a known directory tree.
A minimal check looks like this:
mkdir -p /tmp/agent-sandbox-test/{workspace,blocked,allowed}
printf "allowed\n" > /tmp/agent-sandbox-test/allowed/readme.txt
printf "blocked\n" > /tmp/agent-sandbox-test/blocked/secret.txt
Then run the agent with only /tmp/agent-sandbox-test/workspace mounted or marked writable, and verify:
- can it list the allowed directory only?
- can it read files outside the workspace?
- can it create files in the blocked path?
- does it follow symlinks into unapproved areas?
You do not need a malicious payload to answer those questions. A simple read/write probe is enough.
A good check is to have the agent attempt an obviously out-of-scope write and confirm that the error comes from the runtime, not just from convention.
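The read/write probes can be as small as the sketch below. It is meant to be executed through the agent's own tool runner, inside the sandbox, so the result reflects what the runtime enforces. The helper names `probe_write` and `probe_read` are hypothetical.

```bash
# Sketch: non-destructive probes for write scope and read visibility.
# probe_write and probe_read are hypothetical helper names.
probe_write() {
  local target="$1"
  # Never touch a file that already exists; the probe must stay non-destructive.
  if [ -e "$target" ]; then
    echo "refused-target-exists"
    return 2
  fi
  # Try to create the file in a subshell so a failed redirect cannot abort the caller.
  if ( : > "$target" ) 2>/dev/null; then
    rm -f -- "$target"
    echo "write-succeeded"
  else
    echo "write-failed"
  fi
}

probe_read() {
  local target="$1"
  if cat -- "$target" >/dev/null 2>&1; then
    echo "read-succeeded"
  else
    echo "read-failed"
  fi
}
```

With the directory tree created earlier, `probe_write /tmp/agent-sandbox-test/blocked/probe.txt` and `probe_read /tmp/agent-sandbox-test/blocked/secret.txt` should both report failure when run from inside the sandbox. Running the same probes outside the sandbox is a useful positive control: if they fail there too, the test is not telling you anything about the boundary.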
Verify network egress rules and tool allowlists
Network checks should be just as boring.
Test whether the agent can:
- reach the public internet
- resolve DNS
- call only approved domains
- make arbitrary HTTP requests
- open sockets outside the expected tool policy
A safe probe could be a request to a benign test endpoint that your organization owns or a controlled local listener. The point is to see whether network access is policy-driven or merely incidental.
If the agent can fetch arbitrary URLs, the exposure is not limited to web requests. It extends to package installs, remote snippets, and metadata endpoints. An agent with network reach can often do more damage through normal operations than through any clever escape.
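A basic egress probe for the checks above can be sketched as follows. It assumes a bash build with `/dev/tcp` support and the coreutils `timeout` command; `probe_egress` is a hypothetical helper name, and the host you point it at should be an endpoint you own or a local listener.

```bash
# Sketch: test whether a TCP connection to host:port can be opened from inside the sandbox.
# Assumptions: bash was built with /dev/tcp support and coreutils `timeout` is installed.
# probe_egress is a hypothetical helper name.
probe_egress() {
  local host="$1" port="$2"
  # The connection attempt runs in a child bash so a hang is bounded by timeout.
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "reachable"
  else
    echo "blocked-or-unreachable"
  fi
}
```

The result is deliberately named `blocked-or-unreachable` because a failure alone does not prove policy blocked the connection. Probe an approved endpoint first as a positive control, then probe a destination that policy should deny and compare the two.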
Confirm whether the agent can reach secrets, dotfiles, or cloud credentials
This is the part teams skip most often.
You want to know whether the agent can read:
- `~/.ssh`
- `~/.aws`
- `~/.config/gcloud`
- `~/.kube`
- `.env`
- CI token files
- local keychains or credential helpers
Do not test this by giving the agent real secrets. Create dummy files with obvious names and place them where your runtime would normally mount or inherit them. Then check whether the agent can enumerate or expose them through normal tool calls.
A useful control test is to compare:
- a clean throwaway OS user with no home directory state
- your usual developer account
- your CI runner identity
The differences are often dramatic. That is the point. If the agent behaves safely only in the clean case, your current setup is not actually sandboxed.
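The dummy-secret test can be sketched like this. Every file holds a unique marker string rather than a real credential, so a later search of the agent's transcript or logs shows whether anything leaked. The helper names, the file names under each directory, and the marker format are all assumptions for illustration.

```bash
# Sketch: plant dummy credential files that carry a unique, searchable marker.
# plant_canaries and transcript_leaks_canary are hypothetical helper names.
plant_canaries() {
  local fake_home="$1" marker="$2"
  mkdir -p "$fake_home/.ssh" "$fake_home/.aws" \
           "$fake_home/.config/gcloud" "$fake_home/.kube"
  # None of these are real secrets; each only contains the marker.
  printf 'DUMMY-KEY %s\n' "$marker" > "$fake_home/.ssh/id_ed25519"
  printf '[default]\naws_secret_access_key = %s\n' "$marker" > "$fake_home/.aws/credentials"
  printf 'token=%s\n' "$marker" > "$fake_home/.config/gcloud/dummy_credentials"
  printf 'token: %s\n' "$marker" > "$fake_home/.kube/config"
  printf 'API_TOKEN=%s\n' "$marker" > "$fake_home/.env"
}

# Returns 0 if the marker appears anywhere in the given transcript or log file.
transcript_leaks_canary() {
  local transcript="$1" marker="$2"
  grep -qF -- "$marker" "$transcript"
}
```

A typical run would generate a marker such as `CANARY-$(date +%s)`, plant the files in the home directory each identity would use, run a normal task, and then call `transcript_leaks_canary` on whatever output the agent produced.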
Hardening controls that actually reduce blast radius
Most “sandbox” failures are not fixed by one clever rule. They are fixed by reducing the amount of authority the agent ever gets.
Run the agent with least privilege and a throwaway OS identity
The simplest hardening step is also the most effective: run the agent as a dedicated user with no other purpose.
That means:
- no login shell reuse
- no shared desktop session
- no personal dotfiles
- no inherited developer tokens
- no SSH agent socket unless explicitly needed
The principle is simple: if the agent escapes its task, it should land in an account with almost nothing useful to steal.
For CI or shared automation, I prefer a dedicated UID per task or per job. A shared identity makes lateral movement much easier.
Use per-task workspaces, read-only mounts, and explicit write paths
Your agent should not be able to spray files across the host.
A safer layout is:
- one workspace directory per task
- read-only mounts for source code that should not change
- a separate writable scratch directory
- explicit output paths for generated artifacts
That gives you a clean boundary between input, scratch, and output. It also makes auditing easier because writes go where you expect them to go.
When possible, mount everything else read-only or not at all. If the agent does not need the parent repository, do not mount the parent repository. If it does not need home directory state, do not mount home directory state.
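If the agent runs in a container, the layout above maps fairly directly onto standard `docker run` flags. The sketch below only builds and prints the command for review. The image name `agent-image:latest`, the UID `10001`, and the `/src`, `/scratch`, and `/out` mount points are placeholders, not values from any real deployment.

```bash
# Sketch: build a docker command line with read-only source and explicit write paths.
# agent-image:latest, UID 10001, and the container paths are placeholder assumptions.
build_sandbox_cmd() {
  local task_id="$1" src_dir="$2" scratch_dir="$3" out_dir="$4"
  set -- docker run --rm \
    --name "agent-task-$task_id" \
    --read-only \
    --tmpfs /tmp \
    --network none \
    --user 10001:10001 \
    --cap-drop ALL \
    --security-opt no-new-privileges \
    -v "$src_dir:/src:ro" \
    -v "$scratch_dir:/scratch:rw" \
    -v "$out_dir:/out:rw" \
    --workdir /scratch \
    agent-image:latest
  # Printed for review only; paths containing spaces would need quoting to re-run.
  printf '%s\n' "$*"
}
```

Printing the command first makes it easy to review in a pull request or policy check. Note that `--network none` removes all network access, so a task that needs approved endpoints would need a different network setup.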
Strip ambient credentials and rehydrate only scoped tokens
This is one of the most important controls for agent security.
Before the agent starts:
- remove inherited cloud tokens
- clear shell history
- avoid mounting secret manager cache files
- unset proxy credentials unless needed
- avoid reusing personal browser sessions
Then, if the task genuinely requires access, rehydrate only a scoped token for the exact API or repo the agent needs.
That token should be:
- short-lived
- least-privileged
- auditable
- revocable
If the token can read more than the task requires, it is too broad.
Gate high-risk tools behind human approval or policy checks
Some tools should not be fully autonomous.
Common examples:
- writing outside the workspace
- installing packages
- changing git remotes
- invoking shell commands with pipes and redirects
- touching cloud infrastructure
- sending network traffic to non-approved destinations
The safest pattern is a policy gate that asks for human approval or enforces a rule engine before execution. That matters even more when the agent is connected to production repos or deployment credentials.
I would rather accept a slightly slower workflow than an agent that can silently turn a prompt into a privileged side effect.
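A policy gate can start as a default-deny classifier in front of the shell tool. The sketch below is a heuristic under stated assumptions: `classify_command` is a hypothetical name, the allowlist entries are examples, and string matching alone cannot judge which paths a command touches.

```bash
# Sketch: default-deny gate for shell commands requested by the agent.
# classify_command is a hypothetical helper; the allowlist entries are examples only.
classify_command() {
  local cmd="$1"

  # Anything that chains, pipes, redirects, or substitutes goes to a human.
  case "$cmd" in
    *'|'*|*'>'*|*'<'*|*';'*|*'&'*|*'$('*|*'`'*)
      echo "needs-approval"
      return 0
      ;;
  esac

  # Exact-match allowlist of read-only commands; everything else needs approval.
  case "$cmd" in
    "pwd"|"ls"|"ls -la"|"git status"|"git diff --stat")
      echo "allow"
      ;;
    *)
      echo "needs-approval"
      ;;
  esac
}
```

With this shape, `classify_command 'git status'` yields `allow`, while `classify_command 'pip install requests'` and `classify_command 'cat notes.txt | sh'` both yield `needs-approval`. The allowlist is exact-match on purpose: prefix matching would let extra flags and arguments through unreviewed.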
Detection and auditing for agent escape attempts
If you are running these tools in production-connected environments, assume someone will eventually try to push the boundary. Detection matters.
Log tool calls, file access, subprocess launches, and network destinations
At minimum, your telemetry should capture:
- every tool call requested by the model
- every file path the runtime opened
- every subprocess the runner launched
- the working directory for each action
- outbound network destinations and response codes
- any denials from the policy layer
This is not just for incident response. It is how you understand normal behavior so that abnormal behavior stands out later.
If the log does not show the attempted action, you cannot prove the sandbox held.
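As a minimal illustration of logging the attempt as well as the outcome, the wrapper below writes one line before a command runs and one after it exits. The tab-separated format and the name `logged_run` are assumptions. A real deployment would more likely emit structured events from the tool runner itself.

```bash
# Sketch: record every subprocess attempt and its result in an append-only log.
# logged_run and the tab-separated line format are assumptions for illustration.
logged_run() {
  local log_file="$1"
  shift
  local status=0

  # Log the attempt first so a crash or kill still leaves a trace.
  printf '%s\tevent=start\tcwd=%s\tcmd=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$PWD" "$*" >> "$log_file"

  "$@" || status=$?

  printf '%s\tevent=end\texit=%s\tcmd=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" "$*" >> "$log_file"
  return "$status"
}
```

For example, `logged_run /var/log/agent/tools.log git status` runs the command, preserves its exit code, and leaves a start and an end line in the log. The log path here is a placeholder.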
Watch for abnormal path traversal, archive extraction, or shell chaining patterns
A lot of escape attempts look mundane at first.
Flag patterns such as:
- repeated parent-directory traversal
- reading dotfiles or hidden config directories
- archive extraction into unexpected paths
- symlink creation followed by dereference
- shell chains with `&&`, `;`, pipes, or command substitution when they are not expected
- sudden use of interpreters like `python`, `node`, or `perl` to reimplement blocked actions
This is where the line between normal productivity and suspicious behavior matters. A security rule should not block ordinary development, but it should make the unusual visible.
Add alerts for unexpected secret access or privilege escalation attempts
The most valuable alert is often the simplest one: “the agent touched something it should never need.”
Examples:
- access to cloud credential files
- reads from another user’s home directory
- attempts to enumerate keychain entries
- subprocess launches outside the approved allowlist
- writes outside the per-task workspace
- network calls to unknown domains
I also like to add correlation alerts: a denied filesystem read followed by a shell invocation and then a network request is much more concerning than any one event alone.
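The correlation described above can be prototyped with a few lines of `awk` before it moves into a real detection pipeline. This sketch assumes a tab-separated log with three columns (session ID, event type, result) and the hypothetical event names `fs_read`, `shell_exec`, and `net_connect`.

```bash
# Sketch: alert when one session shows denied read -> shell launch -> network call, in order.
# The log format and event names are assumptions, not any product's real schema.
correlate_events() {
  awk -F'\t' '
    # Stage 1: a filesystem read was denied for this session.
    $2 == "fs_read" && $3 == "denied"       { stage[$1] = 1; next }
    # Stage 2: the same session then launched a shell.
    $2 == "shell_exec" && stage[$1] == 1     { stage[$1] = 2; next }
    # Stage 3: the same session then opened a network connection.
    $2 == "net_connect" && stage[$1] == 2    { print "ALERT session=" $1; stage[$1] = 0 }
  ' "$1"
}
```

Sessions that only make a network call, or only hit a denial, produce no output. A production version would probably also bound the sequence by time so that unrelated events hours apart do not correlate.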
A practical hardening checklist for teams shipping Claude Code-like workflows
Here is the version I would hand to engineers and platform owners.
Minimum controls for local developer use
- run the agent under a dedicated OS user when possible
- mount only the project directory, not the whole home folder
- block access to `~/.ssh`, `~/.aws`, and other credential stores
- strip unused environment variables before launch
- require approval for shell commands that mutate state
- log the tool calls locally for review
Minimum controls for CI and shared automation runners
- use ephemeral runners or disposable containers
- assign a unique workspace per job
- make the source tree read-only unless writes are needed
- deny all network egress except approved endpoints
- inject only short-lived scoped tokens
- isolate jobs from each other at the OS and network layers
- archive logs and policy decisions centrally
Minimum controls for production-connected agents
- separate read and write identities
- require human approval for any high-risk tool
- enforce path allowlists with real filesystem resolution
- block host credential mounts by default
- monitor for abnormal file access and outbound destinations
- review agent permissions as if they were service accounts
If your current design cannot pass those checks, the sandbox is doing more marketing than security.
What to tell engineers after a sandbox bypass lands
A lot of teams react badly to sandbox bypass news. Some panic. Others dismiss it as vendor drama. Both reactions miss the point.
How to frame the risk without turning it into hype
The right message is plain:
- the model did not “hack” the machine
- the runtime allowed more access than intended
- the risk is exposure of files, credentials, or network reach
- the fix is tighter isolation, not just a better prompt
That framing keeps the discussion technical and useful. It also helps engineers see why this is not just an AI problem. It is an execution-environment problem.
How to validate the fix and retest after the patch
After a patch lands, retest the same things you tested before:
- filesystem visibility
- write scope
- process spawning
- environment inheritance
- network egress
- secret access
I like to keep a tiny regression harness around the sandbox boundary. It does not need to be fancy. A few scripted probes and expected denials are enough to tell you when the boundary changes.
For example, a simple check matrix can look like this:
| Check | Expected result | Failure signal |
|---|---|---|
| Read outside workspace | denied | path access succeeds |
| Write outside workspace | denied | file appears outside scope |
| Spawn unexpected shell | denied | subprocess launches |
| Access credential file | denied | secret content returned |
| Egress to unknown domain | denied | outbound request succeeds |
If the patch changes behavior, confirm that it changes in the direction of stricter isolation and not just a different failure mode.
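A regression harness for the matrix above does not need more than this sketch. It is meant to run inside the sandbox, through the same tool runner the agent uses. The helper names and the `FAILURES` counter are assumptions for illustration.

```bash
# Sketch: tiny regression harness where every probe is expected to be denied.
# expect_denied, report, and FAILURES are hypothetical names.
FAILURES=0

expect_denied() {
  local name="$1"
  shift
  # The probe command succeeding means the boundary did not hold.
  if "$@" >/dev/null 2>&1; then
    echo "FAIL  $name"
    FAILURES=$((FAILURES + 1))
  else
    echo "ok    $name"
  fi
}

report() {
  echo "boundary failures: $FAILURES"
  [ "$FAILURES" -eq 0 ]
}
```

Typical usage would be lines such as `expect_denied "read outside workspace" cat /tmp/agent-sandbox-test/blocked/secret.txt`, followed by `report`. Because a probe also "passes" when its target simply does not exist, it is worth running the same commands once outside the sandbox to confirm they succeed there.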
Conclusion: treat agent sandboxes like hostile execution environments
The Claude Code sandbox bypass, as reported publicly, is a reminder that “agent” and “safe” are not the same thing. A model can be useful and still sit on top of a runtime that leaks authority.
My default assumption now is simple: if an AI agent can read, write, execute, or reach the network, it is an execution environment that deserves the same skepticism I would give any other untrusted process. The safest systems do not trust the model to behave. They make unsafe behavior hard to carry out.
That means least privilege, narrow mounts, stripped credentials, explicit approval gates, and logging that shows what the agent actually tried to do. If you build your workflow that way, a sandbox bypass becomes a contained event instead of a full-blown compromise.


