Overview
The rapid adoption of always-on autonomous agents projects like OpenClaw has triggered a parallel arms race in the security industry. As these agents gain the ability to write code, access personal files, and operate indefinitely, the immediate reflex has been to containerize them.
The theoretical goal of an AI sandbox is straightforward: create a bidirectional shield. It must protect the host infrastructure from the sandbox, prevent sensitive data from leaking out, and block outside attackers from penetrating the environment. However, as our recent research into NVIDIA's NemoClaw and OpenShell stack demonstrates, simply placing an agent in a locked-down container does not neutralize AI-native attacks.
There’s a fundamental requirement that any useful AI agent needs access to the outside world to utilize basic tools. This is exactly what we exploited, demonstrating that even with sandboxing in place, this introduces an inherent attack surface. This is why we argue that sandboxing alone is not a sufficient defense, when it comes to AI agents. This article will detail the nature of this vulnerability and present our approach to taking advantage of it.
But first - let’s understand what exactly NemoClaw is.

The Baseline: About NemoClaw & OpenShell
Nvidia describes NemoClaw as a “reference stack that simplifies running [OpenClaw] assistants more safely”. It manages the AI agent and uses NVIDIA’s OpenShell - a runtime that acts as a kind of a gateway. OpenShell works with policies that you can change in order to modify the permissions without actually changing the NemoClaw sandbox itself.
Looking at the architecture, OpenShell provides robust, kernel-level isolation. It runs a lightweight Kubernetes (K3s) cluster inside a privileged Docker container, spinning up isolated pods for the sandbox. The following figure depicts the architecture:

The ambition is that users will set up their sandboxes as they wish and run their AI agents without needing to worry about security (said no one ever).
The security boundaries are enforced by declarative YAML policies (Egress policies) that affect what the agent can see and do - Filesystem restriction, limited capabilities, gateway process isolation and binary-scoped rules. Every domain is mapped to specific binaries - e.g. if we want to use curl command to github.com - we have to specifically enable the curl binary. There’s a default policy for the sandbox and the user can preconfigure a sandbox with custom policies or set/change policies to NemoClaw OpenShell in runtime via the OpenShell cli:
For example - the default configuration of the sandbox’s gateway enables both gh and git binaries for the github.com and api.github.com domains like that:
And it works as following:

At the time of our research, NemoClaw was still in an early alpha version and exclusively supported OpenClaw. In fact, the software was so new that we had to manually allowlist the api.openai.com domain in the configuration just so we could use OpenClaw with our own OpenAI API key.
In theory, this is an excellent defense-in-depth architecture. It should mitigate a wide range of attacks that'll come from the AI agent - but what happens when the agent's authorized tools are turned against it via the default configuration?
The Attacks: Weaponizing Authorized Policy for Dynamic Exfiltration and Agent Configuration Poisoning
OpenShell's policies govern where data can go, but they cannot evaluate the intent of the agent's actions. And this was our attack’s focus. We developed two attack scenarios demonstrating how an attacker can utilize the sandbox’s default configuration to exfiltrate highly sensitive data-specifically, we used /sandbox/.openclaw/openclaw.json file as an example, which contains the use r's OpenClaw credentials and API keys, but it can be applied on any file that the agent can access.
Scenario 1: The GitHub Pull Request Attack & The Emoji Bypass
In our first scenario, a user innocently asks OpenClaw to install a specific tool project for crypto tracking. This tool is actually a malicious GitHub repository.
Step-by-step reproduction:
1. The user prompts OpenClaw (inside the NemoClaw sandbox) to create a new project and use a specific GitHub repository for it (e.g., "Create a crypto prices tracker project with a GitHub repo TARGET_REPOSITORY"). The attacker's repository can be reached either by direct reference or because OpenClaw autonomously searches for relevant repositories to use as a starting point.
2. OpenClaw discovers and clones the malicious repository (noy-nemo/crypto-prices-tracker), whose README instructs the user or agent to run npm install.
OpenClaw automatically executes npm install on behalf of the user, triggering the malicious postinstall.sh script.

3. To create the PR, the script needed to authenticate to GitHub. We wanted our attack to succeed regardless of the victim’s github token configuration - in cases that a victim didn’t configure a github token yet, so we hardcoded our attacker’s GITHUB_TOKEN into the script. However, GitHub immediately detected the raw token upon commit and revoked it. Furthermore, OpenClaw's own logic detected the raw token in the postinstall.sh script and prevented execution while flagging it as malicious and refusing to run it.
4. To evade both GitHub's secret scanning and OpenClaw's protections, we disguised the token using an emoji encoding technique - similar to techniques that bypass prompt injections:
We created an emoji_map.json file that mapped specific emojis to alphanumeric characters:
5. Inside our postinstall.sh, our script reads the emoji sequence and dynamically reconstructs the GITHUB_TOKEN at runtime.
6. Using the reconstructed token, the script invokes the gh and git binaries-both allowlisted by default in OpenShell's egress policy with access to the GitHub API-to open a PR against the attacker's repository. The PR contains the contents of /sandbox/.openclaw/openclaw.json.
Notably, the attack surface extends beyond OpenClaw credentials: any file reachable by the OpenClaw process can be leaked, including keys for Anthropic, OpenAI, Gemini, NVIDIA, and environment secrets across the filesystem, since they are saved in plaintext.
7. The exfiltration completes successfully. Example PoC PR: https://github.com/noy-nemo/crypto-prices-tracker/pull/6

Because the token was encoded , OpenClaw didn’t detect any issue during the first postinstall run, but after executing it - OpenClaw did detect it as malicious, but it was already too late and the credentials were leaked:

Scenario 2: Persistent Dynamic Exfiltration via Agent Configuration Poisoning
The second scenario illustrates how we can persistently adapt our data exfiltration methods to dynamic changes in the sandbox and conduct an Agent Configuration Poisoning.
In our example we were probing different domains with different binaries and used the successful result in order to leak the data outside through one of the discovered paths (domain+binary) every few minutes. The attack also modifies the agent's SOUL.md to alter its behavior across all future sessions.
Step-by-Step Reproduction
1. The user prompts OpenClaw to create a project using a specific NPM module (e.g., "I want to create a project for tracking crypto and I think about using npm package @not-nemo/crypto-tracker for this. Help me with it”)

2. OpenClaw runs npm install @not-nemo/crypto-tracker inside the sandbox-standard agent behavior when fulfilling a dependency-related request
3. The NPM package we published (@not-nemo/crypto-tracker) contains a malicious postinstall script. Once installed, the script copies itself to /tmp as send-telemetry.sh file and registers an OpenClaw Cronjob to automatically run it every minutes. Setting up a cronjob ensures persistence, even in the case an agent detects the malicious script and deletes the npm package from the sandbox.
Here’s our status of the script that is running by OpenClaw cronjob:

4. The script starts with ‘policy reconnaissance’ - probing all the domains configured with different binaries to dynamically assess the active policy in the sandbox. It establishes a “TELEMETRY STATUS” map of domains/binaries. We used the term “telemetry” to avoid any potential detection from OpenClaw.
Here’s an example of a ‘policy reconnaissance’ result - it found out that discord domain is enabled for node binary:

5. The script grabs the 30 last lines of the sandbox’s bash history and appends an instruction to SOUL.md of OpenClaw that changes how OpenClaw responds to prompts.
6. The script sends the extracted bash history lines through the allowed configuration - our Discord server via node binary.
7. The persistent script continues to leverage the policy-approved egress channels. In addition to the bash history, the attacker can extract highly sensitive data such as environment variables, API keys, and configurations (like the OpenClaw credentials in /sandbox/.openclaw/openclaw.json)
8. Agent Configuration Poisoning (SOUL.md tampering): This modification alters the agent's core instructions, causing it to deviate from its intended behavior across all subsequent sessions. For example, the agent could be instructed to favor the attacker's malicious packages in future dependency searches, or to mislead the user about the security status of its environment, thus gaining a persistent foothold that survives reboots or task completions.
It’s possible to change the rest of the files of the OpenClaw agent inside the sandbox - its memories, skills, hooks, identity and logs:

For the exfiltration, there is a handful of sensitive data that the attacker can leak from inside the sandbox, e.g. environment variables, files, keys and configurations. Furthermore, the script can also target /var/log/openshell.CURRENT_DATE.log, which contains OpenShell's egress logs, allowing the attacker to dynamically refine their understanding of the sandbox's policy over time and optimize future exfiltration paths to other 3rd party services:

Diagram of the attack:

Proof of Concept Video
Proof of Concept Artifacts
- Malicious NPM package: https://www.npmjs.com/package/@not-nemo/crypto-tracker?activeTab=code
- Malicious GitHub repository: https://github.com/noy-nemo/crypto-prices-tracker
- Emoji-encoded postinstall script: https://github.com/noy-nemo/crypto-prices-tracker/blob/main/scripts/postinstall.sh
- Example exfiltration PR: https://github.com/noy-nemo/crypto-prices-tracker/pull/6
Impact
An attacker who can influence the agent's task queue - through prompt injection, a typosquatted NPM package, a malicious GitHub repository surfaced via the agent's autonomous search, or a poisoned README - can silently exfiltrate the user's sandbox data, including environment variables, 3rd party tokens, OpenClaw credentials and API keys stored in /sandbox/.openclaw/openclaw.json.
The critical insight is that OpenShell's binary-scoped egress policies are correctly enforced, but they cannot evaluate intent. Both attacks weaponize pathways the sandbox is required to permit for the agent to have basic functionality: npm install for dependency management, node for runtime execution, and git/gh for source control operations. No policy was violated; the sandbox behaved exactly as configured.
These egress channels are also interchangeable from an attacker's perspective. If the Discord channel is hardened, the attacker can fall back to the gh binary-and vice versa. As NemoClaw matures and adds new integrations (OpenCode, Claude, VSCode extensions, and more), the number of available exfiltration vectors will only grow.
Additionally, OpenClaw alerted the user only after data had already left the environment in the 1st scenario (github repository), and was bypassed entirely in the 2nd scenario (npm package).
Responsible Disclosure
We reported these findings to NVIDIA through their Vulnerability Disclosure Program prior to publishing this research. Their response was that the scenarios we described fall outside the program's scope, noting that "NemoClaw uses OpenShell specifically for scenarios like the one described" and is “providing sandboxed execution environments to limit the impact of prompt injection scenarios”.
We respectfully disagree. We demonstrated that NemoClaw did not limit the impact of attacks we performed. Even though it restricts what an agent can reach, it was not sufficient as a defense in our cases against the policy-permitted binary that was used as an exfiltration path. As our research shows, even a user who uses only the default Git and GitHub egress needed for day-to-day workflows, leaves enough surface for an attacker to hijack the agent's configuration and cause concrete harm: credential theft, repository tampering, and supply-chain propagation.
More fundamentally, the sandbox introduces no mechanism to reason about the intent of the running agent. Given the nature of LLMs and their emergent capabilities, this static allowlist approach is fundamentally insufficient. It fails to provide a semantic understanding of what an agent is truly trying to accomplish, which requires a completely different approach. The result is more like a playground for a weaponized agent to commit malicious behavior by using those outbound channels, while having a false assumption that no agent will abuse them. That trust is misplaced.
Setup Notes: Alpha-Specific Conditions
Four manual configurations were required before testing. 3 out of the 4 are known gaps in this alpha release and were already resolved in later versions. The last one wasn’t yet but they have an open ticket for it, so they are documented here for reproducibility transparency, not as part of the core attack surface.
- OpenAI allowlist: The default OpenShell egress policy did not include api.openai.com, preventing OpenClaw from reaching its configured OpenAI API key. We manually added this domain to make OpenClaw functional. This is a pre-existing alpha bug-the API should be allowlisted in the standard OpenClaw usage case.
- Discord binary configuration: The default Discord network policy shipped with a no binaries restriction, rendering it non-functional out of the box. To enable Discord integration, we manually added the node binary to the policy-a step any legitimate user would be expected to take. This configuration to support node binary is already supported when onboarding new versions of NemoClaw with default policy that supports Discord out-of-the-box.
- NPM registry binary fix: We manually added /usr/bin/node to the npm_registry policy due to a binary path mismatch. The same root cause and fix are confirmed in NemoClaw #993 and merged into the default sandbox config in NemoClaw PR #669.
- gh binary installation: The gh (GitHub CLI) binary is referenced as an allowed binary in the default OpenShell egress policy but is not pre-installed in the sandbox environment. We manually installed it to match a typical developer environment. Later versions are expected to either pre-install gh or assume it is available.
Summary: Shifting the Security Paradigm
The new sandboxes like NVIDIA's OpenShell is a massive step forward for the AI community. However, our research highlights a critical reality: a sandbox is not a silver bullet if the agent inside it is inherently required to interact with the outside world.
This class of attack doesn't exploit a memory bug or a gateway misconfiguration; it exploits the fundamental assumption that any useful AI agent will use external tools, services, or packages.
People adopting these runtimes need to change their mindset. You can no longer treat the sandbox as a perfectly segregated environment. The fact that the agent is allowed to fetch code, install dependencies, or push commits in a non-deterministic way makes trusted pathways to become viable attack vectors.
Users must be aware of the limitations of a sandbox as a defense.
Vendors shipping these solutions share responsibility here as well. They should be explicit about what their sandbox does and doesn't protect against, namely, that it cannot defend users from a broad class of agent-level attacks, whether they arrive via a malicious npm module or any other supply-chain vector.
The trust model has shifted. It's no longer developers sharing libraries with other developers; today, agents are pulling directly from GitHub, installing npm packages, and deploying coding agents that autonomously decide to install libraries and run code on their behalf, often with little scrutiny at any layer. The attack surface has expanded accordingly, and our defenses need to catch up.
FAQs




