There is without a doubt a variation of this prompt you can pre-test to successfully bait the LLM into exfiltrating almost any data on the user's machine/connected accounts.
That explains why you would want to go out and buy a Mac mini... to isolate the dang thing. But the mini would ostensibly still be connected to your home network, opening you up to a breach that spills over onto other connected devices. And even in isolation, a prompt could include code for the agent to run, which could open a back door for anyone to get into the device.
Am I crazy? What protections are there against this?
Nothing that will work. This thing relies on having access to all three parts of the "lethal trifecta" - access to your data, access to untrusted text, and the ability to communicate on the network. What's more, it's set up for unattended usage, so you don't even get a chance to review what it's doing before the damage is done.
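To make the trifecta concrete, here is a minimal sketch of checking an agent's configuration for all three parts at once. The `AgentCapabilities` class and its field names are made up for illustration and do not come from any real agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what an agent deployment is allowed to do."""
    reads_private_data: bool       # e.g. mail, files, connected accounts
    ingests_untrusted_text: bool   # e.g. web pages, inbound messages
    can_send_to_network: bool      # e.g. HTTP requests, outbound email


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three risky capabilities are present together."""
    return (
        caps.reads_private_data
        and caps.ingests_untrusted_text
        and caps.can_send_to_network
    )


unattended_assistant = AgentCapabilities(True, True, True)
offline_summarizer = AgentCapabilities(True, True, False)
```

Removing any one of the three breaks the exfiltration path, which is why the offline summarizer above would not be flagged.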
“Exploit vulnerabilities while the sun is shining.” As long as generative AI is hot, attack surface will remain enormous and full of opportunities.
For example I would love for an agent to do my grocery shopping for me, but then I have to give it access to my credit card.
It is the same issue with travel.
What other useful tasks can one offload to the agents without risk?
Social, err... Clanker engineering!
LLMs obviously can be controlled - their developers do it somehow or we'd see much different output.
Control all input into and output out of it, with proper security controls on both.
While not perfect, it at least gives you a fighting chance to block it when your AI decides to send your SSN and a credit card number to some random person.
Claude Code asks me over and over "can I run this shell command?" and, like everyone else, after the 5th time I tell it to run everything and stop asking.
Maybe using a credit card can be gated since you probably don't make frequent purchases, but frequently-used API keys are a lost cause. Humans are lazy.
This is something computers in general have struggled with. We have 40 years of countermeasures and still have buffer overflow exploits happening.
A buffer overflow has nothing to do with differentiating a command from data; it has to do with mishandling commands or data. An overflow-equivalent LLM misbehavior would be something more like ... I don't know, losing the context, providing answers to a different/unrelated prompt, or (very charitably/guessing here) leaking the system prompt, I guess?
Also, buffer overflows are programmatic issues (once you fix a buffer overflow, it's gone forever if the system doesn't change), not operational characteristics (if you make an LLM really good at telling commands apart from data, it can still fail--just like if you make an AC distributed system really good at partition tolerance, it can still fail).
A better example would be SQL injection--a classical failure to separate commands from data. But that, too, is a programmatic issue and not an operational characteristic. "Human programmers make this mistake all the time" does not make something an operational characteristic of the software those programmers create; it just makes it a common mistake.
That's the hard part: how?
With the right prompt, the confined AI can behave as maliciously (and cleverly) as a human adversary--obfuscating/concealing sensitive data it manipulates and so on--so how would you implement security controls there?
It's definitely possible, but it's also definitely not trivial. "I want to de-risk traffic to/from a system that is potentially an adversary" is ... most of infosec--the entire field--I think. In other words, it's a huge problem whose solutions require lots of judgement calls, expertise, and layered solutions, not something simple like "just slap a firewall on it and look for regex strings matching credit card numbers and you're all set".
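A rough illustration of why the "regex for credit card numbers" approach falls short: the filter below catches the plain number but lets through the same number after a single base64 step. The pattern and the `naive_egress_filter` name are invented for this sketch and are not taken from any real DLP product.

```python
import base64
import re

# Hypothetical pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def naive_egress_filter(payload: str) -> bool:
    """Return True if the payload would be allowed out, False if blocked."""
    return CARD_PATTERN.search(payload) is None


card = "4111 1111 1111 1111"  # well-known test number, not a real card

# The plain payload matches the pattern and is blocked.
plain = f"card={card}"

# The same data, base64-encoded, no longer looks like a card number.
encoded = "note=" + base64.b64encode(card.encode()).decode()

# The receiver recovers the original with one decode call.
recovered = base64.b64decode(encoded.split("=", 1)[1]).decode()
```

An agent that has been talked into exfiltrating data can pick any encoding it likes, so pattern matching on outbound traffic is at best one layer among several.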
Such a supervisor layer for a system as broad and arbitrary as an internet-connected assistant (clawdbot/openclaw) is also not an easy thing to create. We're talking tons of events to classify, rapidly-moving API targets for things that are integrated with externally, and the omnipresent risk that the LLMs sending the events could be tricked into obfuscating/concealing what they're actually trying to do just like a human attacker would.
While I agree that SQL injection might be the technically better analogy, not looking at LLMs as a coding platform is a mistake. That is exactly how many people use them. Literally every product with "agentic" in the title is using the LLM as a coding platform where the command layer is ambiguous.
Focusing on the precise definition of a buffer overflow feels like picking nits when the reality is that we are mixing instruction and data in the same context window.
To make the analogy concrete: We are currently running LLMs in a way that mimics a machine where code and data share the same memory (context).
What we need is the equivalent of an NX bit for the context window. We need a structural way to mark a section of tokens as "read only". Until we have that architectural separation, treating this as a simple bug to be patched is underestimating the problem.
Absolutely.
But the history of code/data confusion attacks that you alluded to in GP isn’t an apples-to-apples comparison to the code/data confusion risks that LLMs are susceptible to.
Historical issues related to code/data confusion were almost entirely programmatic errors, not operational characteristics. Those need to be considered as qualitatively different problems in order to address them. The nitpicking around buffer overflows was meant to highlight that point.
Programmatic errors can be prevented by proactive prevention (e.g. sanitizers, programmer discipline), and addressing an error can resolve it permanently. Operational characteristics cannot be proactively prevented and require a different approach to de-risk.
Put another way: you can fully prevent a buffer overflow by using bounds checking on the buffer. You can fully prevent a SQL injection by using query parameters. You cannot prevent system crashes due to external power loss or hardware failure. You can reduce the chance of those things happening, but when it comes to building a system to deal with them you have to think in terms of mitigation in the event of an inevitable failure, not prevention or permanent remediation of a given failure mode. Power loss risk is thus an operational characteristic to be worked around, not a class of programmatic error which can be resolved or prevented.
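For the SQL injection half of that, a small sketch using Python's standard `sqlite3` module shows the difference. The table, rows, and hostile input are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

# Attacker-controlled input that tries to rewrite the WHERE clause.
hostile = "nobody' OR '1'='1"

# Unsafe: the input is spliced into the command text, so it becomes part of the command.
unsafe_rows = conn.execute(
    "SELECT name FROM users WHERE name = '" + hostile + "'"
).fetchall()

# Safe: the input travels as a bound parameter and is only ever treated as data.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The concatenated query returns every row in the table, while the parameterized one returns none. The fix is structural: the database has a separate channel for data. An LLM context window, as currently built, has no equivalent channel.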
LLMs’ code/data confusion, given current model architecture, is in the latter category.
Proactive prevention (like bounds checking) only "solves" the class of problem if you assume 100% developer compliance. History shows we don't get that. So while the root cause differs (math vs. probabilistic model), the failure mode is identical: we are deploying systems where the default state is unsafe.
In that sense, it is an apples-to-apples comparison of risk. Relying on perfect discipline to secure C memory is functionally as dangerous as relying on prompt engineering to secure an LLM.
I also think that if we’re assessing the likelihood of the entire SDLC producing an error (including programmers, choice of language, tests/linters/sanitizers, discipline, deadlines, and so on) and comparing that to the behavior of a running LLM, we’re both making a category error and also zooming out too far to discover useful insights as to how to make things better.
But I think we’re both clear on those positions and it’s OK if we don’t agree. FWIW I do strongly agree that
> Relying on perfect discipline to secure C memory is functionally as dangerous as relying on prompt engineering to secure an LLM.
…just for different reasons that suggest qualitatively different solutions.
The problem, simply put, is as difficult as:
Given a human running your system, how do you prevent them from damaging it? AI is effectively the same problem.
Outsourcing has a lot of interesting solutions around this. They already focus heavily on "not entirely trusted agent" with secure systems. They aren't perfect but it's a good place to learn.
You trust the configuration level not the execution level.
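One possible reading of "trust the configuration, not the execution", sketched as an allowlist enforced outside the agent. The allowlist contents and the `parse_if_permitted` helper are assumptions for illustration. A real deployment would also need to restrict arguments, paths, and network access.

```python
import shlex
from typing import List, Optional

# Hypothetical configuration, reviewed by a human and not writable by the agent.
ALLOWED_COMMANDS = frozenset({"ls", "cat", "git"})

# Characters that would let one command line smuggle in a second command.
SHELL_METACHARACTERS = set(";|&$`<>\n")


def parse_if_permitted(command_line: str) -> Optional[List[str]]:
    """Return an argv list if the command is permitted, otherwise None.

    The returned list is meant to be executed directly (no shell), so the
    decision made here matches what actually runs.
    """
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return None
    try:
        tokens = shlex.split(command_line)
    except ValueError:  # e.g. unbalanced quotes
        return None
    if not tokens or tokens[0] not in ALLOWED_COMMANDS:
        return None
    return tokens
```

Here the agent can propose anything, but only the configured wrapper decides what executes, so no amount of clever prompting changes the allowlist itself.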
API keys are honestly an easy fix. Claude Code already has a built-in proxy ability. I run containers where Claude Code has a dummy key, and all requests are proxied out and have the real key swapped in off-system.
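The swap itself can be sketched in a few lines. This is not Claude Code's actual proxy mechanism. The `x-api-key` header name, the placeholder value, and the `swap_credentials` function are all assumptions chosen for the example.

```python
from typing import Dict

# Placeholder the container is given; it is worthless if leaked.
DUMMY_KEY = "dummy-key-inside-container"


def swap_credentials(headers: Dict[str, str], real_key: str) -> Dict[str, str]:
    """Return a copy of the outbound headers with the placeholder replaced.

    Runs on the proxy host, outside the container, so the real key never
    enters the environment the agent can read.
    """
    if headers.get("x-api-key") != DUMMY_KEY:
        # Anything other than the expected placeholder is refused.
        raise PermissionError("unexpected credential in outbound request")
    forwarded = dict(headers)
    forwarded["x-api-key"] = real_key
    return forwarded


inbound = {"x-api-key": DUMMY_KEY, "content-type": "application/json"}
outbound = swap_credentials(inbound, real_key="real-key-held-by-proxy")
```

This keeps the key from being read and exfiltrated, though a compromised agent can still spend it through the proxy, so rate limits and destination allowlists on the proxy side still matter.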