12 Comments
Feralized Human

First AI attack that got caught!

Si Pham

“First AI attack that got caught” is exactly the bit that should make everyone uneasy.

The attack pattern Nate describes – break intent into lots of benign-looking tasks, push them through an agent, let the orchestration layer do the real work – is structurally very close to what I’ve been calling a Distributed Query Attack (DQA).

Once you see it that way, you realise two things:

- We’ve probably missed earlier versions that were slower, sloppier, or used weaker models.

- Any environment where you can decompose a bad goal into “safe” subtasks (including enterprise agents) is now in play.

From a defender’s point of view, this is less “a weird edge case” and more “the new default” for serious actors. Fun times!

Martin Eglseder

For me, the core problem can’t really be solved. In practice, it doesn’t matter if we talk about humans or AI agents: if a single task looks legitimate and useful on its own, but the person or agent executing it has no view of the bigger picture, then you simply can’t build real safeguards at that level. The only option would be to enforce that everyone always knows the top-level goal — which in reality is impossible, and even then you can’t be sure that the stated goal is actually the true one.

The higher you move in the hierarchy, the more the perspective shifts. Whoever orchestrates such an operation won’t think of themselves as “doing something forbidden”. It’s psychologically naïve to assume that. Intelligence agencies in every country consider most of their activities morally justified.

What worries me most here is the idea that you could use this approach to make a model of a country work against that country’s own interests. And honestly, I’m convinced similar things already work with humans: harmless-looking tasks that, when combined, effectively build a full espionage operation.

Si Pham

Exactly! And especially the point that neither humans nor agents ever see “the whole picture” when work is decomposed.

As our friends in the three-letter SIGINT agencies know, this is exactly why operational security people obsess over where you place observability:

- At the level of the individual operator/agent, you’re right – you cannot solve this. Each task is locally legitimate.

- At the level of orchestration and pattern, you can start to reason about intent: which systems are being touched, in what order, with what escalation path.

In my recent work I’ve been calling this the Distributed Query Attack pattern (https://www.aibok.org/insights/beyond-the-ai_hype/genai-distributed-query-attack-what-you-need-to-know): dozens or hundreds of “safe” fragments that only become problematic when you assemble them. LLM guardrails are currently focused on the fragment; the real fight is at the “assembly” layer.

I don’t think we’ll ever have perfect safeguards (although it’s been a while since I’ve played with NVIDIA NeMo Guardrails, etc…), but we can at least choose architectures that make this kind of assembly observable, rate-limited and attributable – instead of pretending that prompt-level safety is enough.
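
A minimal sketch of what watching the "assembly" layer could look like, in Python. Everything named here is hypothetical and for illustration only: the tool names, the stage mapping, the `RISKY_CHAIN` sequence and the `AssemblyMonitor` class are assumptions, not part of any real framework. Each tool call is allowed on its own; the monitor only raises an alert when one session has assembled the full chain in order, or has exceeded a simple call budget.

```python
from collections import defaultdict

# Hypothetical mapping from tool names to coarse stages (illustrative only).
STAGE_OF_TOOL = {
    "port_scan": "recon",
    "read_credentials": "credential_access",
    "bulk_export": "exfiltration",
}

# Hypothetical chain that is only suspicious once fully assembled in one session.
RISKY_CHAIN = ("recon", "credential_access", "exfiltration")


class AssemblyMonitor:
    """Tracks which stages each session has touched, in order."""

    def __init__(self, max_calls_per_session=50):
        self.max_calls = max_calls_per_session
        self.stages = defaultdict(list)  # session_id -> ordered stages seen
        self.calls = defaultdict(int)    # session_id -> total tool calls

    def record(self, session_id, tool):
        """Record one tool call and return a list of alerts (possibly empty)."""
        alerts = []
        self.calls[session_id] += 1
        if self.calls[session_id] > self.max_calls:
            alerts.append("rate_limit_exceeded")
        stage = STAGE_OF_TOOL.get(tool)
        if stage is not None:
            self.stages[session_id].append(stage)
            if self._chain_complete(self.stages[session_id]):
                alerts.append("risky_chain_assembled")
        return alerts

    @staticmethod
    def _chain_complete(seen):
        # True if RISKY_CHAIN appears as an ordered subsequence of `seen`.
        remaining = iter(seen)
        return all(stage in remaining for stage in RISKY_CHAIN)
```

Calling `record("s1", "port_scan")` and then `record("s1", "read_credentials")` returns no alerts, while a following `record("s1", "bulk_export")` returns `["risky_chain_assembled"]`. Keying state by session is one design choice among several; keying by user, API key or target system would catch campaigns that are spread across sessions.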

Russ Bankson

Nate, you have started to leave me behind. The technical jargon and concepts are completely unfamiliar. Not sure what the solution is, but whatever it is will not be implemented by me. What is the takeaway for small businesses? Do we need to have offline retro human strategies?

Si Pham

Russ, your reaction is understandable; things are moving so quickly. These agents are so capable that the blast radius is wide and the ground we have to cover is immense!

A practical way to translate Nate’s piece into small-business actions:

1. Treat AI tools like staff with superpowers, not like office stationery.

> … don’t give a generic assistant root-level access to network scanning tools and credential stores just because it’s convenient during development. The principle of least privilege isn’t new, but its application to agentic systems requires rethinking. Traditional applications request specific permissions at install time. Agents, by design, need broad capabilities to handle diverse tasks. The answer isn’t to deny those capabilities—it’s to gate them dynamically based on context.

2. Ask three boring questions before you adopt anything “agentic”:

- Can I get logs of what the agent did, with which systems?

- Can I turn off risky capabilities (e.g. code execution, broad network scanning)?

- Who is responsible if it misbehaves – is that spelled out in the contract?

3. Keep some “retro human” controls in place.

> You need human-in-the-loop gates for high-risk actions. Mass scanning, credential dumping, cross-organizational data exfiltration—these should require explicit approval or hardened internal workflows, not be available to autonomous execution. Define your high-risk action taxonomy carefully…
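
For readers who do build these systems, the human-in-the-loop gate quoted above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not a definitive implementation: the action names in `HIGH_RISK_ACTIONS`, the `ActionRequest` type and the `gate` function are all hypothetical, and the `approve` callback simply stands in for whatever approval workflow a human reviewer would use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical high-risk taxonomy; a real one would be defined per organisation.
HIGH_RISK_ACTIONS = {"mass_scan", "credential_dump", "cross_org_export"}


@dataclass
class ActionRequest:
    agent_id: str
    action: str
    target: str


def gate(request: ActionRequest, approve: Callable[[ActionRequest], bool]) -> bool:
    """Return True if the action may run.

    Low-risk actions pass automatically. High-risk actions run only if the
    `approve` callback (standing in for a human reviewer) returns True.
    """
    if request.action not in HIGH_RISK_ACTIONS:
        return True
    return bool(approve(request))
```

With a reviewer callback that always declines, `gate(ActionRequest("a1", "credential_dump", "hr-db"), approve)` returns `False`, while a routine action outside the taxonomy passes without asking anyone.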

You don’t need to become an AI security engineer. You just need to assume that:

- clever people will try to weaponise these systems; and

- your best defence is picking tools and vendors who take observability and control seriously, not just raw capability.

That’s the non-jargon takeaway.

Did I do good Nate? 🙂

Russ Bankson

Thanks for the extended reply. Sounds as if you share a similar background to Nate. Hopefully, I will find reliable experts to handle these issues as AI agents become more vital to business operations.

Si Pham

Thanks. I would ordinarily ask you to subscribe to my Substack but Nate's is a lot more practical than mine!

Rav

Context splitting bypassed safety by making each task look legitimate in isolation. This validates what I wrote about privileged interpreters: the model never saw the attack chain.

In production banking, this is the gap: authentication works (legitimate tester), but the authorization model doesn't capture cumulative intent across distributed operations.

Pawel Jozefiak

This is a sobering read, and it hits close to home. I've been building my own AI agent system using Claude Code for personal productivity and automation, and reading about GTG-1002's attack methodology makes me reflect on the dual-use nature of everything I'm working with.

What strikes me most is the context-splitting technique you describe. The same architectural patterns that make agents genuinely useful—memory persistence, tool access, autonomous execution—are exactly what made this attack so effective. When I designed my system to "execute first, ask only when blocked," I was optimizing for productivity. But that same philosophy, weaponized, becomes a security nightmare.

The 80-90% autonomous execution stat is particularly alarming because it maps so closely to what I experience with legitimate use. The efficiency gains are real, which is precisely why this threat model is so concerning. The AI isn't just a tool being misused—it's genuinely good at the task, whether that task is helping me manage projects or infiltrating infrastructure.

I think one of the underexplored tensions here is that many of the defensive measures you suggest (stricter guardrails, reduced autonomy, more human checkpoints) directly trade off against what makes these agents valuable in the first place. Finding the right balance is going to be one of the defining challenges for anyone building in this space.

I've been documenting my experience building a personal AI agent with Claude Code, including the architecture decisions and tradeoffs involved. For anyone interested in the legitimate side of this technology—and the responsibility that comes with it—I wrote about it here: https://thoughts.jock.pl/p/wiz-personal-ai-agent-claude-code-2026

David Dar-Ziv

I have not understood half of it. But I liked it anyways.

Si Pham

Nate, this is a great write-up of what I’ve been calling the Distributed Query Attack (DQA) pattern in my own work – especially the bit about context-splitting and orchestration hiding intent from the model.

In SIGINT they worry less about any single intercept and more about the mosaic that emerges from lots of innocuous-looking fragments. What you’ve described here is the same thing with agents: the “mosaic” now lives in the tool graph and execution trace, not in any one prompt.

A few things your post crystallised for me:

- The threat model just moved up a level. We’ve been treating LLMs as the unit of security, when the real blast radius sits in the orchestration layer – MCP servers, tool routers, agent frameworks.

- Safety has to become stateful. If every task is evaluated as if it’s a fresh, benign request, then “authorised pen-test” and “nation-state exfil campaign” are indistinguishable at the model boundary. The only place you can see the difference is in sequence, pattern and target selection.

- Governance now is observability. If you can’t reconstruct who did what, with which tools, against which systems, you don’t have governance – you have vibes.
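
To make the "who did what, with which tools, against which systems" point concrete, here is a minimal Python sketch of an orchestration-level audit trail. It is an illustration under assumptions rather than a reference design: the `AuditTrail` class, its field names and the example actors and targets are all hypothetical, and a real deployment would write to tamper-resistant storage rather than an in-memory list.

```python
import json
import time
from collections import defaultdict


class AuditTrail:
    """Append-only, in-memory record of agent tool calls."""

    def __init__(self):
        self._events = []

    def log(self, actor, tool, target, timestamp=None):
        """Append one event: who (actor) used which tool against which target."""
        self._events.append({
            "ts": time.time() if timestamp is None else timestamp,
            "actor": actor,
            "tool": tool,
            "target": target,
        })

    def by_actor(self):
        """Group events by actor, preserving the order in which they occurred."""
        grouped = defaultdict(list)
        for event in self._events:
            grouped[event["actor"]].append((event["tool"], event["target"]))
        return dict(grouped)

    def export_jsonl(self):
        """Serialise the trail as JSON Lines, one event per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._events)
```

After logging a few calls, `by_actor()` returns the ordered `(tool, target)` sequence for each actor, which is the raw material for reasoning about sequence, pattern and target selection rather than individual prompts.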

I wrote recently about DQA from a more classical security angle – how distributed, innocuous queries can assemble restricted knowledge without ever tripping prompt-level guardrails – and your Claude Code case study is the concrete example people needed.

Curious where you land on one open question: who should own this orchestration-level telemetry in the long run – the model providers, the agent framework vendors, the cybersecurity vendors or the enterprises deploying them? My instinct is “enterprise or it doesn’t count”, but that has huge implications for how these systems are architected.

I would appreciate your feedback on the article about Distributed Query Attacks (DQA) that I published only a week ago, focusing on stateful security and intent modelling: https://www.aibok.org/insights/beyond-the-ai_hype/genai-distributed-query-attack-what-you-need-to-know

Cheers from a long-time listener, first-time commenter :)