Here is the uncomfortable part of prompt-injection risk for autonomous agents: you probably cannot stop the agent from being fooled. An agent that reads a web page, a ticket, or a document can be steered by text hidden inside that content, and no filter catches every variant. So the defensive question is not "how do I make the agent un-foolable?" It is "when the agent is fooled, what can it actually do to my infrastructure?" That is a question you can answer, and the answer does not depend on the prompt.
The overlap, defensively framed
Prompt-injection risk is the risk that untrusted input changes what the agent decides to do. The damage only happens when that decision turns into an action against a real system: a query that reads data it should not, a command that deletes something, a call that exfiltrates records. The injection is the trigger. The infrastructure action is the impact.
You contain the impact by controlling the action, on the server side, at the boundary between the agent and the infrastructure. Whatever the agent was talked into wanting, it can only do what its access actually permits at the moment it tries. That containment is independent of how clever the injection was.
Why this has to live outside the agent
If the only thing standing between a poisoned instruction and your database is the agent's own judgment, you have no defense, because the injection targets exactly that judgment. The control that decides what the agent is allowed to do has to sit outside the agent, on the connection, where the manipulated agent cannot remove it. The agent can be convinced to try anything. The gateway it goes through still enforces the same limits.
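As a minimal sketch of what "outside the agent" can look like, the Python below models a gateway whose decision depends only on the requested action and an operator-defined policy. The names `Policy`, `Gateway`, and `authorize` are hypothetical and chosen for illustration; a real deployment would enforce the same idea in a proxy or in the database's own permission system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Limits set by the operator. Frozen, so code holding it cannot edit it."""
    allowed_verbs: frozenset
    allowed_tables: frozenset


class Gateway:
    """Hypothetical enforcement point between the agent and the database."""

    def __init__(self, policy: Policy):
        self._policy = policy

    def authorize(self, verb: str, table: str) -> bool:
        # The decision uses only the requested action and the policy.
        # The prompt and the agent's stated reasons are never consulted.
        return (
            verb.upper() in self._policy.allowed_verbs
            and table in self._policy.allowed_tables
        )


# Assumed example policy: read-only access to a single table.
policy = Policy(
    allowed_verbs=frozenset({"SELECT"}),
    allowed_tables=frozenset({"tickets"}),
)
gateway = Gateway(policy)

print(gateway.authorize("SELECT", "tickets"))    # legitimate task
print(gateway.authorize("DROP", "tickets"))      # injected destructive request
print(gateway.authorize("SELECT", "customers"))  # injected read outside scope
```

Because `authorize` never sees the prompt, a more persuasive injection changes what the agent asks for and not what the gateway allows.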
Practical guidance (server-side)
- Assume the agent can be manipulated, and put the real limits on what its access can do, not on what it can be told.
- Default to no standing access. Grant just in time, scoped to the task, so a hijacked agent holds a narrow grant, not the keys to everything.
- Route destructive or sensitive operations through human approval, so an injected instruction to drop a table or read a sensitive export stops for a person.
- Mask sensitive fields in returned data, so even a successful read does not hand the agent raw secrets to exfiltrate.
- Record every command outside the agent, so you can see what an injection actually attempted.
None of this requires inspecting the prompt or the model's output. It works on the actions, which is where you have real control and certainty.
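The guidance above can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not a reference implementation: `Grant`, `ScopedGateway`, `DESTRUCTIVE`, and `SENSITIVE_FIELDS` are hypothetical names, the `rows` argument stands in for whatever the real backend would return, and `approved` stands in for a human approval step that would live in a separate system.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

DESTRUCTIVE = {"DROP", "DELETE", "TRUNCATE"}  # assumed verbs that need approval
SENSITIVE_FIELDS = {"ssn", "api_key"}         # assumed sensitive column names


@dataclass(frozen=True)
class Grant:
    """A just-in-time grant: one table, a set of verbs, and an expiry time."""
    table: str
    verbs: frozenset
    expires_at: float

    def permits(self, verb: str, table: str, now: float) -> bool:
        return now < self.expires_at and table == self.table and verb in self.verbs


@dataclass
class ScopedGateway:
    grant: Optional[Grant] = None                  # default: no standing access
    audit_log: list = field(default_factory=list)  # kept outside the agent

    def execute(self, verb, table, rows, approved=False, now=None):
        now = time.time() if now is None else now
        if self.grant is None or not self.grant.permits(verb, table, now):
            decision = "denied"
        elif verb in DESTRUCTIVE and not approved:
            decision = "needs_approval"
        else:
            decision = "allowed"
        # Every attempt is recorded, including the ones that were refused.
        self.audit_log.append((now, verb, table, decision))
        if decision != "allowed":
            return decision, []
        # Mask sensitive fields before anything is returned to the agent.
        masked = [
            {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in row.items()}
            for row in rows
        ]
        return decision, masked
```

A short usage example, with fixed timestamps so the expiry is easy to follow:

```python
rows = [{"name": "Ada", "ssn": "123-45-6789"}]
gw = ScopedGateway()
print(gw.execute("SELECT", "users", rows, now=0.0))    # no grant yet

gw.grant = Grant("users", frozenset({"SELECT", "DELETE"}), expires_at=100.0)
print(gw.execute("SELECT", "users", rows, now=1.0))    # allowed, ssn masked
print(gw.execute("DELETE", "users", rows, now=2.0))    # waits for a person
print(gw.execute("SELECT", "users", rows, now=200.0))  # grant has expired
print(len(gw.audit_log))                               # all four attempts recorded
```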
