Back to Insights
Treat AI Agents Like Untrusted Insiders, Not Magic
·9 min read

Treat AI Agents Like Untrusted Insiders, Not Magic

Most agent security conversations start in the wrong place: with the model. Teams ask whether the LLM is 'aligned' or 'safe', as if security were a property you could buy from a foundation model vendor. DeepMind's recently published AI Control Roadmap reframes the problem correctly — you secure an agent the way you secure a contractor with system access, by limiting what it can touch and watching what it actually does. The shift from trusting the model to controlling the agent is the most important mental-model upgrade for any leader deploying agents this year.

By Daniel Usvyat · Founder & Principal, USQRD

Share

Key takeaways

  • Model alignment is necessary but not a security control — an agent that can act inside your systems should be governed like an untrusted insider, not trusted because it sounds helpful.
  • DeepMind's AI Control Roadmap pairs traditional safeguards (least privilege, sandboxing) with real-time monitoring — a defence-in-depth posture, not a single 'aligned model' bet.
  • The two failure modes that matter to a board are confused-deputy attacks (the agent tricked into misusing its access) and over-broad permissions that turn a small mistake into a large incident.
  • Security for agents is an operating-model decision, not just an engineering one: who owns the permission boundary, who watches the monitors, and who can pull the plug.
  • If you can't observe and constrain what an agent does at runtime, you don't have a deployable agent — you have a liability waiting for the wrong prompt.

Alignment Is a Hope. Control Is a Control.

The instinct to lean on model alignment is understandable — it's the thing the vendors talk about, and it feels like the model is the system. But alignment is a probabilistic property of how a model tends to behave. Security is about what happens in the worst case, when the model behaves badly, gets manipulated, or simply misunderstands an instruction with consequences.

No serious security team would grant a new contractor admin access to production on the basis that they seem trustworthy. They'd scope their permissions to the task, log their actions, and revoke access the moment something looked off. An autonomous agent that can send emails, move money, query customer data, or call internal APIs deserves exactly the same treatment — and for the same reason. The agent is a non-human actor operating inside your trust boundary, and it can be wrong or weaponised.

DeepMind's roadmap makes this explicit by combining classical safeguards — least privilege, sandboxing, access controls — with real-time monitoring of agent behaviour. That combination is the whole point. You assume the model will eventually do something you didn't intend, and you build the system so that when it does, the blast radius is small and someone notices.

You secure an agent the way you secure a contractor with system access — not by trusting it, but by constraining and watching it.

The Two Failure Modes a Board Should Care About

Strip away the jargon and agent security risk concentrates in two places. The first is the confused-deputy problem: the agent has legitimate access, and an attacker — through a poisoned document, a malicious email it's asked to summarise, or a crafted user request — tricks it into using that access against you. The agent isn't compromised in the traditional sense; it's doing what it was told, by the wrong person. Prompt injection is the headline example, and it has no clean fix at the model layer.

The second is over-broad permissions. An agent built to draft replies gets read-write access to the whole CRM because that was easier than scoping it. A coding agent gets a credential that can reach production. Now a small misjudgement — the kind any system makes — becomes a data exfiltration event or an outage. The damage isn't proportional to the mistake; it's proportional to the access.

Both failure modes are organisational as much as technical. They come from someone optimising for speed of delivery over containment, usually because no one owned the question of what this agent is actually allowed to do. The same dynamics show up in why most enterprise AI pilots die in month four — the killers are rarely the model and almost always the operating decisions around it.

  • Confused deputy: the agent is manipulated into misusing access it legitimately holds (prompt injection, poisoned context).
  • Over-broad permissions: the agent's access far exceeds its task, so a small error becomes a large incident.
  • Both are amplified by autonomy — the more steps an agent takes unsupervised, the more compounding goes wrong before anyone sees it.

Least Privilege Is the Cheapest Insurance You'll Ever Buy

The single highest-leverage decision in agent deployment is scoping. Give every agent the narrowest set of permissions that lets it do its job, and nothing more. A support-triage agent should be able to read tickets and propose responses — not delete records or issue refunds without a human in the loop. This is unglamorous, it slows down the initial build slightly, and it is the difference between an embarrassing log entry and a regulatory disclosure.

The discipline here is to design the permission boundary before you design the prompt. In our work standing up production agents, the teams that ship safely treat the agent's toolset like an API contract: every tool it can call is reviewed, every write action is gated, and high-consequence actions require explicit confirmation. The model can be as creative as it likes within a box it cannot break out of.

This is also where 'autonomous' gets sensibly bounded. Full autonomy on a low-stakes task is fine. Full autonomy on anything that moves money, touches PII, or modifies production needs a checkpoint. The right amount of autonomy is a risk decision, not a technical capability you switch on because the demo was impressive.

  • Scope permissions to the task, not the platform — default to read-only and grant write access deliberately.
  • Gate high-consequence actions (payments, deletions, external comms, PII access) behind human confirmation or hard policy checks.
  • Treat the agent's tool list as a reviewed contract, not an afterthought — every new tool is a new attack surface.

If You Can't Watch It, You Can't Ship It

The second half of DeepMind's framing — real-time monitoring — is where most teams are weakest. It's not enough to constrain an agent up front; you need to observe what it's doing while it does it, and to catch anomalous behaviour before it completes a harmful action rather than in a post-mortem. That means logging every tool call, every decision, and every input the agent consumed, in a form a human can actually review.

Monitoring serves three jobs at once. It's a security control (spotting the confused-deputy attack mid-flight). It's an operational one (knowing why the agent did something weird with a customer). And it's the foundation of improvement — you can't make an agent more reliable if you can't see where it's failing. This is the same muscle that underpins evaluation, which is why we argue the eval harness is the real deliverable, not the agent itself. The infrastructure that lets you observe and grade behaviour is what makes the system safe to iterate on.

Be honest about what's still hard here. Detecting subtle misbehaviour at scale — distinguishing a clever-but-correct action from a clever-but-malicious one — is an open problem. Real-time monitoring catches the obvious and the dangerous; it does not catch everything. The right response is layered defence: constrain so the worst case is survivable, monitor so the bad case is visible, and accept that no single layer is sufficient.

  • Log every tool call, input, and decision in a human-reviewable trail — observability is non-negotiable.
  • Put automated checks on the action stream so high-risk behaviour can be flagged or blocked before it completes.
  • Assume monitoring is partial — pair it with containment so undetected failures still have a small blast radius.

This Is an Operating-Model Question, Not a Tooling One

The uncomfortable truth for leaders is that agent security can't be delegated entirely to whoever builds the agent. The three load-bearing questions are organisational: who owns the permission boundary, who watches the monitors, and who has the authority to pull the plug when something goes wrong. If those don't have names attached, you have a gap no model will close.

In practice this means agents need to enter the same governance perimeter as any other system with access to sensitive data and actions — change control, access review, incident response. The mistake we see across engagements is treating agents as a special category that lives outside normal security process because it's 'AI'. It isn't special. It's a new kind of actor inside an old kind of risk, and your existing controls mostly apply once you decide to apply them.

For teams without senior AI ownership in place, this is precisely the kind of decision that benefits from a fractional Head of AI who has seen these failure modes in production — someone who can set the policy before the first incident sets it for you. The cost of getting the operating model right is small. The cost of discovering you didn't have one, during a breach, is not.

The Right Mental Model, and What It Doesn't Solve

DeepMind's roadmap is valuable less for any single technique and more for the posture it endorses: defence in depth for agents, built on the assumption that the model will sometimes be wrong or manipulated. Treat the agent as an untrusted insider. Scope its access tightly. Watch what it does in real time. Keep a human on the high-consequence loop. None of this is exotic — it's the security thinking your organisation already applies to people and services, pointed at a new kind of actor.

What it doesn't solve is worth saying plainly. Prompt injection remains unsolved at the model layer. Detecting sophisticated misbehaviour is hard. And there's a real tension between autonomy — the thing that makes agents valuable — and control — the thing that makes them safe. Every deployment lands somewhere on that spectrum, and pretending the tension doesn't exist is how teams end up over-trusting a system that was never designed to be trustworthy.

The actionable move is unglamorous and immediate: before you expand any agent's reach, map what it can touch, decide what it should be allowed to do unsupervised, and make sure you can see what it actually did. Containment and observability first; capability second. That ordering is what separates an agent you can defend in a board meeting from one you'll be explaining in a breach notification.

Frequently asked questions

Isn't model alignment enough to make AI agents safe?

No. Alignment reduces how often a model behaves badly, but it's a probabilistic property, not a security control. You still need least-privilege access, sandboxing, and real-time monitoring so that when the model is wrong or manipulated, the damage is contained and visible.

What is the confused-deputy problem in AI agents?

It's when an agent with legitimate access is tricked — often via prompt injection through a document, email, or user request — into misusing that access against you. The agent isn't hacked in the traditional sense; it's doing what it was told by the wrong party, which is why containment and monitoring matter more than trusting the model.

How much autonomy should we give an AI agent?

As much as the task's risk allows and no more. Full autonomy is fine for low-stakes work, but anything that moves money, touches PII, or modifies production should have a human checkpoint. Treat the level of autonomy as a risk decision, not a feature you enable because the demo was impressive.

Who should own AI agent security in our organisation?

It belongs inside your existing security and governance perimeter, with named owners for the permission boundary, the monitoring, and the kill switch. Agents are a new kind of actor inside an old kind of risk — most of your existing controls apply once you decide to treat agents like any other system with sensitive access.

Take the Operational Bottleneck Audit

Our Bottleneck Audit maps where agents touch your systems and where the permission and monitoring gaps are before they become incidents.

Ready to stop experimenting?

Secure your agents before you scale them

We help Series A+ and enterprise teams ship production agents with least-privilege and real-time monitoring built in from day one — not bolted on after the first incident.

Book a Discovery Call

More insights