How to build a reliable AI agent: guardrails, evaluation, observability

AI agents · Zegaware engineering · 16 June 2026 · 10 min read

Last updated: 16 June 2026

A reliable AI agent does the right thing, refuses the wrong thing, and leaves a trace that tells you which happened. Reliability is not a better model. It comes from three engineering disciplines wrapped around the model: deterministic guardrails with least-privilege tool access, evaluation against real tasks, and observability that traces every step. Human approval gates the consequential actions.

Most agents look reliable in a demo and behave very differently once they touch real systems, real data and real users. The gap is rarely the model. It is the engineering around the model: the controls that bound what the agent can do, the tests that prove it still works, and the traces that tell you what happened when it does not. In our audits of artificial intelligence (AI) software, the agents that fail in production are almost never the ones with the weakest model. They are the ones with the weakest scaffolding.

What "reliable" means for an agent

A reliable agent does three things: it does the right thing on the tasks it is given, it refuses or escalates the wrong thing, and it leaves enough of a record that you can tell which of the two just happened. The third part is the one teams forget. An agent that succeeds ninety-nine times and fails silently on the hundredth is not reliable, because you cannot find the failure until it has already cost you something.

An agent is not a single prompt and a single answer. It is a loop: the large language model (LLM) plans a step, calls a tool or an application programming interface (API), reads the result, and decides what to do next, often for many steps before it stops. Reliability is a property of that whole loop. A model can be accurate on a benchmark and still drive an unreliable agent, because the failures live in the steps between the model and the outside world.

This is why "use a better model" is not an answer to reliability. A stronger model changes the quality of one step. Reliability is about what happens across all of them, and about what the agent is allowed to do when a step goes wrong.

Why agents fail in production

Agents fail in ways that single-shot models do not, because they act over many steps and against systems that change underneath them. Four failure modes account for most of what we see when we audit them.

Errors compound across multi-step tasks

A small error early in a task does not stay small. Each step takes the previous step's output as its input, so an ambiguous instruction or a slightly wrong assumption propagates and grows. Arize's field analysis of production failures puts it plainly: "Small ambiguities compound fast when agents interact with live data and user constraints" [4]. A long task with a high per-step success rate can still fail often end to end, because the per-step probabilities multiply against you.
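The arithmetic behind that last sentence can be sketched in a few lines. This is a simplified model that assumes every step succeeds independently with the same probability, which real agents will not satisfy exactly; the function name and the 95% figure are illustrative, not measurements.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return per_step_success ** steps


# Illustrative only: a 95% per-step rate across tasks of increasing length.
for steps in (1, 5, 10, 20, 50):
    rate = end_to_end_success(0.95, steps)
    print(f"{steps:>2} steps at 95% each -> {rate:.1%} end to end")
```

Under these assumptions a 20-step task at 95% per step completes cleanly only about 36% of the time, which is why a per-step figure that sounds high says little about the whole run.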

Silent failures that report success

The most expensive failures are the ones that look like success. An agent calls an API, receives a Hypertext Transfer Protocol (HTTP) 200 "OK" response carrying an empty data list, and concludes the task is done. Arize gives the canonical example: a 200 "OK" with an empty result that the agent reads as "the search worked perfectly, there is no data for this user", when the real cause was a malformed query [4]. The transport succeeded, the outcome was wrong, and nothing in the logs looks like an error.
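One way to make this failure visible is to classify the outcome of a call explicitly instead of inferring it from the status code. The sketch below is a minimal illustration under stated assumptions: the `u-<digits>` ID format, the `send` callable and the outcome labels are all hypothetical, not part of any real API.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

USER_ID_PATTERN = re.compile(r"^u-\d+$")  # hypothetical ID format


@dataclass
class SearchOutcome:
    status: str  # "ok", "empty", "invalid_query" or "transport_error"
    records: list = field(default_factory=list)


def checked_search(user_id: str, send: Callable[[str], tuple]) -> SearchOutcome:
    """Validate the query, call the API, and label the outcome explicitly."""
    # Reject a malformed query before it can come back as a "valid" empty list.
    if not USER_ID_PATTERN.fullmatch(user_id):
        return SearchOutcome("invalid_query")
    status_code, records = send(user_id)  # send returns (status_code, records)
    if status_code != 200:
        return SearchOutcome("transport_error")
    if not records:
        # A 200 with no data is its own state, not an implicit success.
        return SearchOutcome("empty")
    return SearchOutcome("ok", list(records))
```

The point of the design is that "empty" and "invalid_query" reach the agent and the logs as distinct states, so an empty 200 can be escalated or verified rather than read as "done".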

Prompt injection through the content the agent reads

An agent that reads web pages, documents, emails or tickets can end up following instructions embedded in whatever it reads. OWASP (the Open Worldwide Application Security Project) classifies this as prompt injection, and the indirect form is the dangerous one for agents: indirect prompt injections occur when a model accepts input from external sources, such as websites or files, that then alter its behaviour in unintended ways [2]. The content the agent was sent to summarise can instruct it to exfiltrate data or call a tool it should not. We cover the mechanics in prompt injection explained.

Excessive agency

When an agent can do more than its task requires, every other failure becomes more damaging. OWASP names this Excessive Agency, the vulnerability that enables damaging actions to be performed in response to unexpected, ambiguous or manipulated outputs from a model [1]. It traces to three root causes: too much functionality (tools the agent does not need), too many permissions (broad access to downstream systems), and too much autonomy (high-impact actions taken with no verification) [1]. A compounding error or an injected instruction is contained when the agent can only read. It is a breach when the agent can also delete, pay or email.

The three disciplines that produce reliability

Reliability comes from three disciplines applied together: guardrails, evaluation and observability. None of them is a model upgrade. All of them are ordinary engineering, applied to a non-deterministic component.

Discipline	What it does	What it is not
Guardrails	Deterministic limits on what the agent can touch, plus human approval for high-impact actions	A second model asked to police the first
Evaluation	Measured success on real tasks, run before and after every change	A demo, or a subjective read of the output
Observability	A full trace of every step, tool call and decision	Logging only the final answer

Guardrails: deterministic controls, not a second model

Guardrails are the deterministic limits around the model: what tools exist, what each tool is allowed to touch, and which actions require a human. The word deterministic matters. A guardrail you can reason about is code with a known output: an allowlist, a schema check, a spend limit, a permission boundary enforced by the downstream system rather than by the agent's judgement. Asking a second model to police the first is not a guardrail, because it adds another non-deterministic component that can be wrong, or injected, in the same ways as the first.
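As a rough illustration of "code with a known output", the sketch below combines an allowlist, a schema check and a spend limit in one function. The tool names, the argument name and the limit are hypothetical, and in a real system the same limits should also be enforced by the downstream service.

```python
from dataclasses import dataclass

ALLOWED_TOOLS = {"search_orders", "refund_order"}  # hypothetical tool names
MAX_REFUND_PENCE = 5_000  # hypothetical spend limit


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


def check_call(call: ToolCall) -> tuple[bool, str]:
    """Return (allowed, reason). Same input always gives the same answer."""
    # Allowlist: tools that are not listed do not exist for the agent.
    if call.name not in ALLOWED_TOOLS:
        return False, f"tool {call.name!r} is not on the allowlist"
    if call.name == "refund_order":
        amount = call.args.get("amount_pence")
        # Schema check: a positive integer, and not a bool posing as one.
        if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
            return False, "amount_pence must be a positive integer"
        # Spend limit: a hard ceiling the model cannot argue its way past.
        if amount > MAX_REFUND_PENCE:
            return False, "refund exceeds the spend limit"
    return True, "allowed"
```

Nothing here asks a model for an opinion, which is what makes the check something you can test and reason about.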

Least privilege is the core of it. OWASP's mitigation for excessive agency is to minimise tools and apply least-privilege permissions, so the agent holds only what the task needs [1]. The UK's National Cyber Security Centre (NCSC) frames the same rule for agents directly: "give agents only the minimum access they need, for the shortest time required" [3]. In practice that means scoped credentials, narrow tool definitions, and access that is granted for a task and revoked after, not a standing key with broad rights.
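A minimal sketch of task-scoped, time-limited access might look like the following. The scope strings and function names are hypothetical, and this in-process check only illustrates the shape of the idea; in practice the expiry and scopes would be enforced by the credential issuer and the downstream system.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Grant:
    scopes: frozenset
    expires_at: float  # seconds since the epoch


def issue_grant(scopes: Iterable[str], ttl_seconds: float,
                now: Optional[float] = None) -> Grant:
    """Issue access for one task: only the named scopes, only for ttl_seconds."""
    now = time.time() if now is None else now
    return Grant(frozenset(scopes), now + ttl_seconds)


def is_permitted(grant: Grant, scope: str, now: Optional[float] = None) -> bool:
    """Permit only scopes that were granted and have not yet expired."""
    now = time.time() if now is None else now
    return scope in grant.scopes and now < grant.expires_at
```

For example, `issue_grant({"orders:read"}, ttl_seconds=300)` would cover a five-minute read-only task, and the same grant would fail for `orders:write` or after the five minutes have passed.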

High-impact actions get a human gate. OWASP is explicit: require a human to approve high-impact actions before they are taken [1]. Reading is cheap to allow. Writing, paying, deleting and sending are not.
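The gate itself can be very small. The sketch below assumes a hypothetical `approve` callback that blocks until a person decides; the action labels mirror the four verbs above and are otherwise illustrative.

```python
from typing import Callable

# Action kinds that need sign-off: writing, paying, deleting and sending.
HIGH_IMPACT = {"write", "pay", "delete", "send"}


def execute(action_kind: str,
            run: Callable[[], str],
            approve: Callable[[str], bool]) -> str:
    """Run the action, but only after human approval when it is high impact."""
    if action_kind in HIGH_IMPACT and not approve(action_kind):
        # The action is never invoked without approval.
        return "blocked: awaiting human approval"
    return run()
```

The agent proposes by constructing `run`, and the approval callback decides whether it is ever called. Low-impact kinds such as a read pass straight through.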

Evaluation: measure real task success, not vibes

You cannot call an agent reliable if you have not measured whether it succeeds on real tasks. Evaluation is the discipline of building a set of representative tasks with known good outcomes, running the agent against them, and scoring whether it actually achieved the outcome, not whether it produced plausible text. This is the difference between "it looked right in the three cases I tried" and a number you can track over time.
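A bare-bones version of such an evaluation set could look like this. It is a sketch, not a framework: `EvalCase`, `run_eval` and `success_rate` are hypothetical names, the agent is modelled as a plain function from task text to result text, and each case carries its own outcome check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    task: str
    check: Callable[[str], bool]  # True only if the outcome was achieved


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run the agent on every case and record pass or fail per case."""
    results: dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(agent(case.task)))
        except Exception:
            # A crash counts as a failed task, not as a missing data point.
            results[case.name] = False
    return results


def success_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed, or 0.0 for an empty set."""
    return sum(results.values()) / len(results) if results else 0.0
```

The important design choice is that `check` inspects the outcome (was the right record changed, was the right total returned) and not how plausible the text looks.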

Measurement also has to run on every change. A new prompt, a new tool, a model version bump: each can improve one case and regress ten others, and without an evaluation set you will not see it until a user does. NIST (the National Institute of Standards and Technology) builds this into its voluntary AI Risk Management Framework as one of four core functions, Measure, alongside Govern, Map and Manage, precisely because assessing and tracking AI risk has to be a standing activity rather than a launch-day check [6]. The same logic that makes us insist on tests for AI-generated code applies to the agent that orchestrates it.
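Catching that kind of regression comes down to comparing per-case results before and after a change, not just the headline score. The sketch below assumes each evaluation run is summarised as a mapping from case name to pass or fail; the function name and the case names in the usage note are hypothetical.

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Names of cases that passed before the change and do not pass after it."""
    return sorted(
        name
        for name, passed in baseline.items()
        # A case missing from the candidate run is treated as a failure.
        if passed and not candidate.get(name, False)
    )
```

For instance, with a baseline of `{"refund": True, "lookup": True, "export": False}` and a candidate of `{"refund": True, "lookup": False, "export": True}`, the overall pass count is unchanged, but `regressions` returns `["lookup"]`, which is the case a user would otherwise find first.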

Observability: trace every step

When an agent fails, you need to reconstruct exactly what it did, and that is only possible if you recorded it. Observability for agents means tracing every step: the plan, each tool call with its inputs and outputs, each decision, and the final action. A trace turns "the agent did something wrong" into "at step four it called this API, received an empty result, and misread it as success".
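A trace does not need heavy tooling to be useful. The following sketch records each step of a run as structured data. `Trace`, `Step` and the step kinds are hypothetical names chosen for illustration; a production system would more likely send these records to a tracing backend than keep them in memory.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    index: int
    kind: str  # e.g. "plan", "tool_call", "decision", "final_action"
    detail: dict
    timestamp: float


@dataclass
class Trace:
    run_id: str
    steps: list = field(default_factory=list)

    def record(self, kind: str, **detail) -> Step:
        """Append one step, numbered in the order it happened."""
        step = Step(len(self.steps) + 1, kind, detail, time.time())
        self.steps.append(step)
        return step

    def to_json(self) -> str:
        """Serialise the whole run; unknown value types fall back to str()."""
        return json.dumps(asdict(self), default=str)
```

With this in place, a call such as `trace.record("tool_call", tool="search", inputs={"user": "u-42"}, output={"status": 200, "records": []})` followed by `trace.record("decision", conclusion="task complete")` leaves exactly the evidence needed to see that an empty result was read as success.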

This is also the main defence against silent failure. The empty 200 problem is invisible at the level of "did the run finish", and obvious at the level of "what did each step return and decide" [4]. Logging only the final answer hides that kind of failure. Tracing the path exposes it. Observability is what makes evaluation and guardrails improvable, because it is where you find the cases worth adding to the evaluation set and the actions worth putting behind a human gate.

Human-in-the-loop for consequential actions

Human oversight is not a fallback for a weak agent. It is the design for consequential actions. The principle is to let the agent propose and a human dispose wherever an action is hard to reverse. OWASP's human-in-the-loop control for high-impact actions is the specific form of this [1], and the NCSC offers a blunt readiness test for whether an agent should be acting at all: "if you cannot understand, monitor or contain an agent's actions, it is not ready for deployment" [3].

That test is also a deployment plan. The NCSC advice is to deploy agentic AI incrementally, starting with tightly bounded pilots using clearly defined tasks [3], widening autonomy only as the evaluation numbers and the traces earn it. OWASP's Agentic Security Initiative makes the same move at the level of threats, offering a threat-model-based reference of emerging agentic threats and their mitigations so teams can reason about the new failure surface before they widen access [5]. Start narrow, measure, watch, then loosen.

Reliability is engineering discipline, not a model upgrade

It is worth saying plainly: a more capable model does not make an unreliable agent reliable. A stronger model can raise the per-step success rate, which helps, but it does not bound what the agent can touch, it does not measure whether the system still works after a change, and it does not record what happened when it fails. Those are guardrails, evaluation and observability, and they are properties of the system you build around the model, not of the model itself. In our audits, the teams with reliable agents are not the ones with the newest model. They are the ones who treated the agent as software and engineered it accordingly.

Frequently asked questions

What makes an AI agent reliable?

A reliable agent does the right thing on its tasks, refuses or escalates the wrong thing, and records enough that you can tell which happened. That comes from three disciplines around the model: deterministic guardrails with least-privilege tool access, evaluation against real tasks, and observability that traces every step. A human approves the high-impact actions.

Why do AI agents fail in production?

Agents fail because they act over many steps against systems that change. Small early errors compound across the task, some failures report success while the real outcome is wrong (an empty but valid response read as "done"), content the agent reads can inject instructions, and an over-permissioned agent turns a small error into real damage rather than a contained one.

Can a better model make an agent reliable?

No. A stronger model can raise the success rate of individual steps, but it cannot bound what the agent is allowed to touch, cannot prove the system still works after a change, and cannot reconstruct a failure after the fact. Those come from guardrails, evaluation and observability. Reliability is engineering discipline applied around the model, not a model upgrade.

Do AI agents need a human in the loop?

For consequential actions, yes. Reading data can be automated freely, but actions that are hard to reverse (paying, deleting, sending, changing production systems) should require human approval. OWASP recommends a human approve high-impact actions before they are taken, and the NCSC advises starting with tightly bounded pilots and widening autonomy only as evidence accrues.

Build agents you can rely on

Reliability is not a feature you switch on. It is the engineering you do around the model. If you are building or running agents and want senior engineers to pressure-test the guardrails, the evaluation and the observability before they meet real users, that is the work we do on AI Agents. We will tell you what is safe to let an agent do, and what needs a human in front of it, and we will put our name to the verdict.

Sources

[1] OWASP, "LLM06:2025 Excessive Agency", Top 10 for LLM Applications 2025. https://genai.owasp.org/llmrisk/llm062025-excessive-agency/
[2] OWASP, "LLM01:2025 Prompt Injection", Top 10 for LLM Applications 2025. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
[3] National Cyber Security Centre, "Thinking carefully before adopting agentic AI", 15 May 2026. https://www.ncsc.gov.uk/blogs/thinking-carefully-before-adopting-agentic-ai
[4] Aryan Kargwal, "Why AI Agents Break: A Field Analysis of Production Failures", Arize AI, 29 January 2026. https://arize.com/blog/common-ai-agent-failures/
[5] OWASP, "Agentic AI: Threats and Mitigations", GenAI Security Project, 17 February 2025. https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/
[6] National Institute of Standards and Technology, "AI Risk Management Framework". https://www.nist.gov/itl/ai-risk-management-framework
