Testing AI Products: The Five Layers Most Teams Skip

3 June 2026By ddrtesting AI products · LLM application testing · agent trajectory33 views

How to test AI products built on LLMs and agents. A five-layer framework — from unit tests to red-teaming — with the named tools for each layer.

🚨 Pop Quiz — For Every CTO and Head of Automation Reading This

You have two systems in production. Which one is better tested?

A) The login form on your customer portal — unit-tested, integration-tested, covered by a regression suite that runs on every commit.

B) The LLM-powered agent that reads customer documents, calls internal APIs, and drafts responses that go out under your brand.

If you answered honestly, it's A — and it isn't close. Survey data backs the uncomfortable hunch:

Most engineering organisations shipping LLM features in 2026 are testing them less rigorously than they test their login forms (Source: ContextQA).
Prompt injection has held the number one slot in the OWASP Top 10 for LLM Applications since the list debuted in 2023 — unchanged through the 2025 edition (Source: OWASP / Aembit).
The fundamental cause is structural: LLMs process instructions and data in the same channel, so the boundary security teams have relied on for decades simply isn't there (Source: Aembit).

Here's the thing. The gap isn't laziness. It's that the testing playbook every engineering leader already owns was written for deterministic software — and an AI product breaks the first assumption that playbook makes.

1. Why AI Products Break Traditional Testing

Deterministic software has a property so basic we forget it's there: the same input produces the same output, every time. That single fact is what makes a unit test possible. You assert that add(2, 2) returns 4, and it does, forever.

An LLM breaks that on day one. The same prompt can return different valid outputs across runs. Subtle changes in a retrieval pipeline, a model version, or a prompt template can quietly degrade quality without triggering a single error alert (Source: Maxim AI). Nothing crashes. No exception is thrown. The system just gets worse — and your existing monitoring is blind to it.

So the question changes shape. It is no longer "does this pass or fail?" It becomes "is this good enough, often enough, and safe when it isn't?" That reframing is the whole game. And answering it well means testing at five distinct layers, because an AI product fails differently at each one.

2. The Five-Layer Model

Think in concentric layers, from the deterministic plumbing you can still test the old way, out to the probabilistic behaviour you cannot.

Miss any one of these and you create a blind spot that produces failures the other layers never see. Let's go through each — what it tests, why it's hard, and the tools that actually do the job.

3. Layer 1 — Component Testing (The Part You Already Know)

Around every model sits deterministic machinery: the retrieval step that fetches context, the function call that fires with specific parameters, the parser that turns model output into structured data, the routing logic that decides what happens next. None of this is probabilistic. Does retrieval return the right chunks for a known query? Does the tool call fire with correct arguments? Does your parser survive malformed JSON from the model?

This is the cheapest, fastest, most reliable layer — and the one teams skip because it feels too obvious. Don't. A startling share of "the AI is broken" incidents trace back to a retrieval bug or a parsing failure, not the model.

Tools. Your existing stack does the job: Pytest (Python), Vitest or Jest (TypeScript), and standard CI runners. The only AI-specific discipline is treating model output as untrusted input and sanitising it — OWASP's "improper output handling" risk lives precisely here (Source: Aembit).

4. Layer 2 — Output Quality: Evals Are Your New Regression Suite

This is the heart of it. Because you can't assert exact equality on a non-deterministic output, you build an eval set — a curated collection of inputs paired with known-good expectations — and you score outputs against it. The eval set becomes your regression suite: you run it on every prompt change, model swap, or retrieval tweak, and you watch the score.

Three scoring methods, in rising order of flexibility and cost:

Deterministic / structural match — exact match, regex, schema validation. Cheap and reliable; use it wherever the output is constrained.
Reference-based metrics — for RAG specifically, the Ragas triad is the 2026 standard: faithfulness (is the answer grounded in retrieved context, or hallucinated beyond it?), answer relevancy, and contextual precision/recall (Source: Vervali).
LLM-as-a-judge — use a model to grade outputs against a rubric. Powerful and scalable, but it requires structured rubrics, multiple judge passes, and calibration against human-labelled examples to control bias and drift — it is not a fire-and-forget shortcut (Source: LangChain).

One discipline matters more than tool choice. Weak or outdated evaluation datasets cause more failures than the choice of tool itself — most teams over-rely on LLM-as-a-judge and neglect the dataset (Source: Qalified). Grow your golden dataset from real production failures, and treat it as a version-controlled asset.

Tools — and when to reach for each:

DeepEval (open source) — treats evals as unit tests with native Pytest integration, 50+ research-backed metrics covering RAG, agents, conversation, and safety. The default when your team wants evals living inside CI/CD as code (Source: Confident AI). Its hosted companion Confident AI adds a UI, dataset management, and a free results tier.
Ragas (open source) — the de facto standard for RAG-specific, reference-free evaluation. Reach for it when retrieval grounding is your main risk (Source: Vervali).
Braintrust — strongest for fast prompt iteration and experiment comparison.
MLflow / W&B Weave — when evaluation needs to sit inside an existing ML experiment-tracking workflow.

5. Layer 3 — Agentic Trajectory: Test the Path, Not Just the Answer

The moment your product plans and acts over multiple steps — choosing tools, calling APIs, recovering from errors — checking the final answer is no longer enough. A correct answer reached through a reckless path is a latent incident. So you evaluate the trajectory: the exact sequence of tool calls, state transitions, approval gates, error recovery, loop behaviour, and budget adherence.

Trajectory testing matured from a 2024 idea into default production practice by 2026 (Source: genai.qa). The questions it answers: Did the agent pick the right tools in the right order? Did it recover from a failed step? Did it stay in scope, or wander? Did it terminate, or loop forever?

The 2026 standard is golden trajectories — curated, known-correct multi-step traces that form your evaluation set. Mature teams maintain 50–500 golden trajectories, version-controlled and expanded as production reveals new failure modes (Source: genai.qa). Think of them as integration tests for your agent, but compared semantically rather than by exact match.

Two evaluation styles, and when each fits:

Trajectory match against a hard-coded reference — deterministic, fast, cheap, no extra LLM call. Use it for well-defined workflows where you know exactly which tools should fire and in what order (Source: LangChain Docs).
LLM-as-judge on the trajectory — qualitatively assess efficiency and appropriateness without a rigid reference. Use it when the right path has legitimate variation (Source: LangChain Docs).

Tools:

LangSmith — the most-adopted trajectory platform in 2026, capturing the full execution tree of steps, tool calls, and reasoning. Tightest with LangChain/LangGraph, but framework-agnostic via OpenTelemetry ingestion. CI integration with Pytest, Vitest, and GitHub lets you fail a pipeline when scores drop (Source: LangChain). The open-source agentevals package gives you trajectory matching and judge evaluators directly (Source: LangChain Docs).
Arize Phoenix (open source, Apache 2.0) — strong OpenTelemetry-based trace capture when you want self-hosted observability (Source: genai.qa).
DeepEval — its v3 component-level metrics score any step of an agent trace (tools, memory, retrievers), keeping agent evals in the same framework as your output evals.

6. Layer 4 — Safety and Adversarial: Where Regulated Industries Live or Die

Everything above tests whether the product works. This layer tests whether it can be made to misbehave. And the threat model is not hypothetical: prompt injection remains OWASP's #1 LLM risk, with sensitive-information disclosure surging to #2 (Source: Lasso Security). Critically, RAG and fine-tuning ground a model but do not secure it — they don't close the injection hole (Source: SOCFortress).

Agents raise the stakes. Once a model can browse, execute code, query databases, and call APIs, the blast radius of a single prompt injection expands dramatically — which is why OWASP released a separate Agentic AI Top 10 in late 2025 (Source: BSG).

The single highest-leverage architectural defence: segregate system instructions from retrieved content at the prompt-construction layer — never concatenate untrusted chunks into the same scope as operator instructions without an explicit separator and an extraction guardrail (Source: Vervali).

Tools — a defensible red-teaming programme:

Promptfoo (open source, Apache 2.0) — the most widely adopted red-teaming tool, with 25%+ Fortune 500 adoption noted at the time of its OpenAI acquisition in March 2026. Run automated adversarial probing in CI/CD on every release (Source: Vervali).
Garak (open source) — vulnerability scanning for LLMs; pairs with Promptfoo for CI-stage probing.
DeepTeam (Confident AI) — multi-turn red-teaming mapped directly to the OWASP LLM Top 10 categories, for pre-launch security evaluation (Source: DeepTeam).

A mature programme runs automated probing on every release, full OWASP LLM Top 10 regression on a schedule, and a documented incident-response plan for newly disclosed vulnerabilities (Source: Vervali).

7. Layer 5 — Production: Testing Becomes a Loop, Not a Gate

With non-deterministic systems you cannot catch everything before launch. So testing doesn't stop at the deploy button — it becomes a continuous loop. You instrument production: full trace logging, drift monitoring, and capture of every user thumbs-down. Online evaluations run judges on a sampled subset of live traffic, acting as an early-warning system for quality degradation before it spreads (Source: LangChain).

The payoff is the data flywheel: real-world failures flow into annotation queues for expert review, then become regression tests that stop the same bug from ever reaching users again (Source: LangChain). Production failures don't just get fixed — they permanently strengthen Layer 2 and Layer 3.

Tools: LangSmith, Arize Phoenix, and Langfuse (open source, self-hostable) all support production tracing with online evaluation and annotation queues. Langfuse is the common pick when self-hosting is a hard requirement — relevant for any team with data-residency constraints.

8. The Caveat: Correctness Is Not the Whole Test

Here's what most teams get wrong even after they build evals. They optimise for accuracy and forget that a correct answer that takes 40 seconds or costs $2 per call is a failed test in product terms. Latency and cost-per-interaction are first-class test dimensions, not afterthoughts.

And one more, easy to miss: because outputs vary, a single pass tells you almost nothing. You have to run tests multiple times and look at pass rates and distributions, not a single green check. Call it the non-determinism tax. Budget for it.

9. The Symprio Approach: Evaluation as Engineering Discipline

Symprio builds AI-enabled products as well as advising on them — and the difference shows in how we treat testing. We don't bolt evals on at the end. Four principles guide the work.

Eval-driven development. We build the eval set alongside the feature, the way disciplined teams write tests alongside code. The golden dataset exists before the product ships, and it grows from every production failure.

The right layer for the right risk. A RAG knowledge assistant lives or dies on Layer 2 grounding; a multi-step agent lives or dies on Layer 3 trajectory and Layer 4 containment. We map the testing investment to where your product actually fails, rather than spreading it evenly.

Security is architecture, not a checklist. We segregate instructions from retrieved content at the prompt-construction layer and red-team against the OWASP LLM Top 10 before launch — because for regulated and enterprise buyers, a confidently wrong or manipulable system is a business risk, not a bug.

Continuous by default. Every product we co-build ships with production tracing and a feedback loop wired in, so the system gets measurably better in production instead of silently drifting worse. We transfer that pipeline to your team and certify them to run it.

10. What This Looks Like in Practice

Within 90 days of engagement, teams typically stand up testing infrastructure such as:

A RAG evaluation harness scoring faithfulness and contextual precision on every prompt or retrieval change, wired into CI.
An agent trajectory suite of 50–200 golden trajectories that fails the build when an agent picks the wrong tool path.
A red-teaming gate running OWASP LLM Top 10 probes automatically on each release.
A production observability loop that turns real thumbs-down events into new regression cases automatically.
A cost-and-latency dashboard treating both as pass/fail test dimensions alongside correctness.

These are not moonshots. Together they convert "we think the AI works" into "we can prove how well it works, and we'll know the moment it doesn't."

11. Verdict

The teams winning with AI products in 2026 treat evaluation as continuous engineering work, not a pre-launch checkbox. The five layers aren't optional extras — skip any one and you ship a blind spot. The good news: the tooling has matured fast, most of it is open source, and the discipline maps cleanly onto practices your engineers already understand. The shift required is mostly one of mindset — from pass/fail to good enough, often enough, safe when it isn't.

12. Engaging with Symprio

Symprio engages with engineering and automation leaders in three formats:

Discovery workshop (half-day, complimentary). A working session with your senior team to find the highest-leverage place your AI product is currently untested.
Pilot engagement (8–12 weeks). Co-build the evaluation and red-teaming pipeline for one production AI feature, including architecture transfer and team enablement.
Long-term partnership. Embedded capacity to build and govern an AI product portfolio over a 12–24 month horizon.

Book a discovery call → · Share your AI product; receive a testing-architecture sketch within 48 hours →

Explore our AI product engineering services →

Frequently Asked Questions

What is testing AI products?

Testing AI products means verifying that systems built on LLMs and agents behave correctly, reliably, and safely despite being non-deterministic. Unlike traditional QA, it relies on evaluation sets scored for quality rather than exact pass/fail assertions, across layers from component tests to production monitoring.

How is LLM application testing different from normal software testing?

Traditional software is deterministic: the same input always gives the same output, so you can assert exact results. LLMs produce varying outputs for the same input, so testing shifts from pass/fail assertions to scoring quality, grounding, and safety across many runs — measuring pass rates, not single passes.

What is an LLM eval set or golden dataset?

An eval set (or golden dataset) is a curated collection of inputs paired with known-good expected outputs. You score model responses against it on every prompt, model, or retrieval change — making it the regression suite for non-deterministic AI. The best ones grow from real production failures.

What is agent trajectory evaluation?

Trajectory evaluation scores the entire path an agent takes — which tools it called, in what order, how it recovered from errors, whether it stayed in scope and terminated — rather than only its final answer. It uses curated "golden trajectories" as the known-correct reference.

Which tools are used for testing AI products in 2026?

Common tools include DeepEval and Confident AI (broad metric coverage), Ragas (RAG grounding), LangSmith and Arize Phoenix (agent trajectory and tracing), Langfuse (self-hosted observability), and Promptfoo, Garak, and DeepTeam (red-teaming against the OWASP LLM Top 10).

Is prompt injection still a serious risk?

Yes. Prompt injection has been the number one risk in the OWASP Top 10 for LLM Applications since 2023 and remained so in the 2025 edition, because LLMs process instructions and data in the same channel. RAG and fine-tuning ground a model but do not secure it against injection.

Sources & Further Reading

ContextQA — LLM Testing Tools and Frameworks in 2026: The Engineering Guide — https://contextqa.com/blog/llm-testing-tools-frameworks-2026/
Confident AI — 10 Best AI Evaluation Tools for 2026 — https://www.confident-ai.com/knowledge-base/compare/best-ai-evaluation-tools-2026
Confident AI — Top 7 LLM Evaluation Tools in 2026 — https://www.confident-ai.com/knowledge-base/compare/best-llm-evaluation-tools
Maxim AI — Top 5 LLM Evaluation Platforms in 2026 — https://www.getmaxim.ai/articles/top-5-llm-evaluation-platforms-in-2026/
Qalified — Top Open Source LLM Evaluation Tools (2026) — https://qalified.com/blog/top-llm-evaluation-tools/
Vervali — AI & LLM App Testing 2026: Tools, Evaluation, Compliance — https://www.vervali.com/blog/ai-and-llm-application-testing-in-2026-the-definitive-guide/
LangChain — AI Agent Observability: Tracing, Testing, and Improving Agents — https://www.langchain.com/articles/agent-observability
LangChain — LLM Evaluation Framework: Trajectories vs. Outputs — https://www.langchain.com/articles/llm-evaluation-framework
LangChain Docs — Trajectory Evaluations — https://docs.langchain.com/langsmith/trajectory-evals
LangChain — LangSmith Evaluation — https://www.langchain.com/langsmith/evaluation
MLflow — Top 5 Agent Evaluation Tools in 2026 — https://mlflow.org/top-5-agent-evaluation-frameworks
genai.qa — AI Agent Trajectory Testing 2026 — https://genai.qa/ai-agent-trajectory-testing-2026/
Aembit — OWASP Top 10 for LLM Applications (2025) — https://aembit.io/blog/owasp-top-10-llm-risks-explained/
Lasso Security — OWASP Top 10 for LLMs: 2025 Key Updates — https://www.lasso.security/blog/owasp-top-10-for-llm-applications-generative-ai-key-updates-for-2025
SOCFortress — OWASP Top 10 for LLM Applications (2025) — https://socfortress.medium.com/owasp-top-10-for-llm-applications-2025-7cbb304aabf0
BSG — OWASP LLM Top 10 (2025): Vulnerabilities & Mitigations — https://bsg.tech/blog/owasp-llm-top-10/
DeepTeam (Confident AI) — OWASP Top 10 for LLMs 2025 — https://www.trydeepteam.com/docs/frameworks-owasp-top-10-for-llms

#AITesting #LLMEvaluation #AgenticAI #AIGovernance #MLOps #EnterpriseAI #Symprio #AIProductEngineering

Symprio builds and governs enterprise AI products — and tests them like the business-critical systems they are.