Why Agent logs aren’t audit evidence — what’s missing in current AI observability

By Punit Shah · Published May 8, 2026 · 12 min read · Source: Fintech Tag

Part 1 of a three-part series on agent observability and governance for regulated industries.

Disclosure: This post was developed with AI assistance for structure and editing. The framework, examples, and judgment are mine.

Across client teams I work with, the same pattern keeps repeating. A board mandate comes down: we need an agentic AI strategy. Engineering teams move quickly. Within a quarter there are POCs — claims processing, customer service, fraud investigation, loan underwriting. Within two quarters there are demos that work. Within three quarters, almost nothing has reached production.

This isn’t an isolated observation. Across the 2026 industry research I follow, the same pattern keeps appearing: enterprises have been moving fast on agentic POCs and slowly on production deployment, with governance and observability gaps consistently named as the top reasons. Deloitte’s 2026 State of AI in the Enterprise survey of more than 3,200 leaders found that only one in five companies had a mature governance model for autonomous agents. Gartner has projected that over 40% of agentic AI projects will be cancelled by 2027, citing governance gaps, unclear ROI, and runaway costs. The directional message is consistent across sources: the gap between demo and production is wider than at any equivalent point in enterprise software adoption, and the things bridging that gap aren’t models; they’re governance, observability, and operational infrastructure.

It’s not the technology that’s stalling these deployments. The models are capable. The frameworks are mature. What stalls them is a question that gets asked somewhere between the demo and the production review, usually by a senior engineer or a compliance officer who has seen what’s coming:

If this agent makes a thousand decisions a month, and a regulator asks us to defend any one of them, can we?

The answer comes back ambiguous. So the deployment moves to a holding pattern. POCs accumulate. Production stays small.

This isn’t a failure of agentic AI. It’s a failure of the layer above it. The observability and governance infrastructure for agents hasn’t caught up with the engineering progress, and in regulated industries the gap is starting to bite.

This series is about that gap. Three posts. This first one diagnoses what’s actually missing in current agent observability, not so we can complain about it, but so we can think clearly about what would replace it.

The question your tools can’t answer

Consider the kind of question a regulator might reasonably ask once an AI agent has been deployed at any scale:

Your sanction agent processed 840 decisions affecting customers over the past three months. Outcomes for customers aged 45+ were 23% lower than for customers aged 18–35. Which decisions do you stand behind? What controls failed? Show us the evidence.

This is illustrative, not a real case. But it’s directionally correct. RBI’s Master Direction on IT Governance, IRDAI’s claims-processing requirements, and the FCA’s Consumer Duty all push toward exactly this kind of probing. The EU AI Act’s high-risk system requirements go further. The question isn’t whether regulators will ask. It’s how soon, and how often.

What does answering this question actually require?

Now consider what most teams have available to answer it. They have logs from LangSmith or Langfuse. They have dashboards. They have prompt traces and tool-call records. They have, in most cases, an engineer who can write SQL.

What they don’t have is a system that takes the regulator’s question and produces an answer in the time the regulator expects. The workflow is mostly manual, mostly built on engineer time, and mostly not designed for cross-examination.

That gap between what the tools produce and what the question requires is the subject of this post.


The first two gaps are technical

The technical gaps reflect what current observability tools weren’t built to do. They aren’t failures of those tools; they’re a consequence of building tools for a class of system that didn’t yet exist when the tools were designed.

Gap 1: Non-determinism breaks the audit assumption

The deepest issue with agent observability has nothing to do with what gets logged. It’s that an agent’s behavior is non-deterministic by design.

Traditional software is auditable because given the same inputs and state, it produces the same outputs. Banking systems, payment rails, and core compliance workflows are built on this foundation. Agents break that assumption. The same customer query, the same data, the same prompt produces different execution paths on different runs. NIST’s AI Risk Management Framework recognizes this directly: generative systems are treated as lifecycle risks requiring ongoing measurement, not one-time certification.

Recent research has quantified what practitioners have suspected. A study from IBM’s Financial Services Market group (Souren et al., 2025) examined output consistency across five model architectures on regulated financial tasks. Smaller models (7–8B parameters) achieved 100% output consistency at temperature zero. Larger frontier models exhibited consistency rates as low as 12.5% — under identical configurations. Even at temperature zero, batch variance and infrastructure-level effects produce non-deterministic outputs. Anthropic publicly disclosed in September 2025 that a miscompiled sampling algorithm had been producing anomalies on certain batch sizes. Academic work on “audit replay failure” has begun to formalize the same observation: an examiner who asks for a flagged decision to be re-run cannot be assured that the re-run reflects systematic behavior rather than sampling variance.
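
To make the metric concrete, here is a minimal sketch of how an output-consistency rate like the one in that study could be measured: run the same task many times under a fixed configuration and see how often the modal output appears. The names below are assumptions, and call_agent is a placeholder for a real model or agent invocation, not any vendor's API.

```python
# A minimal sketch of measuring output consistency across repeated runs.
# call_agent() is a placeholder for the real model or agent call.
import random
from collections import Counter

def consistency_rate(call_agent, task: str, runs: int = 20) -> float:
    """Fraction of runs that produced the single most common output."""
    outputs = [call_agent(task) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs

def call_agent(task: str) -> str:
    # Stand-in agent that is non-deterministic by design.
    return random.choice(["approve", "approve", "approve", "escalate"])

print(f"Consistency: {consistency_rate(call_agent, 'review claim #123'):.0%}")
```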

What this means for governance: when an agent fails in production, you cannot reliably reproduce the failure to investigate it. The failure exists once, in one trace, and the system moves on. You cannot run the equivalent of a regression test. You cannot guarantee the exact conditions that led to the unexpected outcome. Your only artifact is the original trace, which captures what happened but not the distribution of behavior the system was capable of producing.

This is not a problem better logging solves. It’s a problem the conceptual layer above logs has to solve — through statistical baselines, distribution-based monitoring, and the ability to characterize agent behavior across many runs rather than any single run.
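
What that layer could look like, in its smallest form: treat the agent’s behavior on a fixed scenario as a distribution of outcomes, store a baseline from validation runs, and alert when the live distribution drifts away from it. The sketch below is an illustration of the idea, not a description of any existing tool; the outcome labels and the threshold are placeholders.

```python
# A minimal sketch of distribution-based monitoring, assuming each run of a
# fixed scenario can be labeled with a categorical outcome. Illustrative only.
from collections import Counter

def outcome_distribution(outcomes: list[str]) -> dict[str, float]:
    """Normalize a list of categorical outcomes into a probability distribution."""
    counts = Counter(outcomes)
    return {k: v / len(outcomes) for k, v in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Baseline: outcomes from, say, 200 validation runs of one scenario class.
baseline = outcome_distribution(["approve"] * 150 + ["escalate"] * 40 + ["reject"] * 10)

# Live window: the most recent 50 production decisions for the same scenario class.
live = outcome_distribution(["approve"] * 28 + ["escalate"] * 12 + ["reject"] * 10)

DRIFT_THRESHOLD = 0.15  # illustrative; a real system would calibrate this per scenario

drift = total_variation(baseline, live)
if drift > DRIFT_THRESHOLD:
    print(f"ALERT: outcome distribution drifted (TV distance = {drift:.2f})")
```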

Gap 2: Multi-step failures hide in plain sight

A single agent task often involves ten to twenty steps: LLM calls, tool invocations, retrievals, validations, and decision points. Each step looks fine in isolation. The prompts are reasonable. The tool calls succeed. The responses parse correctly. But the trajectory the agent takes to reach the final outcome is where the failure lives.

Did the agent take seven steps to do what should take three? Did it loop on a verification step that kept returning ambiguous results? Did it skip an escalation that policy required? Did it use a tool in a context the tool wasn’t designed for? Did it invent a new sequence of operations that no human reviewer has ever seen?

Most current observability platforms log each call cleanly. They don’t analyze the shape of the call sequence. The industry is increasingly recognizing this: agent failures appear in multi-step causal chains, not at the individual call level, and they require full-session trace capture combined with sequence analysis to detect. The detection problem is fundamentally about encoding a sequence of events as data and analyzing its structure for anomalies — which is a different kind of engineering than logging individual events well.
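
To make that concrete, here is a rough sketch of what trajectory analysis could look like, assuming you can already extract an ordered list of step names per session from your traces: flag transitions never seen in reviewed sessions, repeated steps that suggest loops, and control steps that policy requires but the sequence skipped. The tool names, the approved-transition set, and the thresholds are all hypothetical; the point is that the unit of analysis is the sequence, not the individual call.

```python
# A rough sketch of trajectory analysis over tool-call sequences. The tool
# names, approved transitions, and thresholds are hypothetical.

MAX_REPEATS = 3                         # illustrative loop threshold
REQUIRED_BEFORE_PAYOUT = "fraud_check"  # hypothetical policy: must precede any payout

# Consecutive step pairs seen in human-reviewed, approved sessions.
approved_bigrams = {
    ("retrieve_policy", "fraud_check"),
    ("fraud_check", "approve_payout"),
    ("retrieve_policy", "escalate_to_human"),
}

def analyze_session(steps: list[str]) -> list[str]:
    """Return anomaly findings for one session's ordered tool-call sequence."""
    findings = []
    # 1. Novel transitions: consecutive pairs never seen in reviewed sessions.
    for pair in zip(steps, steps[1:]):
        if pair not in approved_bigrams:
            findings.append(f"novel transition {pair[0]} -> {pair[1]}")
    # 2. Possible loops: the same step repeated more than MAX_REPEATS times.
    for step in set(steps):
        if steps.count(step) > MAX_REPEATS:
            findings.append(f"possible loop on {step} ({steps.count(step)} calls)")
    # 3. Skipped control: payout reached without the required preceding check.
    if "approve_payout" in steps:
        before = steps[: steps.index("approve_payout")]
        if REQUIRED_BEFORE_PAYOUT not in before:
            findings.append("approve_payout reached without fraud_check")
    return findings

session = ["retrieve_policy", "fraud_check", "fraud_check",
           "fraud_check", "fraud_check", "approve_payout"]
for finding in analyze_session(session):
    print("FLAG:", finding)
```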

In practice, this means most multi-step failures are caught by accident. Someone notices an outcome that looks wrong, traces back through the logs, and reconstructs what happened. The reconstruction takes hours. The original anomaly may have been running for weeks before anyone looked.

These two gaps, non-determinism and multi-step trajectory analysis, are what I’d call technical gaps. They reflect what observability tools would need to do to handle agents well, beyond what they were built for. I’ll come back to both in detail in the next post in this series, where I walk through the framework I think actually addresses them.

The next two gaps are institutional

The technical gaps are about what tools can detect. The institutional gaps are about how organizations operate. They are arguably the bigger problem, because they don’t get solved by buying better software.

Gap 3: Logs describe what happened — they don’t prove it

When a regulator asks for evidence, “we have the logs” is not, on its own, a defense. A trace shows that the agent called a tool with certain parameters and received a certain result. It does not show whether the agent called the correct endpoint (versus a test environment), whether the parameters were correct (versus copied from a previous request and silently mutated), or whether the downstream system actually executed the requested action (versus returning an error the agent chose to ignore).

Standard observability output (JSON traces, span IDs, dashboard snapshots) answers the question “what did the system record?” Audit evidence has to answer a stronger question: “what actually happened, and how do we know the record is true?” These are different epistemic standards. The first is about visibility. The second is about provenance, integrity, and the ability to defend the record under cross-examination.
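
To make the distinction concrete, here is one small ingredient of provenance: hash-chaining trace records so that any retroactive edit becomes detectable. This is a sketch of the evidentiary idea only; I’m not claiming any current observability platform ships this, and a production version would also need signing, trusted timestamps, and controlled storage.

```python
# A minimal sketch of tamper-evident trace records via hash-chaining.
# Illustrative only; real audit evidence also needs signatures, trusted
# timestamps, and controlled storage.
import hashlib
import json

def record_hash(record: dict, prev_hash: str) -> str:
    """Hash a trace record together with the hash of the previous entry."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def append(chain: list[dict], record: dict) -> None:
    prev = chain[-1]["hash"] if chain else "genesis"
    chain.append({"record": record, "prev": prev, "hash": record_hash(record, prev)})

def verify(chain: list[dict]) -> bool:
    """Recompute every link; any retroactive edit breaks the chain."""
    prev = "genesis"
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != record_hash(entry["record"], prev):
            return False
        prev = entry["hash"]
    return True

chain: list[dict] = []
append(chain, {"step": "tool_call", "tool": "kyc_lookup", "status": "ok"})
append(chain, {"step": "decision", "outcome": "approve"})
print(verify(chain))                       # True: the record is internally consistent
chain[1]["record"]["outcome"] = "reject"   # someone edits the decision after the fact
print(verify(chain))                       # False: the edit is detectable
```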

Most engineering teams underestimate this gap because their tools genuinely do work for engineering purposes. The dashboards their teams trust are accurate enough for the engineers’ jobs. But trust within the engineering team and trust under regulatory scrutiny are different problems with different requirements. A regulator does not accept LangSmith screenshots. A judge does not accept an engineer’s spreadsheet of pulled rows.

The bridge between observability output and regulatory artifact is currently filled by manual reconstruction — compliance staff writing reports, sometimes well, sometimes poorly, always slowly. There is no path today from raw trace data to court-admissible evidence that does not go through significant human intervention.

Gap 4: Investigation is still gated on engineering time

Even with good logging in place, the workflow from “regulator asks a question” to “compliance team has an answer” runs on engineer time. The pattern is consistent across every team I’ve seen: a regulator or internal compliance officer asks something specific. Engineers write SQL queries. Compliance staff wait. Spreadsheets are built. Reports are drafted. The cycle takes one to two weeks per question.

This works when questions are rare. It collapses when questions become routine — which is exactly what happens once an agent system handles thousands of decisions a month and the regulator wants periodic reviews. There is no system today that takes a question like “show me all decisions affected by the prompt change in week 14, segmented by demographic, with root-cause attribution” and produces an answer in hours rather than weeks.
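
For illustration only, here is the shape such a question could take if every decision were written as a structured, queryable record at decision time. The field names, the prompt-version tag, and the toy records are assumptions; nothing like this exists off the shelf today, which is the point.

```python
# An illustrative sketch of a compliance-facing query over structured decision
# records. Field names, the prompt-version tag, and the records are assumptions.
from collections import defaultdict

decisions = [
    # One record per agent decision, written at decision time.
    {"id": "d-101", "week": 14, "prompt_version": "v3.2", "age_band": "18-35", "outcome": "approve"},
    {"id": "d-102", "week": 14, "prompt_version": "v3.2", "age_band": "45+",   "outcome": "reject"},
    {"id": "d-103", "week": 15, "prompt_version": "v3.2", "age_band": "45+",   "outcome": "approve"},
    {"id": "d-104", "week": 15, "prompt_version": "v3.1", "age_band": "18-35", "outcome": "approve"},
]

# "Show me all decisions affected by the prompt change in week 14, segmented by demographic."
affected = [d for d in decisions if d["prompt_version"] == "v3.2" and d["week"] >= 14]

by_segment = defaultdict(list)
for d in affected:
    by_segment[d["age_band"]].append(d["outcome"])

for band, outcomes in sorted(by_segment.items()):
    approval_rate = outcomes.count("approve") / len(outcomes)
    print(f"{band}: {len(outcomes)} decisions, approval rate {approval_rate:.0%}")
```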

The deeper issue is structural. As long as audit response is gated on engineer availability, agent governance is fragile. Compliance teams are forced to choose between asking fewer questions (and accepting reduced oversight) or accepting longer cycle times (and falling behind regulatory expectations). Neither is a sustainable position once enforcement intensifies.

These two gaps, provenance and investigation workflow, are what I’d call institutional. They are about how compliance functions and regulators interact with AI systems, not about what the systems themselves can detect. I’ll come back to both in the third post in this series, where I argue that current regulatory direction is moving toward exactly the requirements that current institutional infrastructure can’t meet.

What current tools document

It’s worth being precise about how the current observability tooling market handles these capabilities. The table below maps a set of capabilities relevant to regulated-industry agent observability against what each major platform documents publicly. This is a documentation comparison, not a hands-on evaluation: it is based on each platform’s published feature documentation as verified in May 2026.

The point isn’t to rank tools. It’s to surface that the capabilities required for deployment are not yet a default expectation in the agent observability market. Most of these tools are excellent at what they were designed to do. They were not designed to answer the regulator’s question.

[Table: documented capabilities of major agent observability platforms (including LangSmith, Langfuse, Arize Phoenix, and Datadog) against regulated-industry requirements; image not reproduced here.]

What the table makes clear: the engineering observability layer is mature. Agent decision logging, multi-step trace capture, real-time SLA alerts — all standard. Several platforms now offer some form of root-cause assistance, and Arize Phoenix in particular documents continuous drift detection across model behavior. Datadog explicitly markets cluster visualization to identify drift, prompt injection scanners, and dedicated agent monitoring.

Where the picture thins is in the specific capabilities a regulated-industry compliance function needs. Tool-call sequences are logged by all four LLM-observability platforms, but none publicly documents analysis of the structure of those sequences for anomalies. Demographic outcome monitoring as a real-time, alert-driven capability is rare. Regulator-format evidence export — the artifact that compliance teams actually need to put in front of an auditor — is essentially absent from LLM-observability tools and remains the province of legacy GRC platforms that don’t have the agent context.

These are not failings of the tools. They are signals about where the market currently is. Engineering observability is mature. Regulatory observability is, for the moment, a gap that sits in the seam between two tool categories.

What’s coming next

The diagnosis above is the easy part. The harder question is what fills the gap.

In the next post in this series, I’ll walk through the architecture I think actually works — a layered approach to agent observability designed for the question the regulator asks, not just the question the engineer asks. I’ll explain the five categories of monitoring that need to live above the log layer, what makes each of them hard, and why the answer isn’t “better logs” or “another dashboard” but a different conceptual layer entirely.

In the third post, I’ll connect the framework back to active regulatory direction — what RBI, IRDAI, the FCA, and the EU AI Act actually require, what enforcement is likely to look like, and what AI leaders in regulated industries should be doing now rather than after the first audit cycle.

If you’re building agent systems for regulated environments, the gap I’ve described in this post is probably already on your roadmap somewhere — even if it’s not at the top. My argument across this series is that it should be near the top, and that the time to design for the regulator’s question is before the regulator asks.

