AI Agents in Financial Services: The Promise Is Real, But the Foundation Is Cracked
Naveen Sundaresan
Before we hand the keys to the trading floor, we need to fix hallucination, data leakage, and the illusion of auditability
On 5 May 2026, Anthropic released ten preconfigured Claude agents for financial services, covering pitchbook drafting, KYC screening, financial modeling, month end close, valuation review, statement audit, general ledger reconciliation, and related work, deployable in days rather than months [1]. Customer testimonials from Citadel, BNY, Carlyle, Mizuho, FIS, Travelers, Walleye Capital, and Hg accompanied the launch. The accompanying Claude Opus 4.7 model was reported to lead the Vals AI Finance Agent benchmark at 64.37 percent [2]. A partner ecosystem spanning FactSet, S&P Capital IQ, MSCI, PitchBook, Morningstar, LSEG, Daloopa, and a new Moody’s MCP app, alongside fresh connectors from Dun & Bradstreet, Guidepoint, IBISWorld, SS&C Intralinks, Third Bridge, and Verisk, lent the package considerable institutional credibility.
Welcome to the age of the financial AI agent.
The excitement is understandable. The workflow pain in financial services is real. Analysts are buried in repetitive tasks, compliance teams are overwhelmed by document volume, and operations teams still run manual reconciliation processes that belong in another era. If AI agents can compress that burden, the value creation is enormous.
Yet there is a pattern in enterprise technology adoption that repeats with uncomfortable regularity. The demonstration is flawless, the pilot is promising, and the production deployment is where reality asserts itself. For financial AI agents, reality has three sharp edges that the current hype cycle is smoothing over: hallucination, data leakage, and the illusion of auditability.
Until these risks are demonstrably controlled, not merely acknowledged or loosely mitigated, the promise of financial AI agents will remain exactly that.
A note on framing before we proceed. Most commentary on AI in finance is consequentialist. It weighs efficiency gains against operational risks and concludes that the math favours adoption when the risk function can be controlled. That arithmetic is incomplete. Financial institutions also operate under a deontological constraint that is prior to the efficiency calculus, namely fiduciary duties to clients, statutory obligations to regulators, and an institutional duty of explainability to society. These obligations do not bend to the cost of meeting them. The duty to govern is prior to the right to deploy.
Problem One: Hallucination Is Not a Minor Bug
In consumer applications, hallucination is an inconvenience. In financial services, it is a regulatory and reputational event.
Large language models, the engines powering every agent currently being marketed to financial institutions, do not retrieve facts. They generate text that is statistically consistent with their training data [3]. The distinction matters in practice.
Ask an agent to summarise an earnings call transcript. It will produce something that reads as authoritative and structured. But if the transcript contains an ambiguous figure, say a revenue number stated differently in the prepared remarks and the Q&A, the model will resolve that ambiguity silently, choosing one interpretation without flagging the conflict. The output looks clean. The underlying judgment is invisible.
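To make the failure mode concrete: one mitigation an institution might layer in front of the agent is a pre-review check that extracts candidate figures from each section of a transcript and flags disagreements for a human before any summary is trusted. The sketch below is illustrative only; the section labels, regular expression, and tolerance are assumptions rather than any vendor's product, and real transcripts would need far more robust extraction.

```python
import re

# Hypothetical sections of an earnings call transcript (illustrative only).
transcript = {
    "prepared_remarks": "Revenue for the quarter was $4.21 billion, up 8 percent.",
    "qa_session": "As mentioned, revenue came in at roughly $4.3 billion this quarter.",
}

FIGURE_PATTERN = re.compile(r"\$([\d.]+)\s*(billion|million)", re.IGNORECASE)

def extract_figures(text):
    """Return dollar figures normalised to millions."""
    scale = {"billion": 1000.0, "million": 1.0}
    return [float(value) * scale[unit.lower()] for value, unit in FIGURE_PATTERN.findall(text)]

def flag_conflicts(sections, tolerance=0.01):
    """Flag figure pairs that disagree by more than `tolerance` (relative)."""
    figures = {name: extract_figures(text) for name, text in sections.items()}
    values = [v for vals in figures.values() for v in vals]
    conflicts = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if abs(a - b) / max(a, b) > tolerance:
                conflicts.append((a, b))
    return figures, conflicts

figures, conflicts = flag_conflicts(transcript)
if conflicts:
    print("Conflicting figures found; route to human review:", conflicts)
else:
    print("No material conflicts detected:", figures)
```

The check does not make the agent more accurate. It simply refuses to let an ambiguity be resolved silently, which is the property the workflow actually needs.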
Now place that summary into a pitchbook that goes to a client, a credit memo that goes to a risk committee, or a regulatory filing.
The financial sector has seen what happens when numbers in client documents are wrong. Consequences range from client complaints to regulatory censure to litigation. In an environment where even small valuation errors can have material consequences, the tolerance for unchecked generative error is extremely low.
Current models have made significant progress on factual accuracy. But hallucination has not been eliminated. It has been reduced and made less predictable, which in some ways is more dangerous. A model that hallucinated consistently could be caught in testing. A model that hallucinates occasionally, unpredictably, and in ways that look plausible is a compliance team’s nightmare.
The 64.37 percent figure on the Vals AI Finance Agent benchmark deserves particular scrutiny. It is a real technical achievement and currently leads the industry. But in a production environment processing ten thousand documents per day, the corresponding error rate of nearly 36 percent translates to roughly 3,600 errors requiring human detection every single day. Even at 90 percent accuracy, the same volume produces a thousand errors daily, each of which must be caught by a reviewer. If humans are reviewing every output anyway, what exactly has been automated?
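The arithmetic behind those figures is worth making explicit. The document volume below is an illustrative assumption, not a statistic from any institution, and the benchmark score is simply being read as a rough proxy for task level accuracy.

```python
# Back-of-envelope error volume at a given accuracy level.
# The daily document volume is an illustrative assumption.
daily_documents = 10_000

for accuracy in (0.6437, 0.90, 0.99):
    expected_errors = daily_documents * (1 - accuracy)
    print(f"accuracy {accuracy:.2%}: ~{expected_errors:,.0f} erroneous outputs per day")

# At 64.37 percent accuracy: roughly 3,563 per day; at 90 percent: 1,000;
# even at 99 percent: 100, each of which a reviewer must catch before it reaches a client.
```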
Problem Two: Data Leakage in a World of Client Confidentiality
Financial institutions sit on some of the most sensitive data in the global economy: client portfolios, M&A mandates, credit assessments, personal financial information, regulatory submissions. The legal and ethical obligations around this data are extensive, complex, and jurisdiction specific.
AI agents need data to function. They need context, specifically documents, transaction records, client files, and historical data, to generate useful outputs. The more context they have, the better the output. This creates a fundamental tension with data governance.
The risk of data leakage manifests in three distinct ways.
The first is training data contamination. When an institution fine tunes a model on its own data, a common approach to improving domain performance, there is a risk that client specific information becomes encoded in the model weights. That data can then surface in outputs generated for different users, in ways that are difficult to detect and nearly impossible to audit after the fact [4].
The second is cross context contamination. In agentic systems where a model handles multiple tasks across a session, information from one task can influence outputs in another. An agent that processed a confidential M&A document earlier in a session may inadvertently surface details from that document when answering an apparently unrelated query. The model is not malicious. It is doing what language models do, which is using all available context to generate a response.
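One common mitigation, sketched below with hypothetical names, is to scope context per task rather than per session, so that documents loaded for one engagement are never assembled into the prompt for another. Real agent frameworks expose this differently, and isolation ultimately has to be enforced at the retrieval and infrastructure layers, not only in application code.

```python
from dataclasses import dataclass, field

@dataclass
class TaskContext:
    """Context for a single task: documents never leak across instances."""
    task_id: str
    client_id: str
    documents: list = field(default_factory=list)

    def add_document(self, doc_id, text):
        self.documents.append({"id": doc_id, "text": text})

    def build_prompt(self, question):
        """Assemble the prompt only from documents scoped to this task."""
        context = "\n\n".join(d["text"] for d in self.documents)
        return f"Context:\n{context}\n\nQuestion: {question}"

# Each task gets its own context object; the confidential documents loaded
# for task_a are structurally unavailable when task_b builds its prompt.
task_a = TaskContext(task_id="ma-review-017", client_id="client-a")
task_a.add_document("teaser-01", "Confidential: Project Falcon acquisition teaser ...")

task_b = TaskContext(task_id="sector-note-042", client_id="client-b")
prompt_b = task_b.build_prompt("Summarise recent trends in regional bank funding costs.")
assert "Project Falcon" not in prompt_b  # isolation holds by construction
```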
The third is third party model risk, and this is where the regulatory exposure is sharpest, particularly for institutions operating in Singapore.
Here is the governance problem that few AI deployment conversations are addressing directly. Under MAS’s revised Outsourcing Guidelines, which took effect in December 2024, routing client data through a third party AI provider’s infrastructure may no longer be viewed as a simple software procurement decision. Depending on the sensitivity of the data, the criticality of the workflow, and the institution’s reliance on the provider, such an arrangement may need to be assessed through the lens of outsourcing governance, including due diligence, contractual security controls, supply chain oversight, and outsourcing register requirements [15].
Some institutions may be tempted to classify these arrangements as tool subscriptions rather than outsourcing relevant services. That classification may become difficult to defend if the AI agent handles sensitive client data, supports critical workflows, or creates operational dependency.
MAS conducted a thematic review in mid 2024, covering banks’ AI and generative AI model risk management practices. The resulting information paper, published in December 2024, sets out good practices observed during the review across areas such as AI governance and oversight, AI identification and inventory, risk materiality assessment, model development, validation, deployment, monitoring, third party AI solutions, and additional considerations for generative AI [13]. These themes are directly relevant to financial AI agents because agentic systems often combine model output, enterprise data access, tool use, third party infrastructure, and human review within a single workflow.
MAS’s November 2025 consultation paper on AI Risk Management Guidelines goes further, proposing clearer supervisory expectations for the use of third party AI providers. Under the proposals, institutions would be required to assess the transparency of external AI vendors on data handling, model and cybersecurity risks, and to implement compensatory controls where that transparency is not available [14]. Where a vendor cannot adequately explain how client data is isolated, retained, or protected across jurisdictions, the institution bears the residual risk.
The jurisdictional surface area of a typical AI agent deployment illustrates why this matters analytically. Consider a Singapore bank routing client data to a US based AI provider for document analysis. That single workflow simultaneously intersects MAS FEAT principles, Singapore’s PDPA, MAS Outsourcing Guidelines, MAS AI Risk Management Guidelines, GDPR if any EU client data is involved, and potentially SEC requirements if the institution has US operations [5][6]. Each framework carries different breach notification obligations, different data residency expectations, and different standards for what constitutes adequate vendor oversight. Regulators assess outcomes, not intentions. Contractual provisions with a vendor do not extinguish the institution’s own regulatory obligations.
This is not an argument for paralysis. It is an argument for treating AI agent data governance as a material regulatory obligation from day one, not as a technical configuration problem to be resolved after deployment.
Problem Three: The Illusion of Auditability
Every financial AI agent being marketed to regulated institutions today comes with some version of an audit trail, namely full logs of tool calls, decision paths, retrieved documents, and generated outputs. This is presented as the answer to the regulatory explainability requirement, and in a technical sense it is not wrong.
But there is a meaningful difference between logging what an agent did and being able to explain why it decided what it decided.
A traditional model risk management framework, such as the Federal Reserve’s SR 11-7 in the United States, recently revised as SR 26-2 to expressly address AI and machine learning models [7], or equivalent guidance in Singapore, the UK, and Europe, requires model documentation, conceptual soundness assessment, outcome analysis, and ongoing monitoring. It requires that a human being, or panel of humans, can look at a model and understand its decision boundaries, its failure modes, and its behaviour under stress.
Large language models do not currently satisfy this framework. Not because they lack documentation, but because their decision processes are not interpretable in the way the framework assumes. An audit log that records “the model retrieved these three documents and generated this output” does not explain why those documents were weighted as they were, why alternative interpretations were rejected, or how the output would change if the input documents contained subtly different information.
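The gap becomes obvious if you write down what a typical agent audit record can actually contain. The schema below is a hypothetical illustration, not any vendor's logging format; the point is the fields that are missing, not the ones that are present.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(task_id, tool_calls, retrieved_doc_ids, output_text):
    """What an agent platform can typically log: actions taken and artefacts produced."""
    return {
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool_calls": tool_calls,                  # what the agent did
        "retrieved_documents": retrieved_doc_ids,  # what it looked at
        "output_sha256": hashlib.sha256(output_text.encode()).hexdigest(),
        # What no log currently records:
        # - why these documents were weighted over the alternatives
        # - which interpretations were considered and rejected
        # - how the output would change under subtly different inputs
    }

record = audit_record(
    task_id="credit-memo-112",
    tool_calls=["search_filings", "extract_financials", "draft_memo"],
    retrieved_doc_ids=["10-K-2025", "rating-action-0314", "covenant-summary"],
    output_text="Draft credit memo ...",
)
print(json.dumps(record, indent=2))
```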
Consider a model risk reviewer at a tier one bank, asked to validate an agent that drafts credit memos. She can read the audit trail. She can verify that cited figures match source documents. What she cannot do is reproduce the agent’s reasoning, articulate its failure modes under adversarial inputs, or specify with confidence the conditions under which it would generate a confidently wrong number. Under SR 11-7, that is not validation. It is observation.
The other side of the auditability problem is automation bias. Research on human oversight of automated systems consistently shows that when humans expect a system to be reliable, they review its outputs less critically [8]. The more capable the agent appears, the more dangerous this dynamic becomes. A junior analyst who would have caught a model error by working through the calculation herself will, under time pressure, approve an agent generated output she has not fully interrogated. This is not a failure of character. It is a predictable consequence of how humans interact with apparently authoritative automated systems.
The Regulatory Architecture Is Not Ready Either
It is not only the technology that needs to mature. The regulatory framework for AI agents in financial services is in early development across most jurisdictions.
The EU AI Act provides the most comprehensive framework currently in force [9], classifying AI systems used in credit scoring and certain financial decisions as high risk and imposing corresponding requirements around transparency, human oversight, and data governance. Implementation guidance is still emerging, and the gap between the regulation’s intent and enforceable operational requirements remains considerable.
The Monetary Authority of Singapore has issued the FEAT principles and the Veritas methodology for responsible AI in financial services [6], and has followed with both an information paper on banks’ AI model risk management practices in December 2024 and a consultation paper on AI Risk Management Guidelines in November 2025. If finalised, these guidelines are expected to shape supervisory expectations for financial institutions in Singapore. The Hong Kong Monetary Authority has issued more prescriptive circulars [10]. Japan’s FSA continues to evolve its approach [11]. These do not yet form a coherent whole for institutions operating across APAC.
In the United States, the picture is more fragmented. SR 11-7 was revised in April 2026 as SR 26-2 to address AI and machine learning models, a meaningful step forward, though operational implementation guidance for generative agent systems specifically remains thin. The NIST AI Risk Management Framework offers a useful supplement [12], but it is voluntary. The SEC and OCC have issued statements on AI risk without yet promulgating comprehensive binding rules. Institutions making deployment decisions today are, in many cases, making regulatory bets on what the framework will look like in three years.
This regulatory uncertainty is not an argument against deployment. It is an argument for deployment that is conservative, well governed, and prepared for the examination that will eventually come.
What Responsible Adoption Actually Looks Like
The efficiency case is real, the technology is advancing rapidly, and institutions that refuse to engage will be competitively disadvantaged. The question is not whether to adopt, but how. Responsible adoption has several non negotiable characteristics.
1. Domain isolation comes first. Agents should operate on defined, bounded data sets with clear provenance. The temptation to grant broad access in search of better context should be resisted until governance controls are demonstrably robust.
2. Human oversight must be a design principle, not a checkbox. Who reviews agent outputs? What are they reviewing for? How is reviewer fatigue managed? What escalation path exists when a reviewer is uncertain? These questions need answers before deployment, not after the first material error.
3. Adversarial hallucination testing is essential. Standard accuracy testing on representative samples is insufficient. Institutions must test for the failure modes that matter most in their domain, including factual errors that look plausible, ambiguity resolution that goes the wrong way, and confident generation from insufficient evidence. A minimal sketch of what such a test case could look like follows this list.
4. Model risk frameworks need adaptation. The April 2026 revision of SR 11-7 as SR 26-2 expressly extends model risk guidance to AI and machine learning, but it remains a high level framework. Institutions must operationalise it with controls that address the specific characteristics of large language models, namely their probabilistic nature, context dependence, and opacity.
5. Vendor due diligence must be genuinely rigorous. Contractual provisions are necessary but not sufficient. Institutions need to understand their AI provider’s data handling architecture, security practices, and regulatory posture in every jurisdiction where client data is processed. Under MAS’s revised Outsourcing Guidelines, where an AI provider arrangement falls within outsourcing governance expectations, this due diligence should be formally documented and periodically reviewed.
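To make the testing point in item 3 concrete, here is a minimal sketch of one way an adversarial test case might be structured. The case format, field names, and the stub agent are all assumptions for illustration; a production harness would draw cases from the institution's own incident history and run them against the deployed agent.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AdversarialCase:
    """A test case built around a known failure mode, not a representative sample."""
    name: str
    prompt: str
    source_documents: List[str]
    conflicting_values: List[str]  # figures that disagree across sources
    flag_words: tuple = ("discrepancy", "conflict", "inconsistent")

def case_failed(case: AdversarialCase, output: str) -> bool:
    """Fail if the output asserts a conflicting figure without flagging the conflict."""
    asserts_figure = any(v in output for v in case.conflicting_values)
    flags_conflict = any(w in output.lower() for w in case.flag_words)
    return asserts_figure and not flags_conflict

def run_suite(run_agent: Callable[[str, List[str]], str], cases: List[AdversarialCase]) -> List[str]:
    """Return the names of failed cases. A sketch, not a production harness."""
    return [c.name for c in cases if case_failed(c, run_agent(c.prompt, c.source_documents))]

cases = [
    AdversarialCase(
        name="ambiguous-revenue-figure",
        prompt="Summarise Q3 revenue performance.",
        source_documents=[
            "Prepared remarks: revenue of $4.21 billion.",
            "Q&A: revenue came in at roughly $4.3 billion.",
        ],
        conflicting_values=["$4.21 billion", "$4.3 billion"],
    ),
]

# Stub standing in for the deployed agent under test.
def stub_agent(prompt: str, documents: List[str]) -> str:
    return "Q3 revenue was $4.21 billion, up strongly year on year."

print(run_suite(stub_agent, cases))  # ['ambiguous-revenue-figure']
```

The value of a suite like this is not the pass rate on any given day. It is that the failure modes an institution cares about are written down, versioned, and re-run every time the model or its configuration changes.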
The Longer View
The financial AI agent market is at the peak of its hype cycle. The announcements are impressive, the partnerships are credible, and the pilot results are carefully curated for maximum impact. This is how enterprise technology markets work.
What follows the hype cycle is typically a period of reckoning. A significant production failure, a regulatory enforcement action, or a series of quieter disappointments that accumulate into institutional caution. This reckoning is not inevitable, but it is the typical trajectory when powerful technology is adopted faster than the governance frameworks that should surround it.
The institutions that navigate this well will move deliberately, not slowly, but with rigour about what their agents are doing, why they are doing it, and what failure would look like. They will treat human oversight as a design constraint rather than a compliance formality. They will be the ones that can sit in front of a regulator and explain, not merely document.
The technology will improve. Hallucination rates will fall. Data governance architectures will mature. Regulatory frameworks will catch up. The agents being released today are the beginning of something significant, not the end state.
But we are not there yet. The gap between where we are and where we need to be is not primarily a technical gap. It is a governance gap, specifically a gap between the sophistication of what these systems can do and the sophistication of how we understand, control, and take accountability for what they do.
Closing that gap is the most important work in AI today. It is also, not coincidentally, the work the financial sector is best positioned to demand. It has the regulatory pressure, the risk culture, and the institutional memory of what happens when powerful systems are deployed without adequate controls.
The agents are ready to work. The question is whether the governance is ready to let them.
Disclosure: Views expressed are personal and for informational purposes only. They do not represent any employer or organisation and should not be treated as legal, financial, or professional advice.
References
[1] Anthropic. Agents for financial services. 5 May 2026. https://www.anthropic.com/news/finance-agents
[2] Vals AI. Finance Agent benchmark leaderboard. https://www.vals.ai/benchmarks/finance_agent
[3] Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the Dangers of Stochastic Parrots. FAccT 2021. https://dl.acm.org/doi/10.1145/3442188.3445922
[4] Carlini, N. et al. (2021). Extracting Training Data from Large Language Models. USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting
[5] Regulation (EU) 2016/679, General Data Protection Regulation. https://eur-lex.europa.eu/eli/reg/2016/679/oj
[6] Monetary Authority of Singapore. FEAT Principles and Veritas Initiative. https://www.mas.gov.sg/publications/monographs-or-information-paper/2018/feat
[7] Federal Reserve and OCC. SR 26-2 and OCC Bulletin 2026-13. https://www.federalreserve.gov/supervisionreg/srletters/SR2602.pdf
[8] Parasuraman, R. and Manzey, D. H. (2010). Complacency and Bias in Human Use of Automation. Human Factors, 52(3), 381-410. https://journals.sagepub.com/doi/10.1177/0018720810376055
[9] Regulation (EU) 2024/1689, Artificial Intelligence Act. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
[10] Hong Kong Monetary Authority. Use of Artificial Intelligence in the Banking Industry. https://www.hkma.gov.hk/eng/regulatory-resources/regulatory-guides/
[11] Financial Services Agency of Japan. Discussion Paper on the Use of AI in the Financial Sector. https://www.fsa.go.jp/en/
[12] National Institute of Standards and Technology. AI Risk Management Framework, AI RMF 1.0. https://www.nist.gov/itl/ai-risk-management
[13] Monetary Authority of Singapore. AI Model Risk Management in Banks. December 2024. https://www.mas.gov.sg/publications/monographs-or-information-paper/2024/artificial-intelligence-model-risk-management
[14] Monetary Authority of Singapore. Consultation Paper on Guidelines on AI Risk Management. November 2025. https://www.mas.gov.sg/publications/consultations/2025/consultation-paper-on-guidelines-on-artificial-intelligence-risk-management
[15] Monetary Authority of Singapore. Guidelines on Outsourcing for Banks. Effective 11 December 2024. https://www.mas.gov.sg/regulation/guidelines/guidelines-on-outsourcing-banks