How financial services evaluate LLMs and SLMs in a practical framework

By Punit Shah · Published April 28, 2026 · Fintech Tag

A working framework from production, not a benchmark leaderboard.

Disclosure: This post was developed with AI assistance for structure and editing. The framework, examples, and judgment are mine.

A few months ago I sat in a workshop where a team had spent three weeks evaluating models for a single use case and still couldn’t say which one they’d ship. They had spreadsheets of MMLU scores, GPQA results, and latency micro-benchmarks. They had no answer.

The problem wasn’t the work. The problem was the question they’d started with: which LLM is best?

By the time most teams in regulated industries get to model selection, the question has already been framed wrong. The frame that actually works is different: which model is best for this specific use case, given our constraints?

That sounds obvious. It isn’t, in practice. I see teams spend months stuck in evaluation theater: running benchmarks that don’t reflect their actual workload, or picking the model with the highest marketing energy and patching around its weaknesses later.

This post is the framework we use to make model selection decisions for our clients. It’s the same one behind an enterprise LLM gateway I helped build — currently serving 1000+ users and processing 5M+ requests across multiple production deployments. The framework is a decision process, not a benchmark. It works for both LLMs and SLMs, and it’s deliberately simple, because complexity in the framework just hides the real tradeoffs.

The four dimensions that actually matter

Every model selection decision we make goes through four questions, in order:

  1. Capability — can the model do the task well enough?
  2. Cost — does the economics work at our expected volume?
  3. Latency — does it meet the user-experience requirements?
  4. Use-case fit — does it match the deployment constraints (on-prem, regulated, fine-tunable)?

These aren’t ranked by importance — they’re ranked by order of evaluation. You disqualify models on capability first, then narrow on cost, then check latency, then verify fit. By the time you get to the last step, you usually have one or two candidates left.
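Because each dimension is a filter, the whole framework is just an ordered pipeline over the candidate list. Here is a minimal sketch; the four check functions are placeholders that the sections below flesh out, and all names are illustrative, not a prescribed implementation:

    # The framework as a funnel: each dimension filters the candidate
    # list before the next one runs. `filters` is the ordered list of
    # the four checks (capability, cost, latency, fit), each a function
    # (model, use_case) -> bool.

    def select_model(candidates, use_case, filters):
        for passes in filters:
            candidates = [m for m in candidates if passes(m, use_case)]
        return candidates  # by the last filter, usually one or two left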

Let me walk through each one.

1. Capability: can it do the task?

Most teams start here and finish here. That’s a mistake: capability is a filter, not a decision.

The trap is using general benchmarks (MMLU, MT-Bench, etc.) to evaluate models for specific tasks. A model can score 90% on MMLU and still be terrible at extracting structured data from KYC documents. The benchmarks are signals about general intelligence, not your use case.

What we do instead: build a small, representative test set from real task data, score each candidate against it, and set a capability bar for the task.

Anything below the capability bar gets cut. Everything above it moves to the next dimension. The bar is set by what the use case actually needs — not by what’s “best.”
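In code terms, this step can be as simple as the sketch below. The scoring harness and the bar value are hypothetical placeholders (0.92 is an arbitrary example, not a recommendation); the point is the shape of the step.

    # Hypothetical capability filter: score each candidate on a
    # task-specific test set and keep only the models above the bar.
    # score_on_test_set is your own evaluation harness, e.g.
    # exact-match scoring on KYC extractions.

    CAPABILITY_BAR = 0.92  # set by what the use case needs, not by "best"

    def filter_by_capability(candidates, test_set, score_on_test_set):
        survivors = []
        for model in candidates:
            score = score_on_test_set(model, test_set)  # fraction correct, 0..1
            if score >= CAPABILITY_BAR:
                survivors.append((model, score))
        return survivors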

A common surprise: SLMs often clear the capability bar for narrow tasks. Most people assume they don’t and skip them. That’s where the next dimension matters.

2. Cost: does the economics work?

This is where most “best model wins” thinking falls apart.

[Figure: cost comparison; illustrative figures based on public 2026 pricing]

If your use case is going to handle 1,000 requests a day, the cost difference between a frontier model and a mid-tier one is negligible. Pick the better model.

If your use case is going to handle 1 million requests a day, the cost difference is the entire economic argument for the project. The difference between $30/1K requests and $5/1K requests is the difference between a deployable product and an unusable one.

We model cost across three dimensions: the per-request price, the projected volume, and the downstream human-review cost of imperfect outputs.

A pattern I’ve seen repeatedly: a frontier model produces slightly better outputs but costs 10–12x more per request than a strong mid-tier alternative. At low volumes, the gap is invisible. At high volumes, that 10–12x becomes the entire P&L of the project. The mid-tier model usually requires 5–10% more downstream review time on a fraction of cases — and that trade is almost always worth it.

The lesson: capability gaps that look small in benchmarks become economically irrelevant or economically critical depending on volume. You can’t decide which until you do the math.
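Here is what that math can look like, as a minimal sketch. Every number below is made up for illustration; plug in your own prices, volumes, and review costs.

    # Illustrative cost model: per-request price times projected volume,
    # plus the downstream human-review cost of imperfect outputs.

    def daily_cost(price_per_1k, requests_per_day,
                   review_rate, review_cost_per_case):
        model_cost = price_per_1k * requests_per_day / 1000
        review_cost = review_rate * requests_per_day * review_cost_per_case
        return model_cost + review_cost

    # Frontier model: $30/1K requests, 2% of outputs need human review.
    # Mid-tier model: $5/1K requests, 7% need review. Review: $0.25/case.
    for volume in (1_000, 1_000_000):
        frontier = daily_cost(30, volume, 0.02, 0.25)
        mid_tier = daily_cost(5, volume, 0.07, 0.25)
        print(f"{volume:,} req/day: frontier ${frontier:,.2f}, "
              f"mid-tier ${mid_tier:,.2f}")

With these made-up numbers, the daily gap at 1,000 requests is pocket change; at a million requests it compounds to a seven-figure annual difference.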

3. Latency: does it meet UX requirements?

This is the dimension most teams underweight, and it’s the one users notice fastest.

Two patterns come up repeatedly:

The latency profile of the model matters more than the average. P95 latency (the slowest 5% of responses) is what determines user experience, not the median. A model that averages 800ms but spikes to 12 seconds 5% of the time will feel broken.

For SLMs deployed on-prem, latency becomes a hardware question, not a model question. The same model can be 200ms or 4 seconds depending on the inference setup.
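A minimal check, assuming you have logged response times from a load test that mirrors real traffic (the 2-second budget is an example, not a standard):

    # Judge the latency profile, not the average: a good median can
    # hide a P95 tail that users experience as "broken".

    def percentile(samples, p):
        # nearest-rank approximation; fine for a quick pass/fail check
        ordered = sorted(samples)
        index = round(p / 100 * (len(ordered) - 1))
        return ordered[index]

    def meets_latency_budget(latencies_ms, p95_budget_ms=2000):
        p50 = percentile(latencies_ms, 50)
        p95 = percentile(latencies_ms, 95)
        print(f"p50={p50}ms, p95={p95}ms (budget {p95_budget_ms}ms)")
        return p95 <= p95_budget_ms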

4. Use-case fit: deployment constraints

This is where financial services and other regulated environments diverge from the general AI playbook.

Questions we ask: Can the model run on infrastructure the firm controls? Does any data leave the regulated boundary? Can we fine-tune it if the task demands it? Will the choice survive model-risk and audit review?

This dimension is where SLMs often win when the first three rounds suggested an LLM. A capable enough SLM that runs entirely on the bank’s own infrastructure can beat a “better” frontier model on the only dimension that matters at the deployment review.
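Unlike the first three dimensions, this one is mostly pass/fail rather than scored. A toy version, with hypothetical constraint fields:

    # Hypothetical deployment-fit check: hard constraints are
    # pass/fail, not weighted. Field names are illustrative.

    REQUIREMENTS = {
        "runs_on_prem": True,          # must run on the bank's own infrastructure
        "data_stays_in_region": True,  # nothing leaves the regulated boundary
        "fine_tunable": False,         # not required for this use case
    }

    def fits_deployment(model_profile, requirements=REQUIREMENTS):
        # a model fits if it satisfies every constraint that is required
        return all(model_profile.get(key, False) or not required
                   for key, required in requirements.items())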

How the framework runs in practice

Putting it together: for each new use case, we run the four dimensions as a funnel.

  1. Build a representative test set. Define the capability bar. Filter candidates.
  2. Model the economics at projected volume. Filter again.
  3. Test latency under realistic load. Filter again.
  4. Layer in the deployment constraints. By this point, we usually have one to two candidates left.
  5. Run a final A/B in shadow mode (the model’s outputs run in parallel with the existing process, but don’t drive decisions) before promoting to production.
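A minimal sketch of step 5, shadow mode, assuming the existing process returns a result you can log next to the model’s output (all names are illustrative, and outputs are assumed JSON-serializable):

    # Shadow mode: the candidate model sees live traffic in parallel
    # with the existing process; its output is logged for comparison
    # but never drives the decision.

    import json, time

    def handle_request(request, existing_process, candidate_model, log_path):
        decision = existing_process(request)   # this is what actually ships
        shadow = candidate_model(request)      # recorded, never acted on
        with open(log_path, "a") as log:
            log.write(json.dumps({
                "ts": time.time(),
                "request_id": request.get("id"),
                "existing": decision,
                "shadow": shadow,
            }) + "\n")
        return decision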

The whole cycle takes between two and six weeks depending on use-case complexity. That’s faster than it sounds, because most of the time is data preparation for the test set — once that exists, scoring multiple models is quick.

What changes when you do this

A few patterns I’ve seen play out across multiple client engagements: the “best” model loses about half the time, and SLMs are radically underused. Once you score candidates against a real test set and run the economics at real volume, the most-marketed model is often not the one that survives the funnel.

What this framework doesn’t do

A few honest caveats:

The simpler version

If you’re a senior leader asked to make this decision and don’t have time for the framework: the right model for your use case is rarely the most-marketed one. Build a small representative test set, score three to five candidates honestly, do the math at your real volume, and pick the smallest model that does the job. That heuristic will get you 80% of the way there.

The rest of this framework just makes the 80% more defensible to the people downstream — risk, compliance, audit, finance, and the engineers who’ll have to live with the choice.

I lead AI and product work at Synechron, focused on how regulated enterprises adopt GenAI without breaking themselves. If you’re working through similar decisions in financial services, I’d be interested to hear how your evaluation process differs from this one.

