How financial services teams evaluate LLMs and SLMs: a practical framework
Punit Shah
A working framework from production, not a benchmark leaderboard.
Disclosure: This post was developed with AI assistance for structure and editing. The framework, examples, and judgment are mine.
A few months ago I sat in a workshop where a team had spent 3 weeks evaluating models for a single use case and still couldn’t say which one they’d ship. They had spreadsheets of MMLU scores, GPQA results, latency micro-benchmarks. They had no answer.
The problem wasn’t the work. The problem was the question they’d started with: which LLM is best?
By the time most teams in regulated industries get to model selection, the question has already been framed wrong. The frame that actually works is different: which model is best for this specific use case, given our constraints?
That sounds obvious. It isn’t, in practice. I see teams spend months stuck in evaluation theater, running benchmarks that don’t reflect their actual workload, or picking the model with the highest marketing energy and patching around its weaknesses later.
This post is the framework we use to make model selection decisions for our clients. It’s the same one behind an enterprise LLM gateway I helped build — currently serving 1000+ users and processing 5M+ requests across multiple production deployments. The framework is a decision process, not a benchmark. It works for both LLMs and SLMs, and it’s deliberately simple, because complexity in the framework just hides the real tradeoffs.
The four dimensions that actually matter
Every model selection decision we make goes through four questions, in order:
- Capability — can the model do the task well enough?
- Cost — does the economics work at our expected volume?
- Latency — does it meet the user-experience requirements?
- Use-case fit — does it match the deployment constraints (on-prem, regulated, fine-tunable)?
These aren’t ranked by importance — they’re ranked by order of evaluation. You disqualify models on capability first, then narrow on cost, then check latency, then verify fit. By the time you get to the last step, you usually have one or two candidates left.
Let me walk through each one.
1. Capability: can it do the task?
Most teams start here and finish here. That’s a mistake: capability is a filter, not a decision.
The trap is using general benchmarks (MMLU, MT-Bench, etc.) to evaluate models for specific tasks. A model can score 90% on MMLU and still be terrible at extracting structured data from KYC documents. The benchmarks are signals about general intelligence, not your use case.
What we do instead:
- Build a small, representative test set from the actual use case. For Credit Memo automation, that means real (anonymized) credit memos with known correct outputs. For KYC risk classification, it’s representative customer profiles. The set has to be small enough to run quickly and large enough to differentiate models.
- Run candidate models against this set. Not five or ten models — three to five. Pick from across the size spectrum: one frontier model, one strong mid-tier, and one or two SLMs.
- Score on task-specific metrics, not general ones. For extraction tasks: precision, recall, structured-output validity. For summarization: faithfulness to source, completeness against a reference. For classification: accuracy plus a confusion matrix. (A minimal scoring sketch follows this list.)
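For an extraction use case, the scoring harness doesn’t need to be elaborate. Here’s a minimal sketch of field-level precision, recall, and structured-output validity, assuming predictions and reference answers arrive as JSON strings keyed by field name (that shape is a placeholder, not a description of any real pipeline).

```python
import json

def score_extraction(predictions, references):
    """Score a structured-extraction task: JSON validity, field precision, recall.

    predictions: model outputs as JSON strings, one per test document.
    references:  known-correct outputs as JSON strings, same order (placeholder format).
    """
    valid = tp = fp = fn = 0
    for pred_str, ref_str in zip(predictions, references):
        ref = json.loads(ref_str)              # references are known-good
        try:
            pred = json.loads(pred_str)
            valid += 1
        except json.JSONDecodeError:
            fn += len(ref)                     # invalid output misses every field
            continue
        for field, value in pred.items():
            if field in ref and ref[field] == value:
                tp += 1                        # right field, right value
            else:
                fp += 1                        # extra field or wrong value
        fn += sum(1 for f in ref if f not in pred or pred[f] != ref[f])
    return {
        "json_validity": valid / len(references),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Run every candidate through the same function on the same test set, and the capability bar becomes a number you can defend rather than an impression.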
❝ A model can score 90% on MMLU and still be terrible at extracting structured data from KYC documents. ❞
Anything below the capability bar gets cut. Everything above it moves to the next dimension. The bar is set by what the use case actually needs — not by what’s “best.”
A common surprise: SLMs often clear the capability bar for narrow tasks. Most people assume they don’t and skip them. That’s where the next dimension matters.
2. Cost: does the economics work?
This is where most “best model wins” thinking falls apart.
If your use case is going to handle 1,000 requests a day, the cost difference between a frontier model and a mid-tier one is negligible. Pick the better model.
If your use case is going to handle 1 million requests a day, the cost difference is the entire economic argument for the project. The difference between $30/1K requests and $5/1K requests is the difference between a deployable product and an unusable one.
We model cost across three dimensions:
- Per-request cost at expected token volumes (input + output)
- Annual cost at projected usage (this is where the number gets uncomfortable)
- Hidden costs: fine-tuning if needed, dedicated capacity reservations, rate-limit buffers
A pattern I’ve seen repeatedly: a frontier model produces slightly better outputs but costs 10–12x more per request than a strong mid-tier alternative. At low volumes, the gap is invisible. At high volumes, that 10–12x becomes the entire P&L of the project. The mid-tier model usually requires 5–10% more downstream review time on a fraction of cases — and that trade is almost always worth it.
The lesson: capability gaps that look small in benchmarks become economically irrelevant or economically critical depending on volume. You can’t decide which until you do the math.
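The math itself is a few lines. The sketch below uses made-up per-1K-token prices and token counts (placeholders, not quotes for any real provider) to show how a roughly 10x per-request gap compounds at a million requests a day.

```python
def annual_cost(requests_per_day, input_tokens, output_tokens,
                price_in_per_1k, price_out_per_1k):
    """Project per-request and annual spend for one model at expected volume.

    Prices are per 1K tokens; token counts are averages per request.
    All numbers passed in below are illustrative placeholders.
    """
    per_request = (input_tokens / 1000) * price_in_per_1k \
                + (output_tokens / 1000) * price_out_per_1k
    return per_request, per_request * requests_per_day * 365

# Illustrative workload: 1M requests/day, 2K input + 500 output tokens each.
frontier = annual_cost(1_000_000, 2000, 500, 0.010, 0.030)
mid_tier = annual_cost(1_000_000, 2000, 500, 0.001, 0.002)
print(f"frontier: ${frontier[0]:.4f}/req -> ${frontier[1]:,.0f}/yr")
print(f"mid-tier: ${mid_tier[0]:.4f}/req -> ${mid_tier[1]:,.0f}/yr")
```

At 1,000 requests a day the same gap is a rounding error; at a million it is the difference between a seven-figure and an eight-figure annual bill.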
3. Latency: does it meet UX requirements?
This is the dimension most teams underweight, and it’s the one users notice fastest.
Three patterns:
- Synchronous user-facing — chatbots, live assistants, real-time copilots. Hard ceiling around 2–3 seconds for first token, ideally under one second. Frontier models with high capability often fail here.
- Asynchronous user-facing — overnight reports, queued document processing. Latency is irrelevant. Pick the best capability you can afford.
- Pipeline / agent-driven — LLM is one of many calls in a chain. Per-call latency multiplies. A 5-second model in a 6-step agent flow is a 30-second user wait.
❝ P95 latency is what determines user experience, not the median. ❞
The latency profile of the model matters more than the average. P95 latency (the slowest 5% of responses) is what determines user experience, not the median. A model that averages 800ms but spikes to 12 seconds 5% of the time will feel broken.
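A quick way to see why, using synthetic numbers rather than measurements from any particular model: a distribution that looks fine on the median can still have a p95 that users experience as broken.

```python
import random
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the value pct% of requests come in under."""
    ordered = sorted(samples)
    return ordered[max(0, int(round(pct / 100 * len(ordered))) - 1)]

# Synthetic latency profile: ~800ms typical, but ~6% of calls spike to 8-12s.
random.seed(0)
latencies = [random.gauss(0.8, 0.1) for _ in range(940)] + \
            [random.uniform(8.0, 12.0) for _ in range(60)]
print(f"mean: {statistics.mean(latencies):.2f}s")   # already pulled up by the tail
print(f"p50:  {percentile(latencies, 50):.2f}s")    # the 'typical' request
print(f"p95:  {percentile(latencies, 95):.2f}s")    # what the slowest sessions feel
```

The median says the model is fast; the p95 says a meaningful slice of every session will stall for seconds.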
For SLMs deployed on-prem, latency becomes a hardware question, not a model question. The same model can be 200ms or 4 seconds depending on the inference setup.
4. Use-case fit: deployment constraints
This is where financial services and other regulated environments diverge from the general AI playbook.
Questions we ask:
- Data residency. Does the use case process data that can’t leave a specific jurisdiction? This eliminates most hosted-only frontier models or forces a private deployment.
- On-prem requirements. For some clients, certain data can’t even leave their own infrastructure. This usually forces an SLM or open-weight model that can run locally.
- Fine-tuning requirements. Does the use case need a model fine-tuned on proprietary data? Most frontier models don’t allow this; most open-weight SLMs do.
- Audit and explainability. Can the model’s outputs be traced and justified to regulators? This is partly a model question (smaller, more interpretable models are easier to defend) and partly a tooling question (logging, evals, gateway-level guardrails).
- Vendor concentration risk. Is the client comfortable with single-provider dependency? Many aren’t, which forces multi-model architectures or open-weight choices.
This dimension is where SLMs often win when the first three rounds suggested an LLM. A capable enough SLM that runs entirely on the bank’s own infrastructure can beat a “better” frontier model on the only dimension that matters at the deployment review.
How the framework runs in practice
Putting it together: for each new use case, we run the four dimensions as a funnel.
- Build a representative test set. Define the capability bar. Filter candidates.
- Model the economics at projected volume. Filter again.
- Test latency under realistic load. Filter again.
- Layer in the deployment constraints. By this point, we usually have one or two candidates left.
- Run a final A/B in shadow mode (the model’s outputs run in parallel with the existing process, but don’t drive decisions) before promoting to production.
The whole cycle takes between two and six weeks depending on use-case complexity. That’s faster than it sounds, because most of the time is data preparation for the test set — once that exists, scoring multiple models is quick.
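For what it’s worth, the funnel is simple enough to write down. The sketch below is a toy encoding of the four filters, not the gateway’s actual selection logic; every field and threshold is a placeholder you’d replace with your own test-set scores, volume projections, and deployment constraints.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    task_score: float        # from the use-case test set (step 1)
    annual_cost_usd: float   # projected at real volume (step 2)
    p95_latency_s: float     # measured under realistic load (step 3)
    runs_on_prem: bool       # deployment constraints (step 4)

@dataclass
class Requirements:
    min_task_score: float
    max_annual_cost_usd: float
    max_p95_latency_s: float
    needs_on_prem: bool

def run_funnel(candidates, req):
    """Apply the four dimensions in order; survivors go to shadow mode."""
    stages = [
        ("capability", lambda c: c.task_score >= req.min_task_score),
        ("cost",       lambda c: c.annual_cost_usd <= req.max_annual_cost_usd),
        ("latency",    lambda c: c.p95_latency_s <= req.max_p95_latency_s),
        ("fit",        lambda c: c.runs_on_prem or not req.needs_on_prem),
    ]
    survivors = list(candidates)
    for stage, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        print(f"after {stage}: {[c.name for c in survivors]}")
    return survivors
```

The value isn’t the code, it’s the ordering: a model that would have won on capability alone gets eliminated on cost or fit before anyone has fallen in love with it.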
What changes when you do this
A few patterns I’ve seen play out across multiple client engagements:
❝ The “best” model loses about half the time. SLMs are radically underused. ❞
- The “best” model loses about half the time. When you weight cost and latency honestly, frontier models lose to mid-tier ones often enough that the decision should never be assumed.
- SLMs are radically underused. For narrow extraction, classification, and summarization tasks in regulated industries, an 8B parameter SLM running on-prem often beats the 70B+ alternatives once you account for cost, latency, and deployment fit.
- The framework forces the conversation that matters. “Which model?” is a technical question. “What does this use case actually need, and what are we willing to trade?” is the business question. The four-dimension funnel forces the second question to surface.
What this framework doesn’t do
A few honest caveats:
- It doesn’t tell you when not to use an LLM at all. Sometimes the right answer is rules, classical ML, or a deterministic system. The framework assumes you’ve already decided an LLM is appropriate — that’s a separate conversation.
- It doesn’t address agentic workflows, where multiple model calls compound. The framework still applies, but the cost and latency math gets more complex. Worth a separate post.
- It assumes you have a way to actually deploy and govern these models in production. If you don’t have an LLM gateway, evaluation framework, and observability stack, model selection is the least of your problems.
The simpler version
If you’re a senior leader asked to make this decision and don’t have time for the framework: the right model for your use case is rarely the most-marketed one. Build a small representative test set, score three to five candidates honestly, do the math at your real volume, and pick the smallest model that does the job. That heuristic will get you 80% of the way there.
The rest of this framework just makes the 80% more defensible to the people downstream — risk, compliance, audit, finance, and the engineers who’ll have to live with the choice.
I lead AI and product work at Synechron, focused on how regulated enterprises adopt GenAI without breaking themselves. If you’re working through similar decisions in financial services, I’d be interested to hear how your evaluation process differs from this one.