
I Own the ML Thresholds. Here’s What That Actually Means.

By Andrés (Andy) Garcia · Published May 5, 2026 · 10 min read · Source: Fintech Tag

Most teams ship ML models. Almost no PM owns what the model decides. This is the difference — and it took me years of expensive lessons to understand it fully.


I need to be honest with you before I start.

I didn’t figure this out because I’m especially smart.

I figured it out because I had no choice.

When you’re accountable for a real-time decision system making irreversible financial calls on every single Zelle transaction at USAA — 30+ signals, sub-second latency, allow or step-up or block, no fallback, no undo — you either develop a governance mindset or you cause harm at scale.

That’s not a humble-brag. That’s the reality of the environment. And I want to be clear: I made mistakes in that environment. Some of my early threshold decisions were wrong. I caught them — but only because the governance architecture forced me to look.

That’s the point.

Governance isn’t about being smart enough to get it right the first time.

It’s about building a system that catches you when you’re wrong — before scale makes wrong irreversible.

That insight took me longer to internalize than I’d like to admit. And I think it’s the insight most product writing about AI completely misses.

So let me share what I actually learned — including the parts that surprised me.

The insight that reframed everything

Here’s what I believed before working in real-time payments:

Good ML product management means defining clear requirements, selecting the right model, and measuring outcomes post-launch.

Here’s what I learned after:

Good ML product management means designing a system of decisions that surrounds the model — because the model will be wrong, and the only question is whether you catch it before or after it causes irreversible harm.

This isn’t a subtle distinction. It’s a completely different operating philosophy.

The first version treats the model as the product. You ship it, you measure it, you iterate on it.

The second version treats the model as one component inside a larger decision system — a system that includes governance cadence, threshold ownership, drift detection, rollback infrastructure, and signal quality auditing.

The model is the engine. The governance system is what makes the engine safe to run at speed.

Most teams build the engine. Almost no one builds the safety system with the same intentionality.

And the reason is subtle: the governance system never appears in a demo. It never generates a feature announcement. It never shows up as a milestone on a roadmap. Its entire value is in preventing things that don’t happen — which makes it invisible to everyone except the people who understand what it’s preventing.

The five things I learned that most product writing doesn’t cover

1. Precision and recall are not model metrics. They are business decisions.

I’ve said this before, but I want to go deeper — because the implications are more radical than they first appear.

When a data scientist defines precision and recall targets, they’re asking: what performance can this model achieve?

When a product manager defines precision and recall targets, they’re asking: what cost is the business willing to absorb for being wrong — and in which direction?

These are fundamentally different questions. The first is bounded by model capability. The second is bounded by business strategy, risk tolerance, and regulatory exposure.
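To make that concrete, here is a minimal sketch, with purely hypothetical dollar figures, of how a team might turn "what cost are we willing to absorb for being wrong, and in which direction" into an operating threshold. The cost constants, function names, and labeled validation set are illustrative assumptions, not anything from the production system.

```python
# Minimal sketch: choosing a block threshold by minimizing asymmetric business cost.
# The dollar figures and array inputs are hypothetical, purely for illustration.

import numpy as np

COST_MISSED_FRAUD = 500.0   # hypothetical cost of an unauthorized transfer the model allows
COST_FALSE_BLOCK = 8.0      # hypothetical cost of blocking or stepping up a legitimate payment

def expected_cost_per_txn(threshold, scores, labels):
    """Average cost per transaction if we block everything scoring at or above `threshold`.
    scores: model fraud scores in [0, 1]; labels: 1 = confirmed fraud, 0 = legitimate."""
    blocked = scores >= threshold
    missed_fraud = np.sum((labels == 1) & ~blocked)
    false_blocks = np.sum((labels == 0) & blocked)
    return (missed_fraud * COST_MISSED_FRAUD + false_blocks * COST_FALSE_BLOCK) / len(scores)

def pick_operating_threshold(scores, labels):
    """Sweep candidate thresholds on labeled validation data and keep the cheapest one."""
    candidates = np.linspace(0.01, 0.99, 99)
    costs = [expected_cost_per_txn(t, scores, labels) for t in candidates]
    return candidates[int(np.argmin(costs))]
```

The operating point falls out of the cost structure, not out of the model's accuracy curve. Change the asymmetry and the threshold moves, even though the model itself has not changed at all.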

Here’s the part that surprised me: the business answer often constrains the model choice more than the technical answer does.

At USAA, the asymmetry of consequences meant the acceptable precision/recall operating range was quite narrow. An unauthorized Zelle transfer is categorically worse than a blocked legitimate transaction, but blocked legitimate transactions at scale are also far worse than most policy documents acknowledge. That range was narrower than the model could achieve through training optimization alone.

That narrowness forced specific architectural choices: composite scoring instead of single-signal blocking, progressive trust for new users, explicit fallback logic for every edge case.

Those weren’t engineering decisions. They were product decisions — made before the model was built, not discovered after.
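As an illustration of what those choices can look like in code, here is a minimal sketch of composite scoring with progressive trust and an explicit fallback. The signal names, weights, and cutoffs are assumptions made for the example, not the production design.

```python
# Minimal sketch: composite scoring, progressive trust, and an explicit fallback.
# Signal names, weights, and cutoffs are assumptions for the example, not the production design.

from dataclasses import dataclass

@dataclass
class Decision:
    action: str   # "allow", "step_up", or "block"
    reason: str

def decide(signals: dict, account_age_days: int,
           block_at: float = 0.85, step_up_at: float = 0.60) -> Decision:
    # Explicit fallback: if core signals are missing, step up rather than block on noise.
    if signals.get("device_score") is None or signals.get("behavior_score") is None:
        return Decision("step_up", "core signals unavailable")

    # Composite score: no single signal can push a transaction to "block" on its own.
    score = (0.4 * signals["device_score"]
             + 0.4 * signals["behavior_score"]
             + 0.2 * signals.get("network_score", 0.5))

    # Progressive trust: newer accounts face a lower step-up bar, not a lower block bar.
    if account_age_days < 30:
        step_up_at = 0.45

    if score >= block_at:
        return Decision("block", f"composite score {score:.2f}")
    if score >= step_up_at:
        return Decision("step_up", f"composite score {score:.2f}")
    return Decision("allow", f"composite score {score:.2f}")
```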

2. The most dangerous number in ML is a metric that looks fine.

This is the one that keeps me up at night.

Signal drift doesn’t announce itself. Model degradation doesn’t generate an alert. False positive rate creep doesn’t appear in your weekly dashboard until it’s already become a problem that compounds daily.

The most expensive ML failures I’ve studied share a common pattern: everything looked fine — right up until it didn’t. And by the time the dashboard showed a problem, the damage had been accumulating for weeks or months.

The governance insight here is counterintuitive: the absence of a red flag is not evidence that things are healthy. It means either that nothing is actually wrong, or that your detection system isn't sensitive enough to show you what is.

I built daily automated drift scoring specifically because I didn’t trust the absence of obvious signals. Not because I was pessimistic, but because I’d learned that a well-trained model on a stable dataset looks exactly the same as a model beginning to drift on a changing dataset — until you measure drift explicitly.
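Drift scoring does not need heavy machinery. Here is a minimal sketch using the Population Stability Index (PSI), a common drift measure; the ten bins and the 0.2 alert level are conventional illustrations, not the setup I ran.

```python
# Minimal sketch of a daily drift check using the Population Stability Index (PSI).
# The 10 bins and the 0.2 alert level are common conventions, not the setup I ran.

import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a stable baseline window and today's window of one signal."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the baseline range
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    curr_pct = np.histogram(current, edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)        # avoid log(0) on empty bins
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

def daily_drift_report(baseline_signals: dict, todays_signals: dict, alert_at: float = 0.2) -> dict:
    """Score every signal's drift; return {signal_name: (psi_score, alert_flag)}."""
    report = {}
    for name, baseline in baseline_signals.items():
        score = psi(baseline, todays_signals[name])
        report[name] = (score, score >= alert_at)
    return report
```

Run against a frozen baseline window each day, this turns "the dashboard looks fine" into an explicit number per signal.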

Most teams don’t measure drift explicitly. They wait for performance metrics to degrade. By then, the model has been making systematically worse decisions for long enough that tracing the root cause requires forensic investigation.

3. Governance cadence is architecture — and it has to be designed before launch, not after.

Here’s something I got wrong early in my career and corrected at USAA:

I used to think governance was something you established post-launch, once you understood the production environment.

I was wrong. Completely wrong.

Governance cadence is architecture in the same way that API contracts, error handling, and database schema are architecture. It has to be designed before the first sprint of build begins — because every decision you make during build will either enable or undermine your ability to govern the system after launch.

If you design the threshold governance model after launch, you’re retrofitting accountability onto a system that wasn’t built with accountability in mind. It’s the product equivalent of adding seatbelts after the car is already on the highway.

My biweekly review cadence was locked before Sprint 1. Not Sprint 8. Not after the first production incident. Before any code shipped.

That decision shaped the entire system design. It forced the threshold management interface to be a PM-controlled tool, not a code-level configuration. It forced the audit logging to be immutable and human-readable. It forced the rollback mechanism to be sub-60-second, not sprint-length.

All of those architectural choices trace back to one governance decision made before a single line of code was written.
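To make "PM-controlled tool, not code-level configuration" concrete, here is a minimal sketch of a threshold change that writes an append-only, human-readable audit entry. The field names and file layout are illustrative assumptions, not the actual tooling.

```python
# Minimal sketch of a PM-facing threshold change with an append-only, human-readable audit trail.
# The field names and file layout are illustrative assumptions, not the actual tooling.

import json
import time
from pathlib import Path

AUDIT_LOG = Path("threshold_audit.jsonl")   # append-only: entries are never edited or deleted

def change_threshold(config: dict, name: str, new_value: float,
                     changed_by: str, rationale: str) -> dict:
    """Apply a threshold change and record who changed what, when, and why."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "threshold": name,
        "old_value": config.get(name),
        "new_value": new_value,
        "changed_by": changed_by,
        "rationale": rationale,
    }
    with AUDIT_LOG.open("a") as log:
        log.write(json.dumps(entry) + "\n")
    return {**config, name: new_value}   # returns a new config; the previous one is untouched

# Example: a biweekly-review adjustment, no engineering ticket involved.
# config = change_threshold(config, "step_up_at", 0.55,
#                           changed_by="pm.agarcia", rationale="FP creep in new-device cohort")
```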

4. Automatic retraining is one of the most dangerous defaults in production ML — and almost everyone uses it.

I want to be careful here, because this one is genuinely counterintuitive and I don’t want to overstate it.

Automatic model retraining — where the model continuously updates based on new data without human review — sounds like a feature. It sounds like the system getting smarter over time.

In some contexts, it is. In high-stakes, regulated, irreversible-consequence contexts, it’s a risk that deserves explicit product governance.

Here’s why: a model that retrains automatically incorporates everything it observes — including noise, including adversarial patterns, including the consequences of its own previous decisions. Without explicit signal exclusion rules and human review of retraining triggers, you can end up with a model that has learned patterns from its own errors — a feedback loop that degrades performance gradually and invisibly.

I’ve seen this described in academic literature. I’ve experienced the early warning signs in production. It’s real, it’s subtle, and it’s almost never discussed in product management writing about AI.

My governance model had explicit signal exclusion rules — specific categories of data that were never fed back into training — specifically to prevent feedback loop corruption.

That wasn’t an engineering requirement. It was a product requirement that I defined, documented, and made non-negotiable before the model architecture was finalized.
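A sketch of what explicit signal exclusion can look like ahead of retraining is below. The excluded field names and the human-review gate are illustrative assumptions, not the actual requirement document.

```python
# Minimal sketch of explicit signal exclusion ahead of retraining.
# The excluded fields and the human-review gate are illustrative assumptions, not the real rules.

# Data that must never flow back into training, so the model cannot learn
# from its own decisions or from unreviewed outcomes it may have caused.
EXCLUDED_FROM_RETRAINING = {
    "model_decision",                      # the system's own allow / step_up / block output
    "model_score",                         # the score that produced that decision
    "unreviewed_post_decision_outcome",    # outcomes not yet confirmed by a human analyst
}

def build_training_row(raw_event: dict) -> dict:
    """Strip excluded signals before an event becomes eligible training data."""
    return {k: v for k, v in raw_event.items() if k not in EXCLUDED_FROM_RETRAINING}

def approve_retraining(candidate_events: list, human_reviewed: bool) -> list:
    """Retraining proceeds only after a human reviews the trigger, never automatically."""
    if not human_reviewed:
        raise RuntimeError("retraining trigger requires explicit human review")
    return [build_training_row(event) for event in candidate_events]
```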

5. The rollback infrastructure is a product feature — and it reveals everything about your governance philosophy.

How quickly can you reverse a threshold change?

This single question tells you almost everything you need to know about whether an ML product is actually governed or just deployed.

If the answer is “under 60 seconds, PM-controlled, no engineering required” — the team has built governance infrastructure.

If the answer is “a few days, once we get an engineering ticket in” — the team has built a feature and called it a product.

The difference matters enormously in practice. When rollback is trivially fast and PM-controlled, every governance decision becomes less risky — because the cost of being wrong is low. You can move decisively on threshold adjustments, because you know you can reverse them instantly if the data shows you were wrong.

When rollback is slow and engineering-dependent, every governance decision becomes politically fraught — because the cost of being wrong is high. Teams start defending threshold decisions instead of re-evaluating them. The governance cadence becomes a bureaucratic exercise instead of a genuine learning mechanism.

The rollback infrastructure is, in a real sense, the physical manifestation of your governance philosophy. And it has to be built intentionally — before launch, as a product requirement — because retrofitting fast rollback onto a deployed ML system is expensive, slow, and sometimes architecturally impossible.
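For illustration, here is a minimal sketch of the versioned-config pattern that makes sub-60-second rollback possible: every published threshold set is kept, and rollback is simply re-activating an earlier version. The class and method names are assumptions for the example, not the system described above.

```python
# Minimal sketch of the versioned-config pattern behind fast, PM-controlled rollback.
# Class and method names are assumptions for the example, not the system described above.

import copy

class ThresholdStore:
    """Keeps every published threshold config; rollback is just re-activating an old version."""

    def __init__(self, initial: dict):
        self.versions = [copy.deepcopy(initial)]   # version 0
        self.active = 0

    def publish(self, new_config: dict) -> int:
        """Publish a new config version and make it the active one."""
        self.versions.append(copy.deepcopy(new_config))
        self.active = len(self.versions) - 1
        return self.active

    def rollback(self, to_version: int) -> dict:
        """Re-activate a previous version. No redeploy, no engineering ticket."""
        if not 0 <= to_version < len(self.versions):
            raise ValueError(f"unknown version {to_version}")
        self.active = to_version
        return self.versions[self.active]

# store = ThresholdStore({"block_at": 0.85, "step_up_at": 0.60})
# store.publish({"block_at": 0.85, "step_up_at": 0.55})   # biweekly adjustment
# store.rollback(0)                                        # reverse it in seconds if metrics turn
```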

The thing I didn’t expect to learn

I expected this work to teach me about ML. About fraud detection. About payment systems architecture.

It did teach me those things.

But the deepest lesson was about something else entirely.

The deepest lesson was about intellectual humility as a system design principle.

The governance architecture I built at USAA was, at its core, a structure for catching my own mistakes before they scaled. Every review cycle, every drift alert, every rollback mechanism — all of it was built on the assumption that I would be wrong at some point and that the system needed to be designed around that certainty.

That’s not a comfortable assumption to build into a product. It requires genuinely believing that your current threshold decisions might be suboptimal — not as a rhetorical admission, but as a design constraint.

Most governance systems fail because they’re built to validate decisions, not to catch errors. They’re designed to prove that the current thresholds are correct, not to systematically question whether they should change.

The difference between a governance system that works and one that doesn’t is often just that — whether the PM who built it designed it to challenge their own assumptions, or to protect them.

I’ve tried to build governance systems that challenge mine. I don’t always succeed. But I’ve learned that the attempt itself — the genuine willingness to be shown wrong by the data — is the foundation everything else is built on.

What I want to leave you with

I’m not writing this because I have it all figured out.

I’m writing this because I’ve spent years in environments where the cost of not figuring it out was measured in unauthorized money transfers, blocked transactions, regulatory exposure, and member trust.

That context forced a clarity I don’t think I’d have developed otherwise.

And the clarity, distilled to its simplest form, is this:

The model is not the product. The governance system is the product. And building it well requires the same rigor, intentionality, and humility that you’d bring to any other system that makes real decisions with real consequences.

If you’re building AI products right now — and most product leaders are — I hope something in here is useful.

Not because I’ve said anything brilliant.

But because I’ve said things I wish someone had told me earlier.

I’m Andrés Garcia — Senior Product Manager specializing in payments, AI/ML systems, and regulated platform environments. I led TDV, a real-time ML trust decision system, at USAA, and the $1.9T Thinkorswim/TD Ameritrade integration at Charles Schwab.

I write about the governance layer of product management — the decisions most PM writing avoids because they’re hard to describe without having lived them.

Full portfolio → https://deft-genie-852849.netlify.app

LinkedIn → https://www.linkedin.com/in/andygarcia23/

