In the AI era, speed is the metric, but quality tells a better story

Every engineering dashboard I’ve seen tells the same story: PRs are up. Cycle time is down. Developers are shipping more code than ever.
Awesome. Now, show me your quality metrics.
I get it. The industry decided that AI adoption should be measured by output volume because it makes the best story. The productivity increase isn’t a line; it’s a hockey stick bending straight toward the stars. By those numbers, AI is a spectacular success.
But a growing body of evidence suggests we’re at best ignoring quality and, at worst, measuring the wrong thing. Output is up, but so are defects, outages, tech debt and, sadly, time to resolve issues.
I’ll Show You Quality Metrics
CodeRabbit analyzed 470 GitHub pull requests from 2025. 320 were AI-authored against 150 that were human-only. The review found that AI-generated code contains 1.7 times more issues overall than human-written code. Not cosmetic issues, either. Logic and correctness errors: the kind that pass every automated check and explode in production.
Worse: Security flaws appeared at 1.5 to 2 times the rate of human code. Excessive I/O operations were roughly eight times more common. Concurrency mistakes, which only surface under load, appeared twice as often.
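To see why concurrency mistakes slip past review and testing, here’s a minimal sketch (not from the CodeRabbit report, just an illustration) of the classic version: an unsynchronized read-modify-write on shared state. It passes a quick local run, then loses updates once enough threads hammer it.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    # Race: "counter += 1" is a read, an add, and a store. Threads can
    # interleave between those steps and overwrite each other's writes.
    global counter
    for _ in range(n):
        counter += 1

def safe_increment(n):
    # Fix: serialize the read-modify-write with a lock.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker, threads=8, n=50_000):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter

# run(safe_increment) always returns 400_000.
# run(unsafe_increment) may return less — but only under enough load,
# which is exactly why this class of bug survives code review.
```

Nothing in the unsafe version looks wrong at a glance, and a single-threaded test will never catch it — which is the point.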
Meanwhile, Cortex’s 2026 benchmark found incidents per pull request up 23.5% year-over-year across the industry, while pull requests per author rose 20%.
In other words, we’re shipping more code and more of it is breaking. That’s the wrong hockey stick.
Finally, IEEE Spectrum published a piece in January 2026 in which Jamie Twiss, CEO of Carrington Labs, noted that newer AI models generate code that avoids crashes by removing safety checks (the equivalent of deleting a failing unit test) or by producing incorrect output that matches the expected format. He argues that kind of silent failure is “far, far worse than a crash.”
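Twiss’s point is easy to see side by side. This is a hypothetical illustration (the function names and scenario are invented, not from his article): one version fails loudly on bad input; the “defensive” version never crashes and instead returns a plausible, correctly formatted, wrong answer.

```python
def parse_rate_strict(value: str) -> float:
    """Crashes loudly on bad input — the failure is visible immediately."""
    rate = float(value.rstrip("%")) / 100
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"rate out of range: {value!r}")
    return rate

def parse_rate_silent(value: str) -> float:
    """Never crashes — and silently turns garbage into a
    valid-looking number in the expected format."""
    try:
        return float(value.rstrip("%")) / 100
    except ValueError:
        return 0.0  # wrong, but it type-checks and formats correctly

# parse_rate_strict("abc") raises ValueError: caught on the first run.
# parse_rate_silent("abc") returns 0.0: it ships, and quietly corrupts
# every downstream calculation until someone notices.
```

The strict version produces a stack trace; the silent one produces a number. Only one of those gets noticed before production.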
The Pressure to Ship (With AI)
This might be survivable if teams could simply slow down and review more carefully. But the incentive structure demands faster delivery. The message to engineers is clear: use more AI or look unproductive. That pressure doesn’t come with a corresponding message about reviewing more carefully.
Gergely Orosz reported in The Pragmatic Engineer that Meta now factors AI token usage into performance reviews. Uber built nearly a dozen internal systems to manage AI-generated code, but when measuring the impact, focused entirely on output volume, going so far as to classify those who use AI to generate more pull requests as “power users.” There’s no time to worry about bugs in a system like that.
Even Anthropic, the company that builds Claude, shipped a UX bug on their flagship website that impacted every paying customer and nobody internally caught it. Orosz pointed out the irony that roughly 80% of Anthropic’s production code is generated by Claude Code and Claude didn’t catch it. It took a viral complaint on X to get it fixed.
What Actually Needs to Change
Amazon learned this the hard way. After a string of outages, the company mandated a senior engineer sign-off on all AI-assisted changes from junior and mid-level engineers. The Financial Times reported that Amazon’s SVP characterized the recent site reliability as “not good” and made the normally optional operations meeting mandatory.
That’s a start, but it’s a people-gate bolted onto a process problem. Here’s what the data actually points to:
Measure what matters. Track defect density, rollback frequency, time-to-detect for production issues, and incident severity alongside velocity metrics. If output is up and quality isn’t keeping pace, you’re manufacturing debt.
Treat review volume as a first-class problem. AI generates code faster than humans can review it, and that gap won’t close without intentional changes to the review process. CodeRabbit’s research found that review fatigue from higher volumes leads directly to more missed bugs. If your review process hasn’t scaled and adapted to AI output, your quality gate is already broken.
Stop using AI adoption as a performance metric. The moment you measure engineers by how much AI they use, you’ve incentivized volume over judgment. We stopped measuring developers by lines of code a long time ago; we shouldn’t start measuring AI the same way. Measure outcomes instead.
Require the prompt alongside the PR. If the AI wrote the code, the reviewer should see the prompt that generated it. Without understanding intent, the reviewer is just checking syntax on logic they can’t evaluate. Also, prompts should be reviewed for correctness and efficiency.
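The first recommendation above — measure what matters — can be sketched concretely. This is a minimal illustration with invented field names, not a prescribed schema: report defect density and rollback rate next to velocity, so a rising output line can’t hide a rising defect line.

```python
from dataclasses import dataclass

@dataclass
class SprintStats:
    prs_merged: int
    defects_found: int  # bugs traced back to this sprint's changes
    rollbacks: int      # deploys that had to be reverted

def quality_report(s: SprintStats) -> dict:
    # Velocity alone rewards volume; pairing it with per-PR quality
    # ratios makes "output up, quality down" visible in one view.
    return {
        "velocity": s.prs_merged,
        "defect_density": s.defects_found / s.prs_merged,
        "rollback_rate": s.rollbacks / s.prs_merged,
    }

# Example: 120 PRs with 18 traced defects and 6 rollbacks
# gives a defect density of 0.15 and a rollback rate of 0.05.
```

If velocity climbs while defect density climbs with it, you’re not shipping faster; you’re manufacturing debt faster.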
The Real Question
2025 was the year of AI speed, but the question isn’t whether AI makes us faster. It does. The question is whether faster matters when nobody’s watching where we’re going.
We’re Shipping Faster Than Ever. Are We Shipping Better? was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.