The Monitoring Setup That Catches Bugs Before Your Users Do
You can’t fix what you can’t see. And right now, you’re blind.
It was 6:47 AM on a Saturday. My phone buzzed. A Slack alert from our monitoring bot: “P95 response time on /api/orders exceeded 2000ms for 5 minutes.”
I opened Grafana on my phone. The response time graph showed a clear inflection point at 6:31 AM — latency had started climbing steadily. I clicked through to the Prometheus metrics. Memory usage on worker 3 was at 98% and climbing. A quick Sentry check showed no new errors — the app wasn’t crashing, it was drowning.
I SSHed in and restarted worker 3. Response times dropped back to normal. Then I checked the structured logs — a batch processing task that started at 6:30 AM was loading an entire CSV file into memory instead of streaming it. The file had grown to 400MB overnight.
The fix was a two-line change. But here’s the important part: no user reported an issue. The alert fired about 6 minutes after response times crossed the threshold. I had it fixed before most users woke up. Without monitoring, that runaway memory usage would have crashed the entire API during peak hours on Monday.
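To make the change concrete, it was roughly the shape of the sketch below: stop materializing the whole export in memory and process it row by row instead. The file path, function names, and use of Python’s csv module here are my illustrative assumptions, not the actual task code.

```python
import csv

REPORT_PATH = "/data/exports/orders.csv"  # assumption: the nightly export that grew to ~400MB

def process_row(row: dict) -> None:
    ...  # placeholder for whatever per-row work the batch task does

# Before: the entire file is loaded into memory at once.
def run_batch_eager() -> None:
    with open(REPORT_PATH, newline="") as f:
        rows = list(csv.DictReader(f))   # a 400MB file becomes hundreds of MB of Python objects
    for row in rows:
        process_row(row)

# After: the file is streamed one row at a time, so memory stays flat
# no matter how large the export grows overnight.
def run_batch_streaming() -> None:
    with open(REPORT_PATH, newline="") as f:
        for row in csv.DictReader(f):    # DictReader yields rows lazily
            process_row(row)
```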
This is what monitoring gives you — time. Time to notice, time to diagnose, time to fix. Here’s the exact setup that bought us those 6 minutes…