One slow API call just ate your entire thread pool. 200 healthy endpoints are returning 503 because of a single bad dependency.

Part 7 of my series on Resilience Engineering. Last time, we explored rate limiting and backpressure — teaching systems to say “no” before they drown. This time: what happens when the damage is already inside the walls, and one bad dependency drags everything else down with it. Follow along for more deep dives into building systems that don’t fall apart.
The Thread Pool That Ate Everything
Tuesday afternoon, about 2:30. An e-commerce platform I was consulting for handles 50,000 requests per minute across roughly 200 endpoints. Everything looks fine. Dashboards are green. Then a third-party recommendation service — one of those “customers who bought X also bought Y” APIs — starts responding slowly. Not failing outright, just slow. 30 seconds instead of 200 milliseconds.
The application server has a shared thread pool. 200 threads, serving everything. The recommendation endpoint starts consuming threads that just… sit there. Waiting. Hanging for 30 seconds each. Within about two minutes, all 200 threads are blocked waiting for recommendations to come back.
And then checkout goes down. The payment gateway is fine. The database is fine. There are literally no threads left to serve checkout requests. Product search goes down. User profiles go down. The health check endpoint goes down, which makes the load balancer think the whole server is dead, which makes things spiral further.
One slow third-party API. 200 healthy endpoints. Total platform outage. The recommendation service didn’t even fail — it just got slow. And that was enough.
This is what happens when all your resources live in one shared pool. When one consumer gets stuck, every other consumer starves. It’s like having one bank account for your mortgage, groceries, and entertainment budget. If one category overspends, everything else bounces.
The fix comes from an unlikely place: naval architecture.
Resource Isolation: The Core Idea
The bulkhead pattern is straightforward in concept: give each critical dependency or functionality its own isolated pool of resources. When one pool is exhausted, the others keep operating normally.
Instead of one shared thread pool of 200 threads serving everything, you create separate pools. The checkout flow gets 80 threads — it’s your revenue. Product search gets 60. Recommendations get 30. Everything else shares the remaining 30.
When the recommendation service goes slow and its 30 threads are all blocked, the checkout flow still has 80 threads available. Product search still has 60. The recommendation endpoint returns errors, but the rest of the platform stays healthy. That’s the whole idea, really.
from concurrent.futures import ThreadPoolExecutor
from functools import wraps
class BulkheadRegistry:
"""Manages isolated thread pools for different service dependencies."""
def __init__(self):
self._pools: dict[str, ThreadPoolExecutor] = {}
def register(self, name: str, max_threads: int):
self._pools[name] = ThreadPoolExecutor(
max_workers=max_threads,
thread_name_prefix=f"bulkhead-{name}"
)
def get_pool(self, name: str) -> ThreadPoolExecutor:
if name not in self._pools:
raise ValueError(f"No bulkhead registered for '{name}'")
return self._pools[name]
# Configure bulkheads at startup
bulkheads = BulkheadRegistry()
bulkheads.register("checkout", max_threads=80)
bulkheads.register("search", max_threads=60)
bulkheads.register("recommendations", max_threads=30)
bulkheads.register("default", max_threads=30)
# Use the appropriate bulkhead for each request
async def handle_checkout(request):
pool = bulkheads.get_pool("checkout")
future = pool.submit(process_checkout, request)
return future.result(timeout=5) # 5 second timeout
async def handle_recommendations(request):
pool = bulkheads.get_pool("recommendations")
try:
future = pool.submit(fetch_recommendations, request)
return future.result(timeout=2) # 2 second timeout
except Exception:
return default_recommendations() # Graceful fallback
When the recommendations pool is exhausted, handle_checkout is completely unaffected. The failure stays contained within its bulkhead.
What to Isolate
Not everything needs its own bulkhead. Over-isolation wastes resources — 10 pools of 20 threads each is less efficient than one pool of 200 when everything is healthy. The goal is to isolate the things most likely to fail or most critical to protect. Where exactly you draw those lines depends on your system, and honestly, you’ll probably adjust them a few times before they feel right.
Isolate by external dependency. Most third-party service calls should get their own resource pool. Third-party services tend to be the most common source of latency spikes — you don’t control their SLAs, their deployments, or their capacity planning. The recommendation service, the payment gateway, and the shipping calculator — each gets its own bulkhead. Whether you also isolate smaller, less critical integrations is a judgment call based on how much overhead you’re willing to carry.
Isolate by criticality. Revenue-generating endpoints (checkout, payments) shouldn’t have to compete for resources with non-critical ones like analytics or A/B tests. Give critical paths dedicated resources that nothing else can consume.
Isolate by failure characteristics. This one’s more nuanced. Group endpoints by how they fail. Endpoints that call the same database can probably share a bulkhead — if the database is slow, they’ll all be slow anyway. Endpoints that call different external services should have separate bulkheads because their failure modes are independent. I say “probably” because I’ve seen cases where one query pattern hammers the database while others are fine, and a shared bulkhead didn’t help. You have to know your traffic.
A practical starting point: one bulkhead per external dependency, one for your critical path, and one shared bulkhead for everything else. That’s typically 5–8 bulkheads for a medium-complexity service. Refine from there based on what production data tells you.
Beyond Thread Pools: Connection Pools, Memory, and CPU
Thread pool isolation is the most common form of bulkheading, but it’s worth remembering that any shared resource is a potential failure propagation path.
Connection pools are the second most common culprit. Your service has a database connection pool of 50 connections shared across all endpoints. A slow query from the reporting endpoint holds connections for 30 seconds. The pool fills up. The checkout endpoint can’t get a database connection.
The fix: separate connection pools for critical and non-critical queries. Or better yet, separate database read replicas — the reporting endpoint queries the replica, the checkout endpoint queries the primary. They can’t interfere with each other because they’re hitting entirely different database instances. This is one of those changes that feels like overkill until the first time it saves you.
Memory isolation matters for services that process variable-size payloads. A file upload endpoint that accepts 100MB files can consume all available memory, starving other endpoints. Kubernetes resource limits provide memory isolation at the container level, but within a single process, you need application-level controls like request size limits and streaming processing.
CPU isolation is handled reasonably well by container orchestration. Kubernetes CPU limits and requests help ensure one pod can’t starve others on the same node. Within a single service, though, CPU-intensive operations (image processing, encryption, compression) should be offloaded to dedicated worker pools so they don’t block request-handling threads. I’ve seen teams skip this step and regret it when a batch of image resizes brought their API response times to a crawl.
Bulkheads and Circuit Breakers: Better Together
Bulkheads and circuit breakers solve different problems, and they tend to be most effective in combination.
A circuit breaker detects when a dependency is failing and stops sending requests to it — preventing wasted resources and giving the dependency time to recover. A bulkhead limits the resources a dependency can consume, preventing a slow dependency from exhausting shared resources. Different angles on the same underlying concern.
Without a circuit breaker, a slow dependency fills its bulkhead with blocked threads. The bulkhead prevents cascade failure, but those threads are wasted — sitting idle, waiting for timeouts. Add a circuit breaker, and the circuit opens after detecting the slowness. Requests fail fast instead of blocking, and the bulkhead threads are freed immediately.
Going the other direction: without a bulkhead, a circuit breaker protects against a failed dependency but can’t prevent resource exhaustion during the detection window. It takes some number of failed requests (typically 5–10, depending on your configuration) for the circuit breaker to trip. During that window, threads are being consumed. With a bulkhead in place, even during the detection window, only the isolated pool is affected.
The combination works well in practice. The bulkhead contains the blast radius, while the circuit breaker minimizes the duration. A slow dependency consumes a limited number of threads for a limited time before requests fail fast with a fallback response. Neither mechanism alone gives you that.
async def handle_recommendations(request):
# Circuit breaker checks first — fail fast if dependency is known-bad
if circuit_breaker.is_open("recommendations"):
return default_recommendations()
# Bulkhead limits resource consumption
pool = bulkheads.get_pool("recommendations")
try:
future = pool.submit(fetch_recommendations, request)
result = future.result(timeout=2)
circuit_breaker.record_success("recommendations")
return result
except TimeoutError:
circuit_breaker.record_failure("recommendations")
return default_recommendations()
except Exception as e:
circuit_breaker.record_failure("recommendations")
return default_recommendations()
Sizing Bulkheads: The Math
Getting bulkhead sizes wrong can be worse than not having them at all. Too small, and you’re artificially limiting throughput during normal operation. Too large, and the bulkhead doesn’t actually contain anything meaningful.
The formula is a reasonable starting point: pool_size = peak_requests_per_second × average_latency_seconds × safety_factor
If your checkout endpoint handles 100 requests per second with 200ms average latency, you need: 100 × 0.2 × 1.5 = 30 threads. The 1.5x safety factor accounts for latency variance — though in my experience, latency distributions have long tails, and you might want to look at your p99 before committing to that multiplier.
For external dependencies, use the timeout value instead of average latency: pool_size = peak_rps × timeout_seconds × safety_factor. If recommendations handle 50 requests per second with a 2-second timeout: 50 × 2 × 1.5 = 150 threads. That's a lot — which is why timeouts should be aggressive. A 500ms timeout gives you: 50 × 0.5 × 1.5 = 38 threads. Much more reasonable.
This is why short timeouts matter so much in bulkheaded systems. The timeout directly determines how many threads you need, which determines how much resource overhead the pattern adds. Aggressive timeouts (200–500ms for most external calls) keep bulkhead sizes manageable. Though “aggressive” is relative — if your dependency legitimately needs 800ms to respond under normal conditions, a 500ms timeout is just going to generate false failures. Know your baselines.
Monitor your bulkhead utilization in production. If a pool consistently runs at 80%+ utilization during normal traffic, it’s undersized. If it never exceeds 20%, it’s oversized, and you’re wasting resources. Somewhere around 40–60% utilization during peak normal traffic is a decent target — that leaves headroom for spikes without over-provisioning. But these are guidelines, not laws. Your mileage will vary depending on traffic patterns and how bursty your load is.
Kubernetes-Native Bulkheads
If you’re running on Kubernetes, you get several bulkhead mechanisms out of the box.
Pod resource limits isolate CPU and memory at the container level. One pod can’t starve another on the same node. Set requests (guaranteed minimum) and limits (maximum allowed) for every pod — this is one of those things that’s easy to skip during development and painful to retrofit later.
Namespace resource quotas isolate resources at the team or service level. The recommendations namespace gets 20 CPU cores and 40GB memory. The checkout namespace gets 50 CPU cores and 100GB memory. Even if recommendations go haywire, it can’t consume checkout’s resources.
Pod Disruption Budgets ensure that maintenance operations (node drains, rolling updates) don’t take down too many pods at once. If your checkout service has 10 pods and a PDB of minAvailable: 8, Kubernetes won't voluntarily evict more than 2 pods simultaneously. I've seen teams forget about PDBs and then wonder why a routine node upgrade caused a brief outage.
Service mesh rate limiting (Istio, Linkerd) provides per-service traffic limits at the network level. Service A can call Service B at 1,000 requests per second maximum. This is a network-level bulkhead that prevents one service from overwhelming another.
These mechanisms complement application-level bulkheads. Kubernetes handles isolation between services; application-level bulkheads handle isolation within a service. In most production setups, you’ll want both layers.
What to Do Monday Morning
Start with your most critical service — the one that generates the most revenue or handles the most traffic.
Week 1: Identify shared resources. List every thread pool, connection pool, and external dependency that’s shared across endpoints. This is your vulnerability map. It’s usually more extensive than people expect.
Week 2: Add a bulkhead for your most dangerous external dependency — the one most likely to go slow or fail. Give it a dedicated thread pool sized with the formula above. Add a timeout. Add a fallback.
Week 3: Add a bulkhead for your critical path. Checkout, payments, whatever generates revenue. Give it dedicated resources that nothing else can consume.
Week 4: Add monitoring. Track bulkhead utilization, rejection rates, and timeout rates. Without this data, you’re guessing at pool sizes, and guessing tends to go poorly.
The bulkhead pattern isn’t glamorous. It’s plumbing — the kind of boring infrastructure work that prevents spectacular failures. Ships have bulkheads, so a hull breach in one compartment doesn’t sink the whole vessel. Your services should have them, so one slow dependency doesn’t take down the platform. It’s not exciting work, but the on-call engineer at 2 AM will be grateful you did it.
Resources:
- Microsoft: Bulkhead Pattern
- Netflix Hystrix: Bulkhead Pattern (archived)
- Resilience4j: Bulkhead
- Kubernetes: Resource Management
- Google SRE Book: Handling Overload
Series Navigation
Previous Article: Rate Limiting and Backpressure: Protecting Systems from Themselves (Part 6)
Next Article: Incident Response: From Detection to Resolution in 10 Minutes (Part 8 — Coming soon!)
Coming Up: Incident response, database resilience, the cost of resilience, and Kubernetes resilience patterns
This is Part 7 of the Resilience Engineering series. Read Part 1: Cell-Based Architecture & Circuit Breakers, Part 2: Chaos Engineering in Production, Part 3: The $10M Blind Spot, Part 4: When Everything Fails, Part 5: The Day AWS Went Down, and Part 6: Rate Limiting and Backpressure.
About the Author: Daniel Stauffer is an Enterprise Architect specializing in resilience engineering and distributed systems. He designs systems where one slow API call doesn’t take down the entire platform.
The Bulkhead Pattern: Isolating Failure Domains Within Services was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.