
System Resilience: From Failure Modes to Design Patterns

By Okan Yenigün · Published March 30, 2026 · 12 min read · Source: Level Up Coding

Your System Is Only as Good as Its Worst Day

Photo by Sajad Nori on Unsplash

Most systems look perfect when everything is working.

Requests flow, responses return, dashboards stay green. It’s easy to believe that correctness is the goal, that if the system works under normal conditions, the job is done.

But real systems don’t live in normal conditions all the time.

Dependencies slow down, networks drop packets, and requests arrive in bursts. Some parts fail while others keep running. And in those moments, the question is no longer “Does it work?” but:

What happens when it doesn’t?

Systems are not defined by their success paths, but by how they behave under failure.

The happy path is a lie

Most systems are designed around a simple mental model:

Request → process → response

A clean, linear flow where every dependency is available, every call returns on time, and every component behaves exactly as expected.

This is the happy path.

Happy. Source

But it’s not where systems actually live. In reality, dependencies fail without warning.

The result is not a neat sequence of steps, but a constantly shifting environment where timing, load, and availability are never guaranteed.

The “normal case” is not the baseline; it’s the exception.

Service is down for 10 minutes

Let’s reduce everything to a simple system:

Service A depends on Service B.

Now assume that Service B is unavailable for 10 minutes.

What happens to the requests coming into A during those 10 minutes?

Each of these choices leads to a completely different system behavior. And the story doesn’t end when Service B comes back.

This simple scenario exposes almost every weakness in a system’s design.

The fundamental failure modes

Before we talk about solutions, we need to understand what actually goes wrong.

System failures are not random; they follow patterns. Once you start looking closely, you’ll see the same types of problems appearing again and again, regardless of the tech or architecture.

Overload

When requests pile up

Every system has a limit. It may not be obvious or fixed, but there is always a point beyond which the system cannot process incoming work fast enough.

Now imagine a small gap:

The system can handle 100 requests per second, and it receives 120.

Nothing dramatic happens at first; the system doesn’t crash, it keeps working. But those extra 20 requests per second don’t disappear; they start to accumulate.

Queues begin to grow, response times increase, and requests spend more time waiting than being processed. From the outside, everything still looks “up”, but internally, pressure is building.
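The arithmetic behind that buildup can be made concrete. The sketch below is a rough illustration using the 100 and 120 requests-per-second figures from above; the function names are hypothetical, and it assumes a constant arrival rate and an initially empty queue.

```python
def backlog_after(seconds: int, arrival_rps: int = 120, capacity_rps: int = 100) -> int:
    """Requests still waiting after `seconds` of sustained load (queue starts empty)."""
    surplus = max(0, arrival_rps - capacity_rps)
    return surplus * seconds


def queue_wait_seconds(backlog: int, capacity_rps: int = 100) -> float:
    """How long a newly arrived request waits behind `backlog` queued requests."""
    return backlog / capacity_rps


for t in (10, 60, 600):
    waiting = backlog_after(t)
    print(t, waiting, queue_wait_seconds(waiting))
# 10 200 2.0
# 60 1200 12.0
# 600 12000 120.0
```

After ten minutes of a 20% surplus, a new request would wait two minutes before any work on it begins, even though no component has crashed.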

Then something subtle happens:

Now the system is no longer dealing with 120 requests per second; it might be dealing with 150 or more.

Overload is not a sudden failure; it is a slow buildup that eventually turns into collapse.

That’s what makes it dangerous. By the time the system visibly breaks, it has already been in trouble for a while.

Once overload starts, it rarely stays isolated; it triggers the next failure modes.

Retry amplification

When recovery makes things worse

When something fails, the natural reaction is to try again.

Retry amplification. Source

A request times out, so the client retries. If it fails again, it retries once more. On the surface, this looks resilient: an attempt to recover from temporary issues.

But under stress, this behavior turns dangerous.

Imagine a service that is already slow due to overload. Requests start timing out. Clients begin retrying. Now each original request is no longer one request; it becomes two, three, sometimes more.

The system is now receiving more traffic precisely when it is least able to handle it.

Retry storm feedback loop. Source

This creates a feedback loop:

What started as a recovery mechanism becomes a load multiplier.
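A back-of-the-envelope model shows the size of that multiplier. This is a simplified sketch, not a measurement: it assumes every attempt fails independently with the same probability and that clients retry immediately up to a fixed limit. The function names are illustrative.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average attempts per original request.

    Attempt k+1 only happens if the first k attempts all failed,
    so the expected count is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


def effective_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Traffic the service actually sees once retries are included."""
    return base_rps * expected_attempts(failure_rate, max_retries)


print(effective_load(100, 0.0, 2))  # 100.0, healthy service
print(effective_load(100, 0.5, 2))  # 175.0, half of all attempts time out
print(effective_load(100, 1.0, 2))  # 300.0, everything times out
```

The worse the service performs, the more traffic it receives, which is exactly the wrong direction.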

The most dangerous part is that nothing is technically “wrong”. Each component is behaving as designed. But together, they amplify the problem instead of solving it.

This is how systems that were only slightly degraded end up fully unavailable.

Duplication

When the same work happens twice

In a perfect world, every request would be processed exactly once. In reality, you can’t rely on that.

Consider a simple scenario:

From the client’s perspective, it looks like a failure. So it retries.

Now the system receives the same request again and processes it again.

The result is duplication.

This is not a rare edge case. It is a direct consequence of timeouts, retries, and unreliable networks. The system cannot distinguish between a request that failed and a request that succeeded but whose response never arrived.

And the consequences are not just technical, they are real:

The key shift is this:

You must assume that any operation may be executed more than once.

Partial failure

Most systems are not a single unit; they are a collection of services, databases, and external dependencies.

Which means they don’t fail all at once.

Instead, some parts keep working while others degrade or stop responding.

One service might be fast, another might be slow, and a third might be unavailable. From the outside, the system is neither fully up nor fully down; it’s somewhere in between.

This is partial failure.

Partial failure amplified in microservices. Source

It’s one of the hardest problems to reason about, because behavior becomes inconsistent:

There is no single, clear signal that “the system is broken.” This creates ambiguity:

These are not edge cases; they are the normal operating conditions of real systems. And once partial failure appears, it interacts with everything else:

What looks like a small issue in one component quickly spreads across the system.

Core concepts that address these failures

Once you understand how systems fail, the next step is not to eliminate those failures; that’s impossible. Instead, we design around them.

Each of the failure modes we’ve seen leads to a specific kind of response:

These are not abstract concepts. They are practical mechanisms that shape how a system behaves under stress.

Good system design is not about avoiding failure, but about responding to it in a controlled and predictable way.

Backpressure

Controlling the flow

The first problem we saw was overload: the system accepts more work than it can handle. Backpressure is the response to that.

Instead of allowing unlimited requests to enter and hoping the system can keep up, backpressure enforces a simple rule:

Only accept as much work as you can process safely.

This means the system must sometimes slow down or reject incoming requests. That may sound counterintuitive. Why turn work away? But the alternative is worse: accepting everything and collapsing under the load.

Without backpressure:

With backpressure:

This shifts the system from failing internally to failing explicitly. Some requests may be delayed or rejected, but the system as a whole remains stable.

A system that tries to handle everything will eventually handle nothing.

Backpressure flow states. Source

Backpressure is not a single feature you turn on. It is a set of decisions about when and how the system refuses work.

Rate limiting

Define a hard boundary

The simplest form of backpressure is to set a limit:

What is rate limiting? Source

When the limit is reached, the system responds immediately instead of letting requests pile up. This protects capacity at the point of entry.
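One common way to implement such a limit is a token bucket. The sketch below is a minimal single-process version under stated assumptions: the class name, the parameters, and the injectable clock are illustrative, and it is not thread-safe.

```python
import time


class TokenBucket:
    """Allows `rate_per_sec` requests on average, with bursts of up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)   # start full
        self.clock = clock           # injectable so the limiter can be tested
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `allow()` before doing any work and answer right away when it returns False, for example with an HTTP 429 response, rather than queueing the request.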

Bounded queues

Limit how much you buffer

Queues are useful, but they can become dangerous if they grow without limit.

A bounded queue introduces a cap:

Bounded queue. Source

This prevents memory exhaustion and unbounded latency.
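In Python, the standard library queue already supports a cap. The sketch below shows one possible policy, rejecting new work when the queue is full; the helper name and the capacity of 3 are arbitrary choices for illustration.

```python
import queue


def try_enqueue(q: queue.Queue, item) -> bool:
    """Add `item` if there is room; report failure instead of blocking or growing."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


jobs = queue.Queue(maxsize=3)
results = [try_enqueue(jobs, n) for n in range(5)]
print(results)  # [True, True, True, False, False]
```

The caller learns immediately that the item was not accepted and can decide whether to reject the request, retry later, or drop it.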

Load shedding

Drop work selectively

Not all requests are equally important. Under stress, the system can drop low-priority requests and preserve critical ones. For example:

Load shedding. Source

When capacity is limited, survival depends on prioritization.
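A shedding policy can be as simple as a function of priority and current utilization. The priority levels and the 80% and 95% thresholds below are assumptions chosen for illustration, not recommended values.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0   # e.g. checkout, login
    NORMAL = 1
    LOW = 2        # e.g. recommendations, analytics


def should_shed(priority: Priority, utilization: float) -> bool:
    """Decide whether to drop a request, given utilization in the range 0.0 to 1.0."""
    if utilization >= 0.95:
        return priority != Priority.CRITICAL   # keep only critical work
    if utilization >= 0.80:
        return priority == Priority.LOW        # drop the least important work first
    return False                               # healthy: accept everything
```

How utilization is measured (CPU, queue depth, in-flight requests) is left open here; the point is that the decision is explicit and made before the work is started.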

Timeouts

Avoid holding resources too long

Requests that take too long consume resources that could be used elsewhere. By enforcing timeouts:

API timeout. Source

This is a subtle form of backpressure; it limits how long work can occupy the system.
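As a minimal sketch, here is how a deadline might be enforced with Python’s asyncio. The `slow_dependency` coroutine stands in for a real downstream call, and the helper name and fallback value are illustrative.

```python
import asyncio


async def call_with_timeout(coro, seconds: float, fallback=None):
    """Await `coro`, but give up and return `fallback` after `seconds`."""
    try:
        # wait_for cancels the underlying task when the deadline passes.
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


async def slow_dependency(delay: float) -> str:
    """Hypothetical dependency that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return "ok"


async def main():
    fast = await call_with_timeout(slow_dependency(0.01), 0.5, fallback="timed out")
    slow = await call_with_timeout(slow_dependency(1.0), 0.05, fallback="timed out")
    return fast, slow


print(asyncio.run(main()))  # ('ok', 'timed out')
```

Because the slow call is cancelled rather than left running, the caller stops holding resources for work that nobody is waiting on anymore.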

Signaling upstream

Slow the source

In some systems, backpressure is not just rejection; it is communication.

Instead of dropping requests, the system signals upstream: send less, slow down.

This is common in streaming or pipeline systems.
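A small in-process illustration of this idea uses a bounded asyncio queue: when the queue is full, the producer’s `put` simply waits, so the producer is slowed to the consumer’s pace without dropping anything. The names, the queue size, and the simulated work delay are all assumptions.

```python
import asyncio


async def producer(q: asyncio.Queue, n: int, log: list) -> None:
    for i in range(n):
        await q.put(i)                 # suspends while the queue is full
        log.append(("put", i, q.qsize()))
    await q.put(None)                  # sentinel: no more items


async def consumer(q: asyncio.Queue, log: list) -> None:
    while True:
        item = await q.get()
        if item is None:
            break
        await asyncio.sleep(0.001)     # simulate slow processing
        log.append(("got", item))


async def run(n: int = 10, maxsize: int = 2) -> list:
    q = asyncio.Queue(maxsize=maxsize)
    log: list = []
    await asyncio.gather(producer(q, n, log), consumer(q, log))
    return log
```

Across a network the same signal takes other forms, but the principle is unchanged: the slow side controls how fast the fast side may send.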

Idempotency

Making repeats safe

We’ve already seen that duplication is unavoidable. Requests can be retried. Messages can be delivered more than once. Responses can be lost even when the work has already been done.

You cannot guarantee that an operation will run exactly once.

Idempotency is the response to that reality.

API Design: Idempotency. Source

An operation is idempotent if running it multiple times produces the same result as running it once.

This is usually done by attaching a unique identifier to the request:

When the system receives a request:

The operation is not executed again.

Without idempotency:

Idempotency is not an optimization. It is a safety guarantee under uncertainty.
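A minimal sketch of the idea, using a hypothetical payment service: the result of each request is stored under its idempotency key, and a repeated key returns the stored result instead of charging again. The in-memory dictionary is an assumption for brevity; a real system would need a durable store and an atomic check-and-set.

```python
class PaymentService:
    """Hypothetical service that charges an account at most once per key."""

    def __init__(self):
        self._results = {}   # idempotency key -> result of the first execution
        self.charges = []    # side effects that were actually performed

    def charge(self, idempotency_key: str, account: str, amount: int) -> dict:
        # Seen this key before: return the original result, do no new work.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        # First time: perform the side effect and remember the outcome.
        self.charges.append((account, amount))
        result = {"status": "charged", "account": account, "amount": amount}
        self._results[idempotency_key] = result
        return result


svc = PaymentService()
first = svc.charge("req-123", "alice", 50)
retry = svc.charge("req-123", "alice", 50)   # client retried after a lost response
print(first == retry, len(svc.charges))      # True 1
```

The client generates the key once per logical operation and reuses it on every retry, so a lost response can no longer turn into a second charge.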

Queuing

Absorbing instability

Some failures are not permanent; they are temporary.

A dependency might be down for a few minutes. A service might be slow under load. In these situations, the problem is not that the work cannot be done, but that it cannot be done right now.

Queuing is the response to that.

Instead of forcing immediate processing, the system separates accepting work from processing work.

Work is stored first and handled when the system is ready.

The system no longer depends on everything being available at the same moment. A request can be accepted immediately and processed later. This decoupling is what allows systems to remain stable under fluctuation.
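The separation can be sketched in a few lines. In this illustration, `accept` always succeeds and `drain` only makes progress while the dependency is reachable; the class name and the `is_available` callback are hypothetical, and the queue is left unbounded only to keep the example short.

```python
from collections import deque


class DeferredWorker:
    """Accepts work immediately and processes it when the dependency is up."""

    def __init__(self, handler, is_available):
        self.pending = deque()            # unbounded here; a real queue needs a cap
        self.handler = handler            # does the actual work
        self.is_available = is_available  # returns True when the dependency is up

    def accept(self, job) -> str:
        self.pending.append(job)
        return "accepted"

    def drain(self) -> int:
        """Process queued jobs in order while the dependency stays available."""
        done = 0
        while self.pending and self.is_available():
            self.handler(self.pending.popleft())
            done += 1
        return done
```

While the dependency is down, callers still get an immediate "accepted" and nothing is lost; once it returns, a call to `drain` works through the backlog in arrival order.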

Without a queue:

With a queue:

There is a trade-off. Queueing introduces delay. Work is no longer instantaneous; it becomes eventual.

Stability is gained by accepting time as a variable.

Queue delay. Source

Queues can hide problems. If the system is consistently slower than incoming traffic, then the queue grows, delays increase, and issues become visible only much later.

This is why queues must be bounded and monitored.

Dead letter queues

Accepting failure

Not all work can be completed. Some requests will fail; not because of timing, but because they are invalid, inconsistent, or impossible to process. No amount of retries or waiting will change that.

Dead Letter Queues (DLQ) are the response to this reality.

DLQ. Source

Instead of retrying forever, the system sets a boundary.

Try a limited number of times. If it still fails, isolate it.

Failed messages are moved to a separate queue, the DLQ, where they can be inspected and handled later.

The system no longer assumes that everything must eventually succeed. Instead, it ensures that everything must either succeed or be accounted for.
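The rule "try a limited number of times, then isolate" can be sketched as a small loop. This is an in-memory illustration with assumed names and an assumed limit of three attempts; real message brokers typically provide this behavior through configuration rather than application code.

```python
from collections import deque


def process_with_dlq(messages, handler, max_attempts: int = 3):
    """Run `handler` on each message; park repeatedly failing messages in a DLQ."""
    work = deque((m, 1) for m in messages)   # (message, attempt number)
    done, dead = [], []
    while work:
        msg, attempt = work.popleft()
        try:
            handler(msg)
            done.append(msg)
        except Exception as exc:
            if attempt >= max_attempts:
                # Out of attempts: keep the message and the reason for inspection.
                dead.append({"message": msg, "attempts": attempt, "error": str(exc)})
            else:
                work.append((msg, attempt + 1))   # requeue for another try
    return done, dead


def handler(n: int) -> None:
    if n < 0:
        raise ValueError("negative amount")


done, dead = process_with_dlq([1, -1, 2], handler)
print(done)                 # [1, 2]
print(dead[0]["attempts"])  # 3
```

The invalid message does not block the valid ones, and it is not silently lost: it ends up in `dead` together with the error that explains why.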

The trade-off is that moving to a DLQ means accepting that:

Stability is preserved by isolating what cannot be fixed automatically.

Controlled retries

Not all retries are equal

Retries are necessary. Failures happen due to timeouts, temporary outages, or network issues. In many cases, simply trying again is enough to recover.

Every retry consumes capacity.

Retries should not be immediate, unlimited, or blind. They must be delayed, limited, and selective.

Using retry pattern. Source

Instead of retrying immediately, the system introduces time and variation:

This reduces pressure on the system and gives it time to recover.

A critical distinction:

Retrying everything is not resilience; it is inefficiency. A retry only makes sense if the next attempt has a chance to succeed.
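One widely used way to make retries delayed, limited, and selective is exponential backoff with jitter. The sketch below assumes a simple split of errors into two hypothetical exception types; the base delay, cap, and attempt limit are illustrative values.

```python
import random
import time


class TransientError(Exception):
    """Might succeed on a later attempt (timeout, temporary outage)."""


class PermanentError(Exception):
    """Will never succeed (invalid input); retrying is pointless."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng=random.random) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)), so clients spread out."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(fn, max_attempts: int = 4, sleep=time.sleep, rng=random.random):
    """Call `fn`, retrying only transient failures, with a growing randomized delay."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                      # limited: give up after the last attempt
            sleep(backoff_delay(attempt, rng=rng))
        # PermanentError is not caught, so it propagates on the first failure.
```

The growing delay gives the struggling service room to recover, the jitter keeps many clients from retrying in lockstep, and the exception types encode which failures are worth a second attempt.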

Designing for failure

Failures in systems do not stay isolated. What begins as a small issue rarely remains contained. It interacts with other parts of the system and triggers a chain of reactions.

A service slows down. Requests start timing out. Clients retry. Load increases. Queues grow. Some components fail while others continue. What started as a minor degradation becomes a system-wide problem.

This is where the shift happens.

If failure is inevitable and interconnected, then system design cannot be based on ideal conditions. It must be based on what happens under stress.

Engineering is not only about making things work; it is also about making sure the system keeps working when parts of it don’t.

With this perspective, the definition of success changes.

A system is working because:

Writing code gets a system to run. Understanding failure is what keeps it running.


Sources

https://www.telerik.com/blogs/building-resilient-apis-retry-pattern

https://www.geeksforgeeks.org/system-design/dead-letter-queue-system-design/

https://www.mdpi.com/1424-8220/23/6/3024

https://sketechnews.substack.com/p/idempotency-duplicate-requests

https://www.sportmonks.com/glossary/api-timeout-2/

https://www.srao.blog/p/load-shedding-when-your-api-needs

https://www.chipverify.com/systemverilog/systemverilog-queue

https://www.geeksforgeeks.org/system-design/rate-limiting-in-system-design/

https://systemdr.substack.com/p/backpressure-mechanisms-in-distributed

https://learn.microsoft.com/en-us/dotnet/architecture/microservices/implement-resilient-applications/handle-partial-failure

https://keyholesoftware.com/preventing-retry-storms-with-responsible-client-policies/

https://newsletter.pragmaticengineer.com/p/resiliency-in-distributed-systems-74c

https://blog.zeplin.io/collaboration/what-is-the-happy-path/


System Resilience: From Failure Modes to Design Patterns was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

