Start now →

Transactional Outbox with RabbitMQ (Part 2): Handling Retries, Dead-Letter Queues, and Observability

By Sagar Maheshwary · Published March 6, 2026 · 18 min read · Source: Level Up Coding
Blockchain
Transactional Outbox with RabbitMQ (Part 2): Handling Retries, Dead-Letter Queues, and Observability

In Part 1, we built a correct and production-oriented foundation for the transactional outbox pattern using Go, PostgreSQL, and RabbitMQ. We ensured that domain state changes and event creation were atomic, published events asynchronously via a poll-based outbox worker, and enforced exactly-once processing on the consumer side using idempotency.

That baseline system already solves the hardest problem: correctness under partial failure.

But correctness alone isn’t enough in real-world systems.

Once this pattern hits production, new questions immediately surface:

This article answers those questions.

In Part 2, we extend the system from Part 1 with failure-aware behavior and operational guardrails, without changing the core model. The transactional outbox remains the source of truth; we simply make it resilient, observable, and easier to operate under load.

Specifically, we’ll cover:

This is not a rewrite of the system from Part 1.
Instead, it’s an evolution, the kind that happens after the first real incident.

By the end of this article, you’ll understand how to take a correct outbox implementation and turn it into one that’s boring to operate, even when things go wrong.

Table of Content

Failure as a First-Class Design Constraint

In distributed systems, failure isn’t an edge case, it’s the normal operating condition.

Message brokers go down, consumers crash and restart, networks partition, bad data makes it to production.

Retries exist because things will fail, not because we expect them to succeed eventually.

But retries are also dangerous.

Poorly designed retries don’t fix problems, they amplify them:

Not all failures are transient. Some messages will never succeed due to invalid payloads or schema mismatches. A healthy system must know when to stop retrying and move on.

Well-designed retries share a few properties:

The transactional outbox gives us a natural place to implement this safely: durable retry state, explicit event lifecycle, and isolation from request paths.

In the next section, we’ll extend the outbox worker with retry logic and exponential backoff, turning failures into delayed work instead of cascading outages.

Retrying Outbox Publishing (Producer-Side Retries)

In Part 1, a failed publish immediately moved the event to a failed state. That was intentional, it gave us a correct baseline without hiding failure.

In real systems, however, most publish failures are transient: brief broker outages, network hiccups, or short-lived resource pressure. Retrying these events is both safe and expected, as long as retries are designed carefully.

In this section, we extend the outbox worker to support bounded, delayed retries with exponential backoff, while keeping all retry state durable and observable.

Schema changes for retries

To support retries, the outbox needs to track how many times an event has been attempted and when it becomes eligible again.

The Part 2 schema extends outbox_events with two new columns:

CREATE TABLE outbox_events (
id TEXT PRIMARY KEY,
event_key TEXT NOT NULL,
payload JSONB NOT NULL,
status OutboxEventStatus NOT NULL,
retry_count INT NOT NULL DEFAULT 0,
next_retry_at TIMESTAMP NULL,
locked_at TIMESTAMP DEFAULT NULL,
locked_by VARCHAR(128) NULL,
failure_reason VARCHAR(128) DEFAULT NULL,
failed_at TIMESTAMP DEFAULT NULL,
traceparent TEXT NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);

Conceptually:

Retries change how workers select events, so indexing needs to reflect that.

CREATE INDEX idx_outbox_events_pending_ready
ON outbox_events (created_at)
WHERE status = 'pending';

CREATE INDEX idx_outbox_events_retryable
ON outbox_events (locked_at, next_retry_at, created_at)
WHERE status = 'in_progress';

idx_outbox_events_pending_ready Optimizes the common case: picking the oldest pending events ready to publish.
idx_outbox_events_retryable Speeds up recovery scenarios where:

Using partial indexes keeps them small and focused, avoiding unnecessary index scans on completed or failed events.

Making the worker retry-aware

Retries are modeled as delayed eligibility, not as a new status.

An event is eligible for publishing if:

The worker’s polling query is updated accordingly:

UPDATE outbox_events
SET
status = ?,
locked_at = NOW(),
locked_by = ?
WHERE id IN (
SELECT id
FROM outbox_events
WHERE
(
status = ?
OR (
status = ?
AND locked_at < NOW() - INTERVAL '30 seconds'
)
)
AND (
next_retry_at IS NULL
OR next_retry_at <= NOW()
)
ORDER BY created_at
LIMIT ?
FOR UPDATE SKIP LOCKED
)
RETURNING *

This ensures that:

Retrying on transient publish failures

When publishing fails, the worker no longer immediately marks the event as failed.

Instead, it schedules a retry by resetting the event to pending and pushing next_retry_at into the future.

func (o *Outbox) scheduleRetry(ctx context.Context, event *model.OutboxEvent) {
backoff := backoff(event.RetryCount, o.config.RetryDelaySeconds)

err := o.outboxEventService.UpdateState(ctx, event.ID, map[string]interface{}{
"status": model.OutboxEventStatusPending,
"retry_count": event.RetryCount + 1,
"next_retry_at": time.Now().Add(backoff),
"locked_at": nil,
"locked_by": nil,
})
if err != nil {
o.log.Error("Failed to schedule retry for event",
logger.Field{Key: "error", Value: err.Error()},
)
}
}

Key points:

Exponential backoff with jitter

Retries use exponential backoff with jitter to avoid retry storms when multiple workers fail at once:

func backoff(retryCount int, baseDelay time.Duration) time.Duration {
jitter := time.Duration(randInt(0, 1000)) * time.Millisecond
return time.Duration(1<<uint(retryCount))*baseDelay + jitter
}

func randInt(min int, max int) int {
return min + rand.Intn(max-min)
}

This gives us:

Retries are capped (e.g. max 3 attempts) to prevent infinite loops.

Moving permanently failed events to a DLQ

A Dead Letter Queue (DLQ) is a dedicated queue for messages that could not be processed successfully and should no longer be retried.
Its role is not automatic recovery, but isolation, visibility, and controlled handling of bad events.

Not all failures are transient.

If an outbox event exceeds the maximum retry count, it is treated as permanently failed and published to a DLQ instead of the normal exchange.

func (o *Outbox) publishToDLQ(
ctx context.Context,
ch *amqp091.Channel,
event *model.OutboxEvent,
procErr error,
) {
dlqEvent := map[string]interface{}{
"event_id": event.ID,
"event_key": event.EventKey,
"payload": event.Payload,
"failed_at": time.Now(),
"failure_reason": procErr.Error(),
}

err := o.rabbitmq.Publish(
ctx,
&rabbitmq.PublishOpts{
Ch: ch,
Exchange: o.amqpConfig.DLX,
RoutingKey: o.amqpConfig.DLQ,
Body: dlqEvent,
Headers: amqp091.Table{"message_id": event.ID},
},
)
if err != nil {
o.log.Error("Failed to publish event to DLQ", logger.Field{Key: "error", Value: err.Error()})
return
}
}

We use a dedicated exchange:

outbox.dlx

Instead of reusing the original event routing key (e.g. order.created), we publish using the DLQ name itself as the routing key:

order-service.dlq

The Order Service declares and binds:

Queue: order-service.dlq
Binding key: order-service.dlq

Using the DLQ name as the routing key makes DLQ ownership explicit and keeps failed events isolated per microservice, without overloading domain event routing keys with failure semantics.

Once published to the DLQ, the outbox event is marked as failed and is no longer eligible for retries.

If DLQ publishing fails, the event remains failed and is surfaced via logs and metrics. DLQs are an operational signal, not a delivery guarantee.

At this point, retries are no longer a best-effort loop, they’re a durable, observable state machine, with the outbox as the single source of truth.

Consumer-Side Retries Using Dead-Letter Exchanges

Once a message is successfully published, failures shift from the broker to the consumer. These failures fall into two broad categories:

Retrying both the same way is a mistake. Retrying permanent failures wastes resources, while retrying transient failures incorrectly can amplify load and slow down the entire system.

For consumer-side retries, we rely on RabbitMQ dead-letter exchanges (DLX) instead of retrying in application code.

Retry Topology

Retries are modeled as explicit retry levels, each with its own delay:

RetryConfig{
Levels: []RetryLevel{
{"retry.30s", 30 * time.Second},
{"retry.1m", 1 * time.Minute},
{"retry.5m", 5 * time.Minute},
},
}

Each retry level consists of:

When a message expires in a retry queue, RabbitMQ automatically republishes it to the original exchange using the same routing key.

This design allows a single retry topology to handle multiple event types (order.created, order.updated, etc.) without creating per-routing-key queues.

Classifying Failures

The consumer explicitly signals whether a failure is retryable:

var (
ErrPermanent = errors.New("permanent error")
ErrTransient = errors.New("transient error")
)

Examples:

if err := json.Unmarshal(message.Body, &body); err != nil {
return fmt.Errorf("%w: failed to unmarshal message body", ErrPermanent)
}

This keeps retry decisions deterministic and easy to reason about.

Retrying a Message

Retries are handled by republishing the message to the next retry exchange.

Retry state is tracked via headers:

headers["x-retry-count"] = retryCount + 1

Publish(
exchange: "outbox.events.retry.30s",
routingKey: message.RoutingKey,
)

The original message is always ACKed, regardless of success or failure.

We never rely on broker re-delivery. Retries are explicit and fully controlled by retry queues.

Why There Is No Native Jitter on the Consumer Side

TTL-based retries in RabbitMQ are deterministic: messages expire at fixed intervals and are re-queued at the same time they were originally published. Unlike producer-side retry logic, consumers do not have a natural place to introduce per-message jitter without delaying acknowledgements or holding messages in memory.

As a result, when many messages fail simultaneously, retries tend to align and create short-lived spikes in traffic. In high-throughput systems, one way to mitigate this is to shard retry queues per retry level and route messages deterministically using a hash of message identity. This spreads retries over time without introducing artificial sleeps or per-message jitter, though it’s generally better to observe retry behavior and throughput in production before optimizing the retry strategy further.

When Retries Are Exhausted

If no retry level remains, the message is treated as permanently failed and published to a dead-letter queue (DLQ):

Example:

Exchange: outbox.dlx
Queue: notification-service.dlq
Routing key: notification-service.dlq

At this point, the failure is considered operational or data-related and requires manual inspection.

What To Do With Messages in the DLQ

A DLQ is not a trash can. It’s a decision point.

Once a message lands in the DLQ, automated retries have stopped and the system is intentionally asking for human judgment.

In practice, DLQ handling usually falls into three categories:

1. Inspect and discard

Some messages are fundamentally invalid:

After inspection, these events are discarded, and the root cause is fixed upstream.
The DLQ acts as a safety net, not a backlog.

2. Fix data and replay

Some failures are recoverable with intervention:

These messages can be:

Because DLQ messages preserve payload and metadata, replay is deliberate, not accidental.

3. Escalate operationally

A growing DLQ is often a system-level signal:

In this case, the DLQ is less about the messages themselves and more about triggering alerts, dashboards, and on-call investigation.

DLQs shift failure from automation to operations.
They make failures visible, bounded, and actionable. which is exactly what a reliable system needs.

Retry & Failure Metrics

In the previous article, we focused on whether events are flowing at all.
Once retries and DLQs enter the picture, the question shifts slightly:

Are failures behaving the way we expect them to?

Retries are not inherently bad but unbounded or misaligned retries are.
The following metrics make retry behavior visible, bounded, and explainable, without turning this into an observability deep dive.

Retry attempts

outbox_retries_total

This counter tracks how many times an outbox event is scheduled for retry.

It answers a simple but important question:

How often are we failing, but recovering later?

A low, steady retry rate is usually acceptable.
A sudden spike often indicates:

This metric is incremented only when an event fails to publish and is rescheduled, not when it is first written.

metrics.OutboxRetriesTotal.Inc()

In practice, an alert would trigger only if retry spikes correlate with other symptoms, such as:

Retries without backlog growth usually mean the system is degrading gracefully.

Retry exhaustion

outbox_retry_exhaustions_total

This counter tracks how many events have exceeded the maximum retry limit.

At this point, the system has made a conscious decision:

This event will not be published through the normal path.

A non-zero value here is expected over long periods.
However, a sudden increase usually signals a persistent failure, while sustained growth means retries are no longer helping and are only delaying failure.

This metric is incremented exactly once per event, when retry limits are exceeded.

metrics.OutboxRetryExhaustionsTotal.Inc()

In practice, an alert would trigger if:

Dead-letter publishing

Once retries are exhausted, events are sent to a Dead Letter Queue (DLQ).

We track this explicitly using two counters:

outbox_dlq_published_total
outbox_dlq_publish_failed_total

These metrics distinguish between:

metrics.OutboxDLQPublishedTotal.Inc()
metrics.OutboxDLQPublishFailedTotal.Inc()

If outbox_dlq_publish_failed_total increases, it usually indicates:

In practice, an alert would trigger if:

Consumer-side retry & DLQ visibility

On the consumer side, we also track retry and DLQ behavior, alongside core processing metrics like:

Together, these metrics feed into a dedicated Consumer Processing Dashboard, giving a clear view of throughput, failure rates, and terminal outcomes without overloading the system with cardinality-heavy signals.

Separating concerns with dedicated dashboards

To avoid mixing responsibilities, we intentionally use two focused dashboards:

Outbox Reliability Dashboard

This dashboard answers one core question:

Is the event publishing pipeline healthy?

It focuses on:

Consumer Processing Dashboard

This dashboard answers a different question:

Are downstream business operations succeeding?

It focuses on:

Keeping these dashboards separate prevents alert fatigue and makes ownership clearer:
outbox issues are usually infrastructure or configuration problems, while consumer issues tend to be business logic or dependency failures.

Together, they provide a complete picture of the event lifecycle, from transactional write to final business outcome without overloading any single view.

Tracing & Outcome Visibility

In this article, we deliberately keep tracing lightweight.

There are no new spans for retries, DLQ hops, or internal helpers. Instead, we focus on making the existing spans more expressive by attaching the right attributes.

This keeps traces readable while still answering the questions you actually ask during incidents.

Outcomes as first-class signals

Internally, each processing attempt is classified into one of a small, well-defined set of outcomes:

While counters remain intentionally simple, these outcomes are attached as trace attributes, allowing us to reason about individual event lifecycles.

This avoids exploding metric cardinality while still preserving detail in traces.

Enriching spans with outcomes

Both the outbox publisher and the consumer record a single span per message, and attach an explicit outcome to it.

For the outbox:

attribute.String("outbox.outcome", "published|retry|dlq|failed")

For the consumer:

attribute.String("consumer.outcome", "success|retry|dlq|failed")

This makes it trivial to:

Instead of inferring intent from error messages or span status, the outcome is made explicit.

Retry visibility without span explosion

Retries are represented via attributes rather than additional spans:

attribute.Int("event.retry_count", event.RetryCount) //Outbox
attribute.Int("rabbitmq.retry_count", getRetryCount(message.Headers)) //Consumer

This avoids a common pitfall where retry-heavy systems generate deeply nested or noisy traces.

You still get full visibility into retry behavior, but without sacrificing trace clarity.

Trace–log correlation

While the dashboards in this article focus on metrics, the codebase also supports trace–log correlation.

All log statements accept a context via:

logger.WithContext(ctx)

When enabled, this automatically attaches trace and span IDs to logs, allowing you to:

For brevity, we don’t include log-based dashboards here, but the hooks are intentionally present so readers can experiment and extend the system if needed.

CDC-Based Outbox as an Alternative

Another way to implement the outbox pattern is by combining it with Change Data Capture (CDC).

Instead of running a background worker that periodically polls the outbox_events table, CDC-based approaches stream database changes directly from the database’s write-ahead log (WAL) into a message broker. Tools like Debezium can capture inserts into the outbox table and publish them to Kafka topics with very low latency.

At a high level, the flow looks like this:

From an application perspective, the outbox table remains exactly the same. Domain events are still created explicitly, with well-defined payloads and clear business intent. The difference lies entirely in how those events leave the database.

What CDC Changes

With CDC, new outbox rows are emitted almost immediately after the transaction commits. There is no polling interval, no periodic SELECT queries, and no need to tune batch sizes. Event delivery becomes effectively push-based, driven by database writes rather than background scans.

This can be especially attractive when:

In practice, CDC turns the outbox into a streaming source, while preserving the core benefits of the transactional outbox pattern: atomic writes, durable state, and explicit event lifecycles.

Trade-offs and Operational Cost

While CDC reduces polling-related concerns, it introduces a different kind of complexity.

A CDC-based outbox typically requires:

These components add operational surface area and failure modes that don’t exist in a simple polling-based design. Connector configuration, schema evolution, topic management, and offset handling become part of the system’s responsibility.

For many teams, this is a reasonable trade-off. For others, especially smaller teams or services with moderate throughput, the added infrastructure may outweigh the latency benefits.

Choosing Between Polling and CDC

Both polling-based and CDC-based outbox implementations solve the same core problem: reliably publishing domain events without breaking transactional guarantees.

Polling favors simplicity, explicit control, and minimal infrastructure.
CDC favors lower latency and push-based delivery, at the cost of operational complexity.

There is no universally “correct” choice. The right approach depends on scale, team maturity, and operational constraints.

As with most distributed systems patterns, real-world implementations may vary in complexity. The comparison here is intentionally simplified to highlight trade-offs, not prescribe a one-size-fits-all solution.

Conclusion

In this article, we extended the transactional outbox pattern with producer-side retries, consumer-side retries, and dead-letter handling, while adding observability through metrics, dashboards, and distributed tracing. With these building blocks, the outbox becomes robust, debuggable, and ready to handle transient and permanent failures gracefully.

The provided Prometheus metrics and Jaeger traces give you visibility into backlog, retry attempts, publish outcomes, and DLQ traffic, while logger.WithContext() allows logs to correlate with spans for deeper insight. Together, these tools make it easy to monitor and troubleshoot the system, and even experiment with load, stress, or chaos testing on your own.

Thanks for reading! If you have any questions or insights, please leave a comment.
You can also check out the full working example on GitHub.


Transactional Outbox with RabbitMQ (Part 2): Handling Retries, Dead-Letter Queues, and Observability was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

This article was originally published on Level Up Coding and is republished here under RSS syndication for informational purposes. All rights and intellectual property remain with the original author. If you are the author and wish to have this article removed, please contact us at [email protected].

NexaPay — Accept Card Payments, Receive Crypto

No KYC · Instant Settlement · Visa, Mastercard, Apple Pay, Google Pay

Get Started →