I Built a Production-Grade Distributed Payment System — Here’s What I Learned About Fintech Reliability
--
“Circuit breakers, smart retries, reconciliation engines, and anomaly detection — built from scratch in Node.js”
Every fintech engineering team I’ve studied builds the same four things from scratch. And every time, at least one of them breaks in production in an expensive way.
I decided to build them properly, once, and document every decision. This is that write-up.
The Problem Statement
Here are the four failure modes I set out to solve:
1. Cascading failures
Your payment service calls your wallet service. Wallet service goes down. Now your payment service is also effectively down, because every request is waiting 10 seconds for a timeout. One service failure becomes a system failure.
2. Unintelligent retries
Your retry logic retries everything. Network error? Retry. Insufficient funds? Also retry. The second case charges the customer twice. This is how fintech companies lose money and users.
3. Silent reconciliation mismatches
A payment gets marked COMPLETED in your database. But the wallet debit never happened — maybe the wallet service crashed mid-transaction. You have no way to know. The customer was charged, the money never moved, and your support team finds out three days later.
4. Undetected fraud patterns
Same user, 10 payments in 30 seconds. Card testing attack. Nobody noticed until the chargebacks arrived.
The Architecture
Three decoupled microservices:
- API Gateway — Auth, rate limiting, request routing
- Payment Service — Payment lifecycle, all four reliability patterns
- Wallet Service — Balance management, append-only ledger, ACID transactions
Infrastructure: PostgreSQL (separate databases per service), Redis (idempotency, locking, anomaly tracking), Docker.
| Service | Responsibility |
| --- | --- |
| API Gateway | Auth, rate limiting, request routing |
| Payment Service | Payment lifecycle, circuit breaker, anomaly detection, reconciliation |
| Wallet Service | Balance management, append-only ledger, ACID transactions |
Solution 1: Circuit Breaker
The pattern is simple. Track failures. After a threshold, stop calling the failing service entirely. After a timeout, test with one request. If it succeeds, resume normal operation.
Three states: CLOSED (normal), OPEN (fast-failing), HALF_OPEN (testing recovery).
```javascript
class CircuitBreaker {
  async execute(fn) {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - this.lastFailureTime;
      if (elapsed < this.options.recoveryTimeout) {
        throw new Error(
          `Circuit OPEN. Retry after ${Math.ceil(
            (this.options.recoveryTimeout - elapsed) / 1000
          )}s`
        );
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
}
```

I tested this under a real wallet service outage. The full lifecycle:
```
CLOSED → OPEN → HALF_OPEN → CLOSED
```

After the wallet service recovered, two successful payments brought the circuit back to CLOSED with uptime: 100%.
Key insight: The circuit breaker sits outside the retry loop. If the circuit is open, we don’t even attempt retries. This is the order that matters.
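To make that ordering concrete, here is a minimal, self-contained sketch of the composition. `SimpleBreaker` and `withRetries` are illustrative stand-ins, not the repo's classes:

```javascript
// Illustrative sketch: the breaker wraps the WHOLE retry loop. One exhausted
// retry batch counts as a single breaker failure; once OPEN, retries never start.
class SimpleBreaker {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.failures = 0;
    this.state = 'CLOSED';
  }
  async execute(fn) {
    if (this.state === 'OPEN') {
      throw new Error('Circuit OPEN: fast fail, no retries attempted');
    }
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) this.state = 'OPEN';
      throw err;
    }
  }
}

// A plain retry loop; it only runs if the breaker let the call through.
async function withRetries(fn, attempts = 3) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try { return await fn(); } catch (err) { lastErr = err; }
  }
  throw lastErr;
}

// Correct order: breaker on the outside, retries on the inside.
// const pay = () => breaker.execute(() => withRetries(callWalletService));
```

With the order reversed, a single logical call could record several breaker failures and keep retrying against a circuit that has already decided to fail fast.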
Solution 2: Smart Retry with Error Classification
The most important thing about retries is knowing when not to retry.
```javascript
const NON_RETRYABLE = [
  'insufficient_balance',
  'card_expired',
  'account_blocked',
  'fraud_detected',
  'wallet_not_found',
];

const RETRYABLE = [
  'timeout',
  'econnrefused',
  'service_unavailable',
  '503', '429',
];

function classifyError(error) {
  const msg = error.message.toLowerCase();
  if (NON_RETRYABLE.some(p => msg.includes(p))) return 'NON_RETRYABLE';
  if (RETRYABLE.some(p => msg.includes(p))) return 'RETRYABLE';
  return 'UNKNOWN';
}
```
Non-retryable errors abort immediately. Retryable errors get exponential backoff with jitter.
Jitter is critical: without it, all retrying services hit the recovering service at the same moment, creating a thundering herd.
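Put together, the classifier plus capped exponential backoff might look like the sketch below. I've repeated a trimmed classifier so the snippet is self-contained; the names and defaults (`retryWithBackoff`, `maxRetries = 3`) are mine, not necessarily the repo's:

```javascript
// Sketch: capped exponential backoff with jitter, driven by error classification.
// Trimmed classifier repeated here so the snippet runs on its own.
const NON_RETRYABLE = ['insufficient_balance', 'card_expired'];
const RETRYABLE = ['timeout', 'econnrefused', 'service_unavailable'];

function classifyError(error) {
  const msg = error.message.toLowerCase();
  if (NON_RETRYABLE.some(p => msg.includes(p))) return 'NON_RETRYABLE';
  if (RETRYABLE.some(p => msg.includes(p))) return 'RETRYABLE';
  return 'UNKNOWN';
}

async function retryWithBackoff(fn, opts = {}) {
  const { maxRetries = 3, baseDelayMs = 200, maxDelayMs = 5000, jitter = true } = opts;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Abort immediately on non-retryable errors or when attempts are exhausted
      if (classifyError(err) === 'NON_RETRYABLE' || attempt >= maxRetries) throw err;
      // Exponential growth, capped, then randomized to avoid a thundering herd
      const cappedDelay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      const delay = jitter ? cappedDelay * (0.5 + Math.random() * 0.5) : cappedDelay;
      await new Promise(res => setTimeout(res, delay));
    }
  }
}
```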
```javascript
const delay = opts.jitter
  ? cappedDelay * (0.5 + Math.random() * 0.5)
  : cappedDelay;
```

Solution 3: Reconciliation Engine
This is the one most teams skip. It is also the one that causes the most expensive production incidents.
The reconciliation engine runs across all payments in a time window and verifies each one against the wallet ledger:
```javascript
async _checkPayment(payment) {
  const { status, walletId } = payment;

  // A COMPLETED payment must have a matching ledger entry
  if (status === 'COMPLETED') {
    const ledgerExists = await this.walletClient.verifyLedgerEntry(
      walletId,
      payment.gatewayTransactionId
    );
    if (!ledgerExists) {
      return {
        status: 'MISMATCH',
        reason: 'COMPLETED_LEDGER_MISSING',
        severity: 'CRITICAL',
      };
    }
  }

  // A FAILED payment must NOT have a ledger entry
  if (status === 'FAILED' && payment.gatewayTransactionId) {
    const ledgerExists = await this.walletClient.verifyLedgerEntry(
      walletId,
      payment.gatewayTransactionId
    );
    if (ledgerExists) {
      return {
        status: 'MISMATCH',
        reason: 'FAILED_BUT_DEBITED',
        severity: 'CRITICAL', // Double charge risk
      };
    }
  }
}
```

Four mismatch types detected:
🔴 CRITICAL
- COMPLETED_LEDGER_MISSING — Payment marked completed, wallet never debited
- FAILED_BUT_DEBITED — Payment failed but wallet was charged (double charge risk)
🟡 MEDIUM
- STUCK_PENDING — Payment pending for more than 10 minutes
- STUCK_PROCESSING — Payment stuck in processing for more than 5 minutes
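The two stuck-state checks are simpler than the ledger checks. Here is a sketch of what they might look like, with thresholds matching the ones above; the `updatedAt` field and the helper name are my assumptions, not the repo's code:

```javascript
// Sketch of the stuck-payment checks. Thresholds mirror the ones described
// in the text; payment.updatedAt is an assumed column name.
const STUCK_PENDING_MS = 10 * 60 * 1000;    // pending for more than 10 minutes
const STUCK_PROCESSING_MS = 5 * 60 * 1000;  // processing for more than 5 minutes

function checkStuck(payment, now = Date.now()) {
  const age = now - new Date(payment.updatedAt).getTime();
  if (payment.status === 'PENDING' && age > STUCK_PENDING_MS) {
    return { status: 'MISMATCH', reason: 'STUCK_PENDING', severity: 'MEDIUM' };
  }
  if (payment.status === 'PROCESSING' && age > STUCK_PROCESSING_MS) {
    return { status: 'MISMATCH', reason: 'STUCK_PROCESSING', severity: 'MEDIUM' };
  }
  return { status: 'MATCHED' };
}
```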
Run it on demand:
```
GET /api/payments/reconcile/run?from=2026-01-01&to=2026-01-02
```

Result on 7 payments: matched: 7, mismatched: 0, durationMs: 17.
Solution 4: Anomaly Detection
Rule-based, Redis-backed, runs on every payment creation. Four rules:
```javascript
const rules = {
  velocityLimit: 5,            // >5 payments from same user in 60s
  velocityWindowSec: 60,
  largeAmountThreshold: 5000,  // Single payment >5000
  failedStreakLimit: 3,        // >3 consecutive failures from same user
  duplicateWindowSec: 300,     // Same amount 3x in 5 minutes
  duplicateCountLimit: 3,
};
```

I tested this with a velocity attack: 6 rapid payments from the same user. The detector flagged payment 5 at severity: HIGH:
```json
{
  "flagged": true,
  "reason": "VELOCITY_EXCEEDED: 5 payments in 60s",
  "severity": "HIGH"
}
```

Payment 6 triggered both VELOCITY_EXCEEDED and DUPLICATE_AMOUNT simultaneously.
Important design decision: Anomaly detection is non-blocking. The payment still gets created, but it is flagged in logs and queryable via the API. This is the right trade-off: you don’t want to block legitimate payments over false positives, but you do want visibility.
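To show the shape of one rule, here is the velocity check as an in-memory sketch. The real detector keeps this state in Redis; the `Map`, the function name, and the return shape here are illustrative only:

```javascript
// In-memory sketch of the velocity rule (the production detector is Redis-backed).
const WINDOW_SEC = 60;
const VELOCITY_LIMIT = 5;
const recentPayments = new Map(); // userId -> array of payment timestamps (ms)

function checkVelocity(userId, now = Date.now()) {
  const cutoff = now - WINDOW_SEC * 1000;
  // Keep only timestamps inside the sliding window, then record this payment
  const times = (recentPayments.get(userId) || []).filter(t => t > cutoff);
  times.push(now);
  recentPayments.set(userId, times);
  if (times.length > VELOCITY_LIMIT) {
    return {
      flagged: true,
      reason: `VELOCITY_EXCEEDED: ${times.length} payments in ${WINDOW_SEC}s`,
      severity: 'HIGH',
    };
  }
  return { flagged: false };
}
```

In Redis the same window can be kept with a counter plus a TTL, or a sorted set of timestamps, so it works across multiple service instances.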
```bash
GET /api/payments/anomaly/check/:userId
```

Performance
Load tested with k6, 50 concurrent virtual users, 2-minute ramp:
| Metric | Result |
| --- | --- |
| Throughput | 51 req/sec |
| p50 latency | 9 ms |
| p90 latency | 19 ms |
| p95 latency | 32 ms |
| Max latency | 177 ms |
What I Would Do Differently
Kafka instead of Redis Pub/Sub for events. Redis Pub/Sub is fire-and-forget — if a consumer is down, the event is lost. For financial events, you want Kafka’s durability and replay capability.
Scheduled reconciliation. Right now reconciliation is on-demand. A production system needs it running every 5 minutes automatically, with alerts piped to Slack or PagerDuty.
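A minimal sketch of what that scheduling could look like, assuming a `runReconciliation` function like the engine exposes and a `sendAlert` webhook wrapper (both placeholders; a production deployment would more likely use a cron runner or a Kubernetes CronJob than a bare `setInterval`):

```javascript
// Sketch: run reconciliation on a fixed interval and alert on any mismatch.
// runReconciliation and sendAlert are placeholder hooks, not repo functions;
// the 5-minute default matches the interval suggested in the text.
function scheduleReconciliation(runReconciliation, sendAlert, intervalMs = 5 * 60 * 1000) {
  const timer = setInterval(async () => {
    try {
      const report = await runReconciliation();
      if (report.mismatched > 0) await sendAlert(report);
    } catch (err) {
      // The reconciliation run itself failing is alert-worthy too
      await sendAlert({ error: err.message });
    }
  }, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop
}
```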
ML-based anomaly detection. Rule-based detection catches known patterns. ML-based detection catches unknown ones. The rules are a starting point, not an endpoint.
Everything is open source.
GitHub: github.com/Infinimus01/distributed-payment-system-
Stack: Node.js · PostgreSQL · Redis · Docker
If you have feedback on the architecture or spot something I got wrong, drop a comment. I’m genuinely interested.
Amlendu Pandey — Backend Engineer
LinkedIn: linkedin.com/in/amlendupandey16
GitHub: github.com/Infinimus01