By Infinimus · Published April 10, 2026 · 5 min read · Source: Fintech Tag

I Built a Production-Grade Distributed Payment System — Here’s What I Learned About Fintech Reliability

“Circuit breakers, smart retries, reconciliation engines, and anomaly detection — built from scratch in Node.js”

Every fintech engineering team I’ve studied builds the same four things from scratch. And every time, at least one of them breaks in production in an expensive way.

I decided to build them properly, once, and document every decision. This is that write-up.

The Problem Statement

Here are the four failure modes I set out to solve:

1. Cascading failures

Your payment service calls your wallet service. Wallet service goes down. Now your payment service is also effectively down, because every request is waiting 10 seconds for a timeout. One service failure becomes a system failure.

2. Unintelligent retries

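Your retry logic retries everything. Network error? Retry. Insufficient funds? Also retry, even though that request can never succeed. Worse, blindly retrying a debit that timed out after it had already gone through charges the customer twice. This is how fintech companies lose money and users.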

3. Silent reconciliation mismatches

A payment gets marked COMPLETED in your database. But the wallet debit never happened, maybe because the wallet service crashed mid-transaction. You have no way to know. The customer sees a completed payment, the money never moved, and your support team finds out three days later.

4. Undetected fraud patterns

Same user, 10 payments in 30 seconds. Card testing attack. Nobody noticed until the chargebacks arrived.

The Architecture

Three decoupled microservices:

| Service | Responsibility |
| --- | --- |
| API Gateway | Auth, rate limiting, request routing |
| Payment Service | Payment lifecycle, circuit breaker, anomaly detection, reconciliation |
| Wallet Service | Balance management, append-only ledger, ACID transactions |

Infrastructure: PostgreSQL (separate databases per service), Redis (idempotency, locking, anomaly tracking), Docker.

Solution 1: Circuit Breaker

The pattern is simple. Track failures. After a threshold, stop calling the failing service entirely. After a timeout, test with one request. If it succeeds, resume normal operation.

Three states: CLOSED (normal), OPEN (fast-failing), HALF_OPEN (testing recovery).

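class CircuitBreaker {
  // Excerpt: the constructor, executeWithTimeout, onSuccess and onFailure are not shown here.
  async execute(fn) {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - this.lastFailureTime;
      if (elapsed < this.options.recoveryTimeout) {
        // Still inside the recovery window: fail fast without calling the service
        throw new Error(
          `Circuit OPEN. Retry after ${
            Math.ceil((this.options.recoveryTimeout - elapsed) / 1000)
          }s`
        );
      }
      // Recovery window has elapsed: let a trial request through
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
}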
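The excerpt calls three helpers it does not show. The sketch below is one way they could look, not the code from the repository. Every option name and default (`failureThreshold`, `successThreshold`, `callTimeout`) is an assumption made for illustration, and the class is named `CircuitBreakerSketch` to keep it apart from the real one.

```javascript
// Sketch only: option names and defaults are assumptions, not the author's values.
class CircuitBreakerSketch {
  constructor(options = {}) {
    this.options = {
      failureThreshold: 3,    // assumed: consecutive failures before opening
      successThreshold: 2,    // assumed: trial successes needed to close again
      recoveryTimeout: 10000, // ms to stay OPEN before allowing a trial call
      callTimeout: 2000,      // assumed: ms before a single call is abandoned
      ...options,
    };
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = 0;
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - this.lastFailureTime;
      if (elapsed < this.options.recoveryTimeout) {
        throw new Error('Circuit OPEN. Failing fast.');
      }
      this.state = 'HALF_OPEN';
      this.successCount = 0;
    }
    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  // Race the call against a timer so a hung dependency counts as a failure.
  executeWithTimeout(fn) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timeout')), this.options.callTimeout);
    });
    return Promise.race([Promise.resolve().then(fn), timeout])
      .finally(() => clearTimeout(timer));
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.successCount += 1;
      if (this.successCount >= this.options.successThreshold) {
        this.state = 'CLOSED';
      }
    }
  }

  onFailure() {
    this.failureCount += 1;
    this.successCount = 0;
    this.lastFailureTime = Date.now();
    // Any failure during a trial, or too many in a row, opens the circuit.
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.options.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}
```

One caveat with this sketch: while the circuit is HALF_OPEN it lets every concurrent caller through, not just one. A stricter version would admit a single trial request and fail the rest fast until that trial resolves.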

I tested this under a real wallet service outage. The full lifecycle:

CLOSED → OPEN → HALF_OPEN → CLOSED

After the wallet service recovered, two successful payments brought the circuit back to closed with uptime: 100%.

Key insight: The circuit breaker sits outside the retry loop. If the circuit is open, we don’t even attempt retries. This is the order that matters.

Solution 2: Smart Retry with Error Classification

The most important thing about retries is knowing when not to retry.

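// Business errors: retrying cannot change the outcome
const NON_RETRYABLE = [
  'insufficient_balance',
  'card_expired',
  'account_blocked',
  'fraud_detected',
  'wallet_not_found',
];

// Transient infrastructure errors: a later attempt may succeed
const RETRYABLE = [
  'timeout',
  'econnrefused',
  'service_unavailable',
  '503', '429',
];

function classifyError(error) {
  // Tolerate errors that carry no message
  const msg = String(error?.message ?? '').toLowerCase();
  // Non-retryable patterns are checked first, so they win if both lists match
  if (NON_RETRYABLE.some(p => msg.includes(p))) return 'NON_RETRYABLE';
  if (RETRYABLE.some(p => msg.includes(p))) return 'RETRYABLE';
  return 'UNKNOWN';
}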

Non-retryable errors abort immediately. Retryable errors get exponential backoff with jitter.

Jitter is critical: without it, all retrying services hit the recovering service at the same moment, creating a thundering herd.

const delay = opts.jitter
  ? cappedDelay * (0.5 + Math.random() * 0.5)
  : cappedDelay;
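The snippet above uses `cappedDelay` without showing where it comes from. A minimal sketch of the surrounding retry loop might look like this. The defaults (`maxAttempts`, `baseDelayMs`, `maxDelayMs`) are assumptions, `classify` is passed in so the block stands alone, and treating `UNKNOWN` errors as not retryable is also an assumption, since the article does not say how they are handled.

```javascript
// Assumed defaults; the article does not state the real values.
const RETRY_DEFAULTS = { maxAttempts: 4, baseDelayMs: 100, maxDelayMs: 2000, jitter: true };

// Delay before retry number `attempt` (1-based): exponential, capped, then jittered.
// `random` is injectable so the jitter can be made deterministic.
function backoffDelay(attempt, opts, random = Math.random) {
  const exponential = opts.baseDelayMs * 2 ** (attempt - 1);
  const cappedDelay = Math.min(exponential, opts.maxDelayMs);
  return opts.jitter
    ? cappedDelay * (0.5 + random() * 0.5) // between 50% and 100% of the cap
    : cappedDelay;
}

// Calls `fn` until it succeeds, hits a non-retryable error, or runs out of attempts.
async function retryWithBackoff(fn, classify, options = {}) {
  const opts = { ...RETRY_DEFAULTS, ...options };
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Assumption: only errors classified RETRYABLE are retried.
      const retryable = classify(err) === 'RETRYABLE';
      if (!retryable || attempt >= opts.maxAttempts) throw err;
      const delay = backoffDelay(attempt, opts);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

To keep the ordering described in Solution 1, the whole retry loop would run inside the breaker, along the lines of `breaker.execute(() => retryWithBackoff(callWallet, classifyError))`, where `callWallet` is a hypothetical function that performs the wallet request.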

Solution 3: Reconciliation Engine

This is the one most teams skip. It is also the one that causes the most expensive production incidents.

The reconciliation engine runs across all payments in a time window and verifies each one against the wallet ledger:

// Method of the reconciliation engine class (excerpt)
async _checkPayment(payment) {
  // Assumes the payment record carries its own status and walletId
  const { status, walletId } = payment;

  // COMPLETED payment must have ledger entry
  if (status === 'COMPLETED') {
    const ledgerExists = await this.walletClient.verifyLedgerEntry(
      walletId,
      payment.gatewayTransactionId
    );
    if (!ledgerExists) {
      return {
        status: 'MISMATCH',
        reason: 'COMPLETED_LEDGER_MISSING',
        severity: 'CRITICAL',
      };
    }
  }

  // FAILED payment must NOT have ledger entry
  if (status === 'FAILED' && payment.gatewayTransactionId) {
    const ledgerExists = await this.walletClient.verifyLedgerEntry(
      walletId,
      payment.gatewayTransactionId
    );
    if (ledgerExists) {
      return {
        status: 'MISMATCH',
        reason: 'FAILED_BUT_DEBITED',
        severity: 'CRITICAL', // Double charge risk
      };
    }
  }

  // Nothing inconsistent found for this payment
  return { status: 'MATCHED' };
}

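Four mismatch types are detected, across two severity levels:

🔴 CRITICAL (includes COMPLETED_LEDGER_MISSING and FAILED_BUT_DEBITED from the check above)

🟡 MEDIUM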

Run it on demand:

GET /api/payments/reconcile/run?from=2026-01-01&to=2026-01-02

Result on 7 payments: matched: 7, mismatched: 0, durationMs: 17.
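For readers wondering how a per-payment check turns into a report like that, here is a rough sketch of the outer loop. It is not the engine from the repository: `runReconciliation`, the `createdAt` field, and the shape of the `payments` array are all hypothetical, and `checkPayment` stands in for any async function that returns a `MISMATCH` object when something is wrong.

```javascript
// Hypothetical runner. `checkPayment` is any async function returning
// { status: 'MISMATCH', reason, severity } for a bad payment; any other
// return value (including undefined) is counted as a match.
async function runReconciliation(payments, checkPayment, from, to) {
  const startedAt = Date.now();

  // Keep only payments created inside the half-open window [from, to)
  const inWindow = payments.filter(p => p.createdAt >= from && p.createdAt < to);

  const mismatches = [];
  for (const payment of inWindow) {
    const result = await checkPayment(payment);
    if (result && result.status === 'MISMATCH') {
      mismatches.push({ paymentId: payment.id, ...result });
    }
  }

  return {
    total: inWindow.length,
    matched: inWindow.length - mismatches.length,
    mismatched: mismatches.length,
    mismatches,
    durationMs: Date.now() - startedAt,
  };
}
```

The checks run one at a time here to keep the sketch simple. With a large window you would likely batch them or bound the concurrency so the wallet service is not flooded by its own auditor.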

Solution 4: Anomaly Detection

Rule-based, Redis-backed, runs on every payment creation. Four rules:

const rules = {
  velocityLimit: 5,            // 5 payments from same user in 60s (the test below flagged payment 5)
  velocityWindowSec: 60,
  largeAmountThreshold: 5000,  // Single payment >5000
  failedStreakLimit: 3,        // >3 consecutive failures same user
  duplicateWindowSec: 300,     // Same amount 3x in 5 minutes
  duplicateCountLimit: 3,
};

I tested this with a velocity attack: 6 rapid payments from the same user. The detector flagged payment 5 at severity: HIGH:

{
  "flagged": true,
  "reason": "VELOCITY_EXCEEDED: 5 payments in 60s",
  "severity": "HIGH"
}

Payment 6 triggered both VELOCITY_EXCEEDED and DUPLICATE_AMOUNT simultaneously.
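To make the velocity rule concrete, here is a small sketch of a sliding-window counter. It keeps timestamps in an in-memory `Map` as a stand-in for the Redis state the real detector uses, so it only illustrates the logic. `VelocityRule` and its method names are hypothetical, and flagging at exactly `velocityLimit` payments is an assumption chosen to match the output above, where payment 5 was the first one flagged.

```javascript
// In-memory stand-in for the Redis-backed state: userId -> payment timestamps (ms).
class VelocityRule {
  constructor({ velocityLimit = 5, velocityWindowSec = 60 } = {}) {
    this.limit = velocityLimit;
    this.windowSec = velocityWindowSec;
    this.seen = new Map();
  }

  // Records one payment for the user and reports whether the limit was reached.
  // `now` is injectable so the window can be exercised without waiting.
  check(userId, now = Date.now()) {
    const cutoff = now - this.windowSec * 1000;

    // Drop timestamps that have slid out of the window, then add this payment
    const recent = (this.seen.get(userId) || []).filter(t => t > cutoff);
    recent.push(now);
    this.seen.set(userId, recent);

    // Assumption: flag once the count reaches the limit (>=), not only above it
    if (recent.length >= this.limit) {
      return {
        flagged: true,
        reason: `VELOCITY_EXCEEDED: ${recent.length} payments in ${this.windowSec}s`,
        severity: 'HIGH',
      };
    }
    return { flagged: false };
  }
}
```

A process-local `Map` stops working as soon as there is more than one payment service instance, which is presumably why the real implementation keeps this state in Redis.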

Important design decision: Anomaly detection is non-blocking. The payment still gets created, but it is flagged in the logs and is queryable via API. This is the right trade-off: you don’t want to block legitimate payments due to false positives, but you do want visibility.

GET /api/payments/anomaly/check/:userId

Performance

Load tested with k6, 50 concurrent virtual users, 2-minute ramp:

| Metric | Result |
| --- | --- |
| Throughput | 51 req/sec |
| p50 latency | 9ms |
| p90 latency | 19ms |
| p95 latency | 32ms |
| Max latency | 177ms |

What I Would Do Differently

Kafka instead of Redis Pub/Sub for events. Redis Pub/Sub is fire-and-forget — if a consumer is down, the event is lost. For financial events, you want Kafka’s durability and replay capability.

Scheduled reconciliation. Right now reconciliation is on-demand. A production system needs it running every 5 minutes automatically, with alerts piped to Slack or PagerDuty.
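As a rough idea of what that could look like inside a single Node.js process, the sketch below wraps a reconciliation run in a guarded tick function. Everything here is hypothetical: `createReconciliationTick`, `runOnce`, and `sendAlert` are illustrative names, and the alert callback stands in for whatever Slack or PagerDuty integration you use.

```javascript
// Hypothetical wiring: `runOnce` performs one reconciliation pass and resolves
// to a report with a `mismatched` count; `sendAlert` delivers a notification.
function createReconciliationTick(runOnce, sendAlert) {
  let running = false;

  return async function tick() {
    // Skip this tick if the previous run is still in flight
    if (running) return 'SKIPPED';
    running = true;
    try {
      const report = await runOnce();
      if (report.mismatched > 0) await sendAlert(report);
      return 'DONE';
    } catch (err) {
      // A failed reconciliation run is itself worth alerting on
      await sendAlert({ error: err.message });
      return 'FAILED';
    } finally {
      running = false;
    }
  };
}

// Usage, with the 5-minute interval suggested above:
// const timer = setInterval(createReconciliationTick(runOnce, sendAlert), 5 * 60 * 1000);
```

The guard only protects one process. With several payment service instances you would also need a distributed lock or a single dedicated scheduler so that runs do not overlap across instances.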

ML-based anomaly detection. Rule-based detection catches known patterns. ML-based detection catches unknown ones. The rules are a starting point, not an endpoint.

Everything is open source.

GitHub: github.com/Infinimus01/distributed-payment-system-

Stack: Node.js · PostgreSQL · Redis · Docker

If you have feedback on the architecture or spot something I got wrong, drop a comment. I’m genuinely interested.

Amlendu Pandey — Backend Engineer

LinkedIn: linkedin.com/in/amlendupandey16

GitHub: github.com/Infinimus01

This article was originally published on Fintech Tag and is republished here under RSS syndication for informational purposes. All rights and intellectual property remain with the original author. If you are the author and wish to have this article removed, please contact us at [email protected].
