I Built a Production-Grade Distributed Payment System — Here’s What I Learned About Fintech Reliability
--
“Circuit breakers, smart retries, reconciliation engines, and anomaly detection — built from scratch in Node.js”
Every fintech engineering team I’ve studied builds the same four things from scratch. And every time, at least one of them breaks in production in an expensive way.
I decided to build them properly, once, and document every decision. This is that write-up.
The Problem Statement
Here are the four failure modes I set out to solve:
1. Cascading failures
Your payment service calls your wallet service. Wallet service goes down. Now your payment service is also effectively down, because every request is waiting 10 seconds for a timeout. One service failure becomes a system failure.
2. Unintelligent retries
Your retry logic retries everything. Network error? Retry. Insufficient funds? Also retry. The second case charges the customer twice. This is how fintech companies lose money and users.
3. Silent reconciliation mismatches
A payment gets marked COMPLETED in your database. But the wallet debit never happened — maybe the wallet service crashed mid-transaction. You have no way to know. The customer was charged, the money never moved, and your support team finds out three days later.
4. Undetected fraud patterns
Same user, 10 payments in 30 seconds. Card testing attack. Nobody noticed until the chargebacks arrived.
The Architecture
Three decoupled microservices:
- API Gateway — Auth, rate limiting, request routing
- Payment Service — Payment lifecycle, all four reliability patterns
- Wallet Service — Balance management, append-only ledger, ACID transactions
Infrastructure: PostgreSQL (separate databases per service), Redis (idempotency, locking, anomaly tracking), Docker.
| Service | Responsibility |
| --- | --- |
| API Gateway | Auth, rate limiting, request routing |
| Payment Service | Payment lifecycle, circuit breaker, anomaly detection, reconciliation |
| Wallet Service | Balance management, append-only ledger, ACID transactions |
Solution 1: Circuit Breaker
The pattern is simple. Track failures. After a threshold, stop calling the failing service entirely. After a timeout, test with one request. If it succeeds, resume normal operation.
Three states: CLOSED (normal), OPEN (fast-failing), HALF_OPEN (testing recovery).
```javascript
class CircuitBreaker {
  async execute(fn) {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - this.lastFailureTime;
      if (elapsed < this.options.recoveryTimeout) {
        throw new Error(
          `Circuit OPEN. Retry after ${Math.ceil(
            (this.options.recoveryTimeout - elapsed) / 1000
          )}s`
        );
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
}
```

I tested this under a real wallet service outage. The full lifecycle:
```
CLOSED → OPEN → HALF_OPEN → CLOSED
```

After the wallet service recovered, two successful payments brought the circuit back to CLOSED with uptime: 100%.
Key insight: The circuit breaker sits outside the retry loop. If the circuit is open, we don’t even attempt retries. This is the order that matters.
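To make that ordering concrete, here is a minimal, self-contained sketch of the composition. `SimpleBreaker` and `withRetries` are illustrative stand-ins, not the repo's classes:

```javascript
// Illustrative sketch: the breaker wraps the WHOLE retry loop. One exhausted
// retry batch counts as a single breaker failure; once OPEN, retries never start.
class SimpleBreaker {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.failures = 0;
    this.state = 'CLOSED';
  }
  async execute(fn) {
    if (this.state === 'OPEN') {
      throw new Error('Circuit OPEN: fast fail, no retries attempted');
    }
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) this.state = 'OPEN';
      throw err;
    }
  }
}

// A plain retry loop; it only runs if the breaker let the call through.
async function withRetries(fn, attempts = 3) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try { return await fn(); } catch (err) { lastErr = err; }
  }
  throw lastErr;
}

// Correct order: breaker on the outside, retries on the inside.
// const pay = () => breaker.execute(() => withRetries(callWalletService));
```

With the order reversed, a single logical call could record several breaker failures and keep retrying against a circuit that has already decided to fail fast.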
Solution 2: Smart Retry with Error Classification
The most important thing about retries is knowing when not to retry.
```javascript
const NON_RETRYABLE = [
  'insufficient_balance',
  'card_expired',
  'account_blocked',
  'fraud_detected',
  'wallet_not_found',
];

const RETRYABLE = [
  'timeout',
  'econnrefused',
  'service_unavailable',
  '503', '429',
];

function classifyError(error) {
  const msg = error.message.toLowerCase();
  if (NON_RETRYABLE.some(p => msg.includes(p))) return 'NON_RETRYABLE';
  if (RETRYABLE.some(p => msg.includes(p))) return 'RETRYABLE';
  return 'UNKNOWN';
}
```
Non-retryable errors abort immediately. Retryable errors get exponential backoff with jitter.
Jitter is critical: without it, all retrying services hit the recovering service at the same moment, creating a thundering herd.
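Put together, the classifier plus capped exponential backoff might look like the sketch below. I've repeated a trimmed classifier so the snippet is self-contained; the names and defaults (`retryWithBackoff`, `maxRetries = 3`) are mine, not necessarily the repo's:

```javascript
// Sketch: capped exponential backoff with jitter, driven by error classification.
// Trimmed classifier repeated here so the snippet runs on its own.
const NON_RETRYABLE = ['insufficient_balance', 'card_expired'];
const RETRYABLE = ['timeout', 'econnrefused', 'service_unavailable'];

function classifyError(error) {
  const msg = error.message.toLowerCase();
  if (NON_RETRYABLE.some(p => msg.includes(p))) return 'NON_RETRYABLE';
  if (RETRYABLE.some(p => msg.includes(p))) return 'RETRYABLE';
  return 'UNKNOWN';
}

async function retryWithBackoff(fn, opts = {}) {
  const { maxRetries = 3, baseDelayMs = 200, maxDelayMs = 5000, jitter = true } = opts;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Abort immediately on non-retryable errors or when attempts are exhausted
      if (classifyError(err) === 'NON_RETRYABLE' || attempt >= maxRetries) throw err;
      // Exponential growth, capped, then randomized to avoid a thundering herd
      const cappedDelay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      const delay = jitter ? cappedDelay * (0.5 + Math.random() * 0.5) : cappedDelay;
      await new Promise(res => setTimeout(res, delay));
    }
  }
}
```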
```javascript
const delay = opts.jitter
  ? cappedDelay * (0.5 + Math.random() * 0.5)
  : cappedDelay;
```

Solution 3: Reconciliation Engine
This is the one most teams skip. It is also the one that causes the most expensive production incidents.
The reconciliation engine runs across all payments in a time window and verifies each one against the wallet ledger:
```javascript
async _checkPayment(payment) {
  const { status, walletId } = payment;

  // A COMPLETED payment must have a matching ledger entry
  if (status === 'COMPLETED') {
    const ledgerExists = await this.walletClient.verifyLedgerEntry(
      walletId,
      payment.gatewayTransactionId
    );
    if (!ledgerExists) {
      return {
        status: 'MISMATCH',
        reason: 'COMPLETED_LEDGER_MISSING',
        severity: 'CRITICAL',
      };
    }
  }

  // A FAILED payment must NOT have a ledger entry
  if (status === 'FAILED' && payment.gatewayTransactionId) {
    const ledgerExists = await this.walletClient.verifyLedgerEntry(
      walletId,
      payment.gatewayTransactionId
    );
    if (ledgerExists) {
      return {
        status: 'MISMATCH',
        reason: 'FAILED_BUT_DEBITED',
        severity: 'CRITICAL', // Double charge risk
      };
    }
  }
}
```

Four mismatch types detected:
🔴 CRITICAL
- COMPLETED_LEDGER_MISSING — Payment marked completed, wallet never debited
- FAILED_BUT_DEBITED — Payment failed but wallet was charged (double charge risk)
🟡 MEDIUM
- STUCK_PENDING — Payment pending for more than 10 minutes
- STUCK_PROCESSING — Payment stuck in processing for more than 5 minutes
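The two stuck-state checks are simpler than the ledger checks. Here is a sketch of what they might look like, with thresholds matching the ones above; the `updatedAt` field and the helper name are my assumptions, not the repo's code:

```javascript
// Sketch of the stuck-payment checks. Thresholds mirror the ones described
// in the text; payment.updatedAt is an assumed column name.
const STUCK_PENDING_MS = 10 * 60 * 1000;    // pending for more than 10 minutes
const STUCK_PROCESSING_MS = 5 * 60 * 1000;  // processing for more than 5 minutes

function checkStuck(payment, now = Date.now()) {
  const age = now - new Date(payment.updatedAt).getTime();
  if (payment.status === 'PENDING' && age > STUCK_PENDING_MS) {
    return { status: 'MISMATCH', reason: 'STUCK_PENDING', severity: 'MEDIUM' };
  }
  if (payment.status === 'PROCESSING' && age > STUCK_PROCESSING_MS) {
    return { status: 'MISMATCH', reason: 'STUCK_PROCESSING', severity: 'MEDIUM' };
  }
  return { status: 'MATCHED' };
}
```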
Run it on demand:
```
GET /api/payments/reconcile/run?from=2026-01-01&to=2026-01-02
```

Result on 7 payments: matched: 7, mismatched: 0, durationMs: 17.
Solution 4: Anomaly Detection
Rule-based, Redis-backed, runs on every payment creation. Four rules:
```javascript
const rules = {
  velocityLimit: 5,            // >5 payments from same user in 60s
  velocityWindowSec: 60,
  largeAmountThreshold: 5000,  // Single payment >5000
  failedStreakLimit: 3,        // >3 consecutive failures from same user
  duplicateWindowSec: 300,     // Same amount 3x in 5 minutes
  duplicateCountLimit: 3,
};
```

I tested this with a velocity attack: 6 rapid payments from the same user. The detector flagged payment 5 at severity: HIGH:
```json
{
  "flagged": true,
  "reason": "VELOCITY_EXCEEDED: 5 payments in 60s",
  "severity": "HIGH"
}
```

Payment 6 triggered both VELOCITY_EXCEEDED and DUPLICATE_AMOUNT simultaneously.
Important design decision: Anomaly detection is non-blocking. The payment still gets created, but it is flagged in logs and queryable via the API. This is the right trade-off: you don’t want to block legitimate payments over false positives, but you do want visibility.
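To show the shape of one rule, here is the velocity check as an in-memory sketch. The real detector keeps this state in Redis; the `Map`, the function name, and the return shape here are illustrative only:

```javascript
// In-memory sketch of the velocity rule (the production detector is Redis-backed).
const WINDOW_SEC = 60;
const VELOCITY_LIMIT = 5;
const recentPayments = new Map(); // userId -> array of payment timestamps (ms)

function checkVelocity(userId, now = Date.now()) {
  const cutoff = now - WINDOW_SEC * 1000;
  // Keep only timestamps inside the sliding window, then record this payment
  const times = (recentPayments.get(userId) || []).filter(t => t > cutoff);
  times.push(now);
  recentPayments.set(userId, times);
  if (times.length > VELOCITY_LIMIT) {
    return {
      flagged: true,
      reason: `VELOCITY_EXCEEDED: ${times.length} payments in ${WINDOW_SEC}s`,
      severity: 'HIGH',
    };
  }
  return { flagged: false };
}
```

In Redis the same window can be kept with a counter plus a TTL, or a sorted set of timestamps, so it works across multiple service instances.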
```bash
GET /api/payments/anomaly/check/:userId
```

Performance
Load tested with k6, 50 concurrent virtual users, 2-minute ramp:
| Metric | Result |
| --- | --- |
| Throughput | 51 req/sec |
| p50 latency | 9 ms |
| p90 latency | 19 ms |
| p95 latency | 32 ms |
| Max latency | 177 ms |
What I Would Do Differently
Kafka instead of Redis Pub/Sub for events. Redis Pub/Sub is fire-and-forget — if a consumer is down, the event is lost. For financial events, you want Kafka’s durability and replay capability.
Scheduled reconciliation. Right now reconciliation is on-demand. A production system needs it running every 5 minutes automatically, with alerts piped to Slack or PagerDuty.
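A minimal sketch of what that scheduling could look like, assuming a `runReconciliation` function like the engine exposes and a `sendAlert` webhook wrapper (both placeholders; a production deployment would more likely use a cron runner or a Kubernetes CronJob than a bare `setInterval`):

```javascript
// Sketch: run reconciliation on a fixed interval and alert on any mismatch.
// runReconciliation and sendAlert are placeholder hooks, not repo functions;
// the 5-minute default matches the interval suggested in the text.
function scheduleReconciliation(runReconciliation, sendAlert, intervalMs = 5 * 60 * 1000) {
  const timer = setInterval(async () => {
    try {
      const report = await runReconciliation();
      if (report.mismatched > 0) await sendAlert(report);
    } catch (err) {
      // The reconciliation run itself failing is alert-worthy too
      await sendAlert({ error: err.message });
    }
  }, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop
}
```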
ML-based anomaly detection. Rule-based detection catches known patterns. ML-based detection catches unknown ones. The rules are a starting point, not an endpoint.
Everything is open source.
GitHub: github.com/Infinimus01/distributed-payment-system-
Stack: Node.js · PostgreSQL · Redis · Docker
If you have feedback on the architecture or spot something I got wrong, drop a comment. I’m genuinely interested.
Amlendu Pandey — Backend Engineer
LinkedIn: linkedin.com/in/amlendupandey16
GitHub: github.com/Infinimus01