Building a Blockchain Transfer Orchestrator in Go
What I learned about workflow state, retry, webhook delivery, and reconciliation
Khalid Alhabibie · 10 min read
When people talk about blockchain development, the conversation usually jumps straight to smart contracts, wallets, gas fees, private keys, or transaction hashes.
And honestly, I get it.
Those parts sound cooler.
Nobody opens a blockchain article hoping to read about retry workers, webhook status tables, or reconciliation jobs.
That sounds less like Web3 and more like backend engineers quietly fighting production ghosts at 11 PM.
But the more I thought about backend systems that interact with blockchain networks, the more I realized something:
The difficult part is not only sending a transaction.
The difficult part is managing everything around it.
For example:
What happens if the blockchain node is slow?
What happens if the backend broadcasts a transaction, but crashes before saving the transaction hash?
What happens if the transaction is still pending after several minutes?
What happens if the webhook delivery fails, but the transfer itself is already confirmed?
What happens if everything looks fine in the API response, but the internal state is quietly lying to you?
That last one is not a bug.
That is a horror story with a database connection.
That was one of the reasons I built go-aegis, a small Go project where I explored blockchain transfer orchestration from a backend reliability perspective.
Not as a perfect production system.
Not as an exchange-grade platform.
And definitely not as a “blockchain will solve everything” kind of project.
More like:
“Okay, what actually needs to happen so this backend does not embarrass itself when the real world becomes messy?”
A Blockchain Transfer Is Not Just One API Call
In many backend systems, we are used to this kind of flow:
Client → Backend → Database → Response
The client sends a request.
The backend processes it.
The backend returns a response.
Clean.
Beautiful.
Suspiciously peaceful.
For many use cases, that is enough.
But blockchain transfers are different.
When a backend submits a transaction to a blockchain network, the result is not always final immediately.
Sometimes the backend only receives a transaction hash.
And a transaction hash does not always mean the transfer is complete.
It usually means:
“The transaction was submitted. Now wait and pray the network behaves.”
Okay, maybe not pray.
But you get the idea.
That small difference changes the backend design.
Because now the system has to deal with states like:
PENDING
BROADCASTED
CONFIRMING
CONFIRMED
FAILED
UNKNOWN
This is where backend problems usually begin.
If we treat blockchain transfers like normal synchronous API calls, we hide too much complexity inside one request.
And hidden complexity has a funny habit.
It waits quietly during development, passes the happy path demo, and then shows up in production like it owns the place.
The System Needs Memory
One thing I wanted to make clear in this project is workflow state.
I do not want the system to only know whether a transfer succeeded or failed.
That is too simple.
In real systems, we often need to know where something failed.
Did it fail during validation?
Did it fail before being queued?
Did it fail while broadcasting the transaction?
Was the transaction already sent, but confirmation is still pending?
Was the transfer confirmed, but webhook delivery failed?
Those are different problems.
So the system needs more specific states.
For example:
CREATED
VALIDATED
QUEUED
BROADCASTING
BROADCASTED
CONFIRMING
CONFIRMED
FAILED
RECONCILED
This may look like extra work.
And yes, it is extra work.
But it is the kind of extra work that saves your future self from opening logs, drinking cold coffee, and whispering:
“Why is this still pending?”
A clear workflow state gives the system memory.
Without it, debugging becomes a guessing game.
And guessing is not a strong engineering strategy, especially when the system is moving assets.
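As a rough sketch in Go, those states can be typed constants with an explicit transition table. The state names follow the list above; the transition map and function names are my own illustration, not the actual go-aegis implementation:

```go
package main

import "fmt"

// TransferStatus models the workflow states listed above.
type TransferStatus string

const (
	StatusCreated      TransferStatus = "CREATED"
	StatusValidated    TransferStatus = "VALIDATED"
	StatusQueued       TransferStatus = "QUEUED"
	StatusBroadcasting TransferStatus = "BROADCASTING"
	StatusBroadcasted  TransferStatus = "BROADCASTED"
	StatusConfirming   TransferStatus = "CONFIRMING"
	StatusConfirmed    TransferStatus = "CONFIRMED"
	StatusFailed       TransferStatus = "FAILED"
	StatusReconciled   TransferStatus = "RECONCILED"
)

// validTransitions makes illegal state jumps explicit instead of
// relying on every caller to remember the workflow order.
var validTransitions = map[TransferStatus][]TransferStatus{
	StatusCreated:      {StatusValidated, StatusFailed},
	StatusValidated:    {StatusQueued, StatusFailed},
	StatusQueued:       {StatusBroadcasting, StatusFailed},
	StatusBroadcasting: {StatusBroadcasted, StatusFailed},
	StatusBroadcasted:  {StatusConfirming, StatusFailed},
	StatusConfirming:   {StatusConfirmed, StatusFailed},
	StatusConfirmed:    {StatusReconciled},
	StatusFailed:       {StatusReconciled},
}

// CanTransition reports whether moving from one status to another
// is allowed by the workflow.
func CanTransition(from, to TransferStatus) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StatusCreated, StatusValidated)) // allowed
	fmt.Println(CanTransition(StatusCreated, StatusConfirmed)) // not allowed
}
```

The point of the table is that a worker cannot accidentally jump a transfer from CREATED straight to CONFIRMED; every hop has to be declared.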
The Basic Flow I Had in Mind
The flow I wanted to model is something like this:
Client
↓
Create transfer request
↓
Store transfer as CREATED
↓
Validate request
↓
Queue transfer job
↓
Worker broadcasts transaction
↓
Store transaction hash
↓
Check confirmation
↓
Send webhook
↓
Run reconciliation
On paper, this looks simple.
But every box in that flow has its own way of ruining your day.
The API can fail.
The database can fail.
The queue can fail.
The worker can crash.
The blockchain node can timeout.
The webhook receiver can return 500.
And somehow, the user will still expect a clear answer.
That is fair.
The system should be able to explain what happened.
This is why I prefer to make the workflow explicit instead of hiding everything inside one big function.
A big function may look clean on day one.
But when something goes wrong, a clear workflow is much easier to reason about.
A Simple Transfer Table
A simplified transfer table can look like this:
CREATE TABLE blockchain_transfers (
id UUID PRIMARY KEY,
request_id VARCHAR(100) NOT NULL UNIQUE,
from_address VARCHAR(255) NOT NULL,
to_address VARCHAR(255) NOT NULL,
asset_symbol VARCHAR(20) NOT NULL,
amount NUMERIC(30, 18) NOT NULL,
status VARCHAR(50) NOT NULL,
tx_hash VARCHAR(255),
failure_reason TEXT,
confirmation_count INT DEFAULT 0,
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
The important part here is not only tx_hash.
Of course, the transaction hash matters.
But the internal status matters too.
The transaction hash tells us what happened on-chain.
The internal status tells us what happened inside our system.
In a reliable backend, we usually need both.
Because if the blockchain says one thing and your database says another, congratulations.
You now have a reconciliation problem.
Not the worst problem in the world, but definitely not the kind you want to discover from a customer complaint.
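One way to keep the status column honest inside our own system is a guarded UPDATE against the table above, so the status only moves forward if the row is still in the expected state. This is a sketch of the idea, not the actual go-aegis query:

```sql
-- Move to BROADCASTED only if the row is still BROADCASTING.
-- A stale or duplicate worker matches zero rows instead of
-- silently overwriting a later state.
UPDATE blockchain_transfers
SET status = 'BROADCASTED',
    tx_hash = $1,
    updated_at = NOW()
WHERE id = $2
  AND status = 'BROADCASTING';
```

Checking the affected-row count after this update tells the worker whether it actually won the transition or should back off.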
Why I Would Not Do Everything Inside the HTTP Request
One decision I would avoid is making the HTTP request do too much.
For example, this kind of flow looks simple:
POST /transfers
↓
Validate request
↓
Sign transaction
↓
Broadcast transaction
↓
Wait for confirmation
↓
Send webhook
↓
Return response
It looks productive.
It also looks like a future timeout incident wearing a nice shirt.
For blockchain transfers, I do not like this flow.
The request becomes too heavy.
It can timeout.
Retry becomes harder.
The user-facing API becomes coupled to slow external processes.
So I prefer this kind of approach:
POST /transfers
↓
Create transfer record
↓
Publish job to queue
↓
Return response
Then a worker handles the dangerous part:
Worker
↓
Pick transfer job
↓
Broadcast transaction
↓
Update status
↓
Check confirmation later
This makes the API simpler and faster.
It also gives the backend more control over retry, failure handling, and recovery.
The queue can be RabbitMQ, Kafka, SQS, Pub/Sub, or something else.
The tool can change.
The pattern is the important part:
Accept the request quickly.
Process the risky workflow carefully.
Retry Is Useful, But It Can Also Hurt You
Retry is one of those backend patterns that sounds safe.
If something fails, try again.
Simple.
Peaceful.
Dangerous.
In transfer systems, retry can create real problems if it is not designed carefully.
Imagine this:
1. Backend broadcasts the transaction successfully.
2. Blockchain node times out before returning a response.
3. Backend thinks the broadcast failed.
4. Backend retries the broadcast.
5. Another transaction may be created.
Now the system may have sent the same transfer twice.
That is not “oops”.
That is a meeting.
Probably with several people.
Maybe with finance.
Maybe with someone asking why the dashboard is green while the money is not.
So retry should not be blind.
Some failures are usually safer to retry:
Temporary network error
Queue timeout
Webhook receiver unavailable
Confirmation checker timeout
But some failures need more careful handling:
Unknown broadcast result
Nonce conflict
Transaction already known
Insufficient funds
Invalid signature
Gas estimation failure
For me, the most dangerous state is not always FAILED.
Sometimes it is UNKNOWN.
Because when the state is unknown, the system does not know whether it is safe to repeat the operation.
And when money or assets are involved, uncertainty is not just uncomfortable.
It is expensive.
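A classifier for those failure classes can be sketched as a three-way decision. The type, names, and substring matching are my own simplification (real node clients usually expose structured error codes), not how go-aegis classifies failures:

```go
package main

import (
	"fmt"
	"strings"
)

// RetryDecision splits failures into the three classes above.
type RetryDecision int

const (
	Retry       RetryDecision = iota // transient: safe to try again
	DoNotRetry                       // deterministic: retrying cannot help
	Investigate                      // outcome unknown: retrying may duplicate the transfer
)

// ClassifyBroadcastError maps a broadcast failure message to a
// retry decision.
func ClassifyBroadcastError(msg string) RetryDecision {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "insufficient funds"),
		strings.Contains(m, "invalid signature"),
		strings.Contains(m, "gas estimation"):
		return DoNotRetry
	case strings.Contains(m, "timeout"),
		strings.Contains(m, "nonce"),
		strings.Contains(m, "already known"):
		// The node may have accepted the transaction even though we
		// never saw a response; a blind retry risks a duplicate.
		return Investigate
	case strings.Contains(m, "connection refused"):
		// The request never reached the node, so retrying is safe.
		return Retry
	default:
		return Investigate // unknown failures get the careful path
	}
}

func main() {
	fmt.Println(ClassifyBroadcastError("rpc timeout") == Investigate)
	fmt.Println(ClassifyBroadcastError("insufficient funds") == DoNotRetry)
}
```

The important design choice is the default case: anything the system does not recognize falls into Investigate, because for broadcast errors "I don't know" must never be treated as "try again".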
Duplicate Processing Can Happen
Another thing I always try to remember:
The same job can be processed more than once.
Not because the developer is careless.
But because distributed systems behave like distributed systems.
A worker can crash.
A queue can redeliver a message.
A retry can happen.
Two workers can accidentally pick the same job.
Somewhere, somehow, the same transfer may knock on the door twice.
So the system should not assume that a job is only executed once.
One simple protection is using a processing lock.
For example:
transfer:{transfer_id}:processing
Before a worker processes a transfer, it tries to acquire a lock.
If the lock already exists, another worker is processing that transfer.
A simplified Go example:
func (s *TransferService) ProcessTransfer(ctx context.Context, transferID string) error {
	lockKey := "transfer:" + transferID + ":processing"

	locked, err := s.lock.Acquire(ctx, lockKey, 30*time.Second)
	if err != nil {
		return err
	}
	if !locked {
		// Another worker holds the lock; skip quietly.
		return nil
	}
	defer s.lock.Release(ctx, lockKey)

	transfer, err := s.transferRepo.FindByID(ctx, transferID)
	if err != nil {
		return err
	}
	if transfer.Status == "BROADCASTED" || transfer.Status == "CONFIRMED" {
		// Already processed; do not broadcast again.
		return nil
	}

	return s.broadcastTransfer(ctx, transfer)
}
This is not the only way.
We can also use database row locking.
We can use queue-level deduplication.
We can use idempotency keys at the API level.
But the mindset is the same:
Assume duplicate execution can happen.
Design the system so it does not create duplicate damage.
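The API-level idempotency key is worth a sketch of its own. In production this job is done by the UNIQUE constraint on request_id from the table earlier; the in-memory store below (my own illustration, not go-aegis code) only shows the contract a client sees:

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyStore remembers which request IDs have already been
// accepted, standing in for the UNIQUE(request_id) constraint.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]string // request_id -> transfer_id
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]string)}
}

// Register returns the transfer ID for a request. If the request was
// already accepted, it returns the original transfer ID and false, so
// a client retry cannot create a second transfer.
func (s *IdempotencyStore) Register(requestID, transferID string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.seen[requestID]; ok {
		return existing, false
	}
	s.seen[requestID] = transferID
	return transferID, true
}

func main() {
	store := NewIdempotencyStore()
	fmt.Println(store.Register("req-1", "transfer-a")) // first attempt
	fmt.Println(store.Register("req-1", "transfer-b")) // client retry
}
```

The retry gets the same transfer ID back as the first call, which is exactly the behavior a nervous client hammering the submit button should see.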
Webhook Delivery Should Have Its Own State
Another part that is easy to underestimate is webhook delivery.
After a transfer is confirmed, another system may need to be notified.
For example:
Transfer confirmed → send webhook to client system
Simple, right?
That is how production traps start.
The receiver can be down.
The network can timeout.
The receiver can return 500.
The receiver may actually process the webhook but fail to return a proper response.
So I do not like treating webhook delivery as a simple HTTP call inside the main transfer process.
Webhook delivery should have its own lifecycle.
For example:
PENDING
DELIVERING
DELIVERED
FAILED
RETRYING
A simple table can look like this:
CREATE TABLE webhook_deliveries (
id UUID PRIMARY KEY,
transfer_id UUID NOT NULL,
target_url TEXT NOT NULL,
event_type VARCHAR(100) NOT NULL,
payload JSONB NOT NULL,
status VARCHAR(50) NOT NULL,
attempt_count INT DEFAULT 0,
last_error TEXT,
next_retry_at TIMESTAMP,
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
This table helps the system answer a very practical question:
Did we notify the client?
Without this, the answer often becomes:
Let me check the logs first.
Logs are useful.
But important business workflows should also be visible from the data model.
Because logs are where developers go to suffer.
The database should help too.
Reconciliation Is the Safety Net
One lesson I keep repeating to myself is this:
A backend system should not trust itself too much.
Especially when it talks to external systems.
The blockchain is external.
A payment gateway is external.
A banking partner is external.
Even another internal service can behave like an external dependency when it fails or changes unexpectedly.
Our database may say the transfer is still pending.
But maybe the blockchain already confirmed it.
Our system may say webhook delivery failed.
But maybe the receiver already processed it.
This is why reconciliation matters.
Reconciliation is the process of comparing internal state with the external source of truth.
For blockchain transfers, reconciliation can check:
Internal transfer status
Transaction hash
On-chain transaction status
Confirmation count
Block number
Amount
From address
To address
Asset
A simplified reconciliation flow:
Find transfers not finalized
↓
Check blockchain by transaction hash
↓
Compare on-chain result with internal status
↓
Update internal status if needed
↓
Create reconciliation record
Example pseudo-code:
func (s *ReconciliationService) ReconcileTransfer(ctx context.Context, transferID string) error {
	transfer, err := s.transferRepo.FindByID(ctx, transferID)
	if err != nil {
		return err
	}
	if transfer.TxHash == "" {
		// Broadcast may have crashed before the hash was saved.
		return s.handleMissingTransactionHash(ctx, transfer)
	}

	onchainTx, err := s.chainClient.GetTransaction(ctx, transfer.TxHash)
	if err != nil {
		return err
	}

	// The chain is the source of truth; align our status with it.
	if onchainTx.IsConfirmed() && transfer.Status != "CONFIRMED" {
		transfer.Status = "CONFIRMED"
		transfer.ConfirmationCount = onchainTx.ConfirmationCount
		return s.transferRepo.Update(ctx, transfer)
	}
	if onchainTx.IsFailed() && transfer.Status != "FAILED" {
		transfer.Status = "FAILED"
		transfer.FailureReason = onchainTx.FailureReason
		return s.transferRepo.Update(ctx, transfer)
	}
	return nil
}
This kind of logic may not look impressive in a demo.
Nobody claps because a reconciliation job corrected a mismatch.
But in production, that is exactly the kind of boring system behavior that keeps people calm.
And honestly, calm is underrated.
Observability Helps the System Explain Itself
For a transfer orchestrator, logs alone are not enough.
I want to know things like:
How many transfers are still pending?
How many failed during broadcast?
How long does confirmation usually take?
How many webhooks are retrying?
How many jobs are stuck?
How many reconciliation mismatches were found?
Some metrics that can help:
blockchain_transfer_created_total
blockchain_transfer_broadcasted_total
blockchain_transfer_confirmed_total
blockchain_transfer_failed_total
blockchain_transfer_confirmation_duration_seconds
webhook_delivery_failed_total
webhook_delivery_retry_total
reconciliation_mismatch_total
The goal is not to make the dashboard look fancy.
The goal is to make operational problems visible earlier.
Because when a transfer is stuck, I do not want the team to find out only after a user complains.
A good system should raise its hand before the customer does.
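Wiring up counters like those does not need much code. Prometheus is the more common choice in practice; the sketch below uses the standard library's expvar package (served on /debug/vars when net/http is running) purely to show the idea without extra dependencies, and is not how go-aegis does it:

```go
package main

import (
	"expvar"
	"fmt"
)

// Counters named after the metrics listed above.
var (
	transfersCreated    = expvar.NewInt("blockchain_transfer_created_total")
	transfersConfirmed  = expvar.NewInt("blockchain_transfer_confirmed_total")
	transfersFailed     = expvar.NewInt("blockchain_transfer_failed_total")
	webhookRetries      = expvar.NewInt("webhook_delivery_retry_total")
	reconcileMismatches = expvar.NewInt("reconciliation_mismatch_total")
)

func main() {
	// Workers would increment these at each workflow step.
	transfersCreated.Add(1)
	transfersConfirmed.Add(1)
	fmt.Println(transfersCreated.Value(), transfersConfirmed.Value())
}
```

Whatever the backend, the counters should be incremented at the same places the workflow status changes, so the dashboard and the database tell the same story.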
What I Learned From Building This
Building go-aegis reminded me that blockchain backend development is not only about blockchain.
The backend still needs classic reliability patterns:
Workflow state
Idempotency
Queue-based processing
Retry control
Duplicate protection
Webhook tracking
Reconciliation
Observability
The blockchain part may be the most interesting part from the outside.
But from the backend side, the surrounding system is what makes the product reliable.
A transaction hash is not enough.
A success response is not enough.
A queue is not enough.
A database row is not enough.
The system needs to understand what happened, what is happening now, and what should happen next.
That is the real job of an orchestrator.
Final Thought
If I had to summarize the lesson from this project, it would be this:
A blockchain backend should not only send transactions.
It should orchestrate uncertainty.
Because uncertainty is everywhere.
Maybe the transaction was sent.
Maybe the node timed out.
Maybe the worker crashed.
Maybe the webhook failed.
Maybe the external state changed.
A reliable backend does not pretend these cases do not exist.
It designs for them.
That is what I wanted to explore with go-aegis.
Not just how to send a blockchain transaction, but how to build a backend that can survive the messy parts around it.
And if the system can survive the messy parts, then maybe the backend engineer can sleep a little better too.
Not guaranteed.
But at least the database has a status column.
Project Repository
I also built a small Go project to explore this idea in code:
go-aegis
A blockchain orchestration backend in Go for wallet-based transfers, transaction lifecycle management, event indexing, webhook delivery, and reconciliation.