How to Collect Server Events Across Hundreds of Microservices

At five microservices, event collection is a straightforward problem. Every service publishes to Kafka. Someone writes a shared library. You move on.

At twenty services, it becomes an organizational problem disguised as a technical one. The question stops being “which message broker should we use?” and becomes “who is responsible for making sure that event actually makes it there — with the right schema, the right topic name, the right retry semantics?”

We went through three approaches before we landed on something stable. This is the comparison.

Approach 1: Every service owns its own publishing

The default. Each team decides when and how their service publishes events. They pick the schema, define the topic-naming convention, write the retry logic, and handle serialization. Feels right for microservices — autonomy, independent deployment, each team moves at their own pace.

What actually happened: we ended up with five different schemas for what was supposed to be the same event, because separate teams had written their publishers independently over eighteen months. Two teams used camelCase for field names. Two used snake_case. One team was publishing every internal database row change as a separate event, flooding topics that downstream consumers were trying to use for business-level signals.
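
To make the naming drift concrete, here is a minimal Go sketch. The event name, the struct names, and the fields are hypothetical, not taken from our services. It shows what happens when a consumer written against one team's camelCase schema receives another team's snake_case payload for the "same" event: nothing errors, the fields just come back empty.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OrderCreatedA is how one hypothetical team serialized the event (camelCase).
type OrderCreatedA struct {
	OrderID    string `json:"orderId"`
	CustomerID string `json:"customerId"`
}

// OrderCreatedB is another hypothetical team's version of the same event (snake_case).
type OrderCreatedB struct {
	OrderID    string `json:"order_id"`
	CustomerID string `json:"customer_id"`
}

// decodeAsA mimics a consumer that was written against team A's schema.
func decodeAsA(payload []byte) (OrderCreatedA, error) {
	var ev OrderCreatedA
	err := json.Unmarshal(payload, &ev)
	return ev, err
}

func main() {
	fromA, _ := json.Marshal(OrderCreatedA{OrderID: "o-1", CustomerID: "c-9"})
	fromB, _ := json.Marshal(OrderCreatedB{OrderID: "o-1", CustomerID: "c-9"})

	a, errA := decodeAsA(fromA)
	b, errB := decodeAsA(fromB)

	fmt.Printf("from team A: %+v (err=%v)\n", a, errA)
	// Unknown keys are ignored by encoding/json, so this decodes
	// "successfully" into a struct with empty fields.
	fmt.Printf("from team B: %+v (err=%v)\n", b, errB)
}
```

The failure is silent, which is why this kind of drift tended to surface downstream rather than at the producer.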

When someone needed to consume events from a service, there was no documentation on what events existed or what their schemas looked like. They found out by reading the producer code directly.

Ownership in name. Chaos in practice.

Approach 2: One team owns event publishing for everyone

The obvious overcorrection. We created a dedicated event platform team. Their remit: maintain the schema registry, review all new event proposals, publish and maintain a shared Go library for event publishing, own the Kafka cluster and topic configuration.

The bottleneck arrived within six weeks. Everyone was raising tickets to register new event types. The event platform team was reviewing schema proposals for domains they didn’t understand — they were approving field names without knowing whether the business logic behind them was right. A schema change request that the service team needed in a day was taking three days to move through review.

The ownership was clear. The velocity was gone.

Approach 3: Sidecar collector agents, schema registry, no publishing team

What we landed on was a different model. Services do not publish events directly. They emit structured logs in a documented format. We use Vector as the collector agent: running alongside each service as a sidecar, it reads those logs, validates them against a schema stored in a central registry, and forwards the valid events to Kafka.
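
From the service's side, "emitting an event" can be as small as writing one JSON line to stdout. The sketch below assumes a hypothetical envelope (`event_type`, `schema_version`, `payload`); the article's actual log format is not shown here, so treat every field name as illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical documented log format for events.
// The field names are assumptions made for this example.
type Envelope struct {
	EventType     string         `json:"event_type"`
	SchemaVersion int            `json:"schema_version"`
	Payload       map[string]any `json:"payload"`
}

// encodeEvent renders one event as a single JSON line.
func encodeEvent(eventType string, version int, payload map[string]any) (string, error) {
	b, err := json.Marshal(Envelope{
		EventType:     eventType,
		SchemaVersion: version,
		Payload:       payload,
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := encodeEvent("order.created", 1, map[string]any{
		"order_id":    "o-1",
		"customer_id": "c-9",
	})
	if err != nil {
		panic(err)
	}
	// The service only writes to stdout; the sidecar agent is what
	// picks the line up and forwards it. No Kafka client is imported.
	fmt.Println(line)
}
```

The point of the design is visible in the imports: the service depends on a JSON encoder and nothing else.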

The result: service teams never write Kafka publishing code. They write logs. The schema registry is a shared Git repository. New event types are added via a pull request against that repository — peer-reviewed by whoever is affected, not by a specialized team that has to understand every domain.

If a log doesn’t pass schema validation, the agent rejects it immediately, and the service deployment fails its health check. The service team gets the error in their own pipeline, not as a surprise reported by a downstream consumer three days later.
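
The validation step can be sketched in plain Go. This is not Vector configuration or Vector's API; it is a simplified stand-in showing the kind of check the agent performs. The `registry` map, the schema shape (field name to JSON type), and the event fields are all assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Schema maps each required payload field to its expected JSON type.
type Schema map[string]string

// registry stands in for schemas loaded from the shared Git repository.
var registry = map[string]Schema{
	"order.created": {"order_id": "string", "total_cents": "number"},
}

// jsonType names the JSON type of a value decoded by encoding/json.
func jsonType(v any) string {
	switch v.(type) {
	case string:
		return "string"
	case float64:
		return "number"
	case bool:
		return "boolean"
	case nil:
		return "null"
	case []any:
		return "array"
	case map[string]any:
		return "object"
	}
	return "unknown"
}

// validate checks one log line against the schema for its event type.
func validate(line []byte) error {
	var ev struct {
		EventType string         `json:"event_type"`
		Payload   map[string]any `json:"payload"`
	}
	if err := json.Unmarshal(line, &ev); err != nil {
		return fmt.Errorf("not valid JSON: %w", err)
	}
	schema, ok := registry[ev.EventType]
	if !ok {
		return fmt.Errorf("unknown event type %q", ev.EventType)
	}
	for field, want := range schema {
		v, present := ev.Payload[field]
		if !present {
			return fmt.Errorf("%s: missing field %q", ev.EventType, field)
		}
		if got := jsonType(v); got != want {
			return fmt.Errorf("%s: field %q is %s, want %s", ev.EventType, field, got, want)
		}
	}
	return nil
}

func main() {
	lines := []string{
		`{"event_type":"order.created","payload":{"order_id":"o-1","total_cents":1299}}`,
		`{"event_type":"order.created","payload":{"order_id":"o-1","total_cents":"12.99"}}`,
	}
	for _, l := range lines {
		if err := validate([]byte(l)); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted")
	}
}
```

A rejection like the second line is the signal that would be wired into the health check, so the producing team sees it before any consumer does.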

The platform team’s job became running the collector agent infrastructure. They no longer gatekeep schemas. The organizational boundary matched the knowledge boundary: service teams own their schemas because they understand their domain; the platform team owns the transport because they understand Kafka.

What we learned about ownership

We tried the first two approaches before landing on the third, and the pattern was consistent: wherever you put the ownership, the boundary of that ownership has to match the boundary of the knowledge.

Approach 1 failed because it gave teams too much transport responsibility alongside domain responsibility — schema decisions and Kafka configuration are different kinds of problems, and mixing them produced inconsistency.

Approach 2 failed because it gave transport specialists domain responsibility — the event platform team was approving schemas for business concepts they didn’t own.

Approach 3 worked because it separated the two concerns cleanly. Domain teams own what events mean. Infrastructure teams own how events travel. The agent is the seam between them.

Event collection at scale is an organizational architecture problem. The technical implementation follows from getting that right.

How to Collect Server Events Across Hundreds of Microservices was originally published in Level Up Coding on Medium.