Enterprise Architecture · 6 May 2026 · 8 min read
Event-Driven Backend Patterns for Retail Inventory
Real-time inventory is the hardest consistency problem in retail. Here is how to build an event-driven backend that survives duplicate deliveries, schema drift and channel drift.
Tom Fitzgerald
Staff Backend Engineer
Inventory is the one number every channel in a retail business fights over. The point of sale wants to know if it can complete a transaction right now. The e-commerce storefront wants to avoid overselling. The warehouse wants an accurate picture of what is physically on the shelf. All three are reading and writing the same logical quantity from different systems, on different networks, at different times. This is a distributed consistency problem dressed up as a business feature.
The instinct is to reach for a single source of truth and have everyone call it synchronously. That works until the network between a store and the datacentre wobbles, or the warehouse management system takes a maintenance window, and suddenly you cannot sell anything anywhere. Event-driven architecture is the standard answer, but it is not free. It trades synchronous coupling for a set of failure modes that will absolutely bite you if you treat the broker as a magic pipe. This piece is about the patterns that make it hold together.
The outbox pattern is non-negotiable
The most common bug I see in event-driven inventory systems is a dual-write. A service updates the database, then publishes an event to the broker. These are two separate systems with no shared transaction. If the process dies between the commit and the publish, you have changed stock levels but nobody downstream knows. Do it the other way around and you can publish an event describing a change that then rolls back.
The transactional outbox fixes this by writing the event into the same database, in the same transaction, as the state change. A separate relay process reads the outbox table and publishes to the broker, marking rows as sent. The database commit is the single atomic point. Either the stock movement and its event both exist, or neither does.
```typescript
await db.transaction(async (tx) => {
  // State change and event share one transaction: both commit or neither does.
  await tx.stockLevels.decrement({ sku, location, qty });
  await tx.outbox.insert({
    id: crypto.randomUUID(), // stable event ID that consumers can dedupe on
    aggregateId: sku,
    type: "StockReserved",
    payload: { sku, location, qty, orderId },
    occurredAt: new Date(),
  });
});
```

You can drive the relay with change data capture off the outbox table, or a simple polling loop with a claimed-at column. CDC scales better and adds no query load, but a poller is trivial to reason about and perfectly fine at store-level volumes. Do not skip the outbox because the poller feels crude; the crude version that is correct beats the elegant dual-write that loses events.
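To make the polling option concrete, here is a minimal sketch of one relay pass. It assumes an in-memory array standing in for the outbox table and a hypothetical `publish` callback standing in for the broker client; the `claimedAt` and `sentAt` columns, the `relayOnce` name and the 30-second claim timeout are illustrative choices, not a prescribed schema.

```typescript
// In-memory stand-in for an outbox table; a real relay would use SQL and a broker client.
interface OutboxRow {
  id: string;
  type: string;
  payload: unknown;
  claimedAt: number | null; // epoch ms when a relay instance claimed the row
  sentAt: number | null;    // epoch ms when the broker acknowledged the publish
}

// Hypothetical publish callback: returns true once the broker has acknowledged.
type Publish = (row: OutboxRow) => boolean;

const CLAIM_TIMEOUT_MS = 30_000; // assumed: claims older than this are treated as abandoned

function relayOnce(
  outbox: OutboxRow[],
  publish: Publish,
  now: number,
  batchSize = 100,
): number {
  // Claim unsent rows that are unclaimed or whose claim has expired, in insertion order.
  const batch = outbox
    .filter(
      (r) =>
        r.sentAt === null &&
        (r.claimedAt === null || now - r.claimedAt > CLAIM_TIMEOUT_MS),
    )
    .slice(0, batchSize);
  for (const row of batch) row.claimedAt = now;

  let sent = 0;
  let failed = false;
  for (const row of batch) {
    if (!failed && publish(row)) {
      row.sentAt = now; // mark sent only after the broker acknowledges
      sent++;
    } else {
      // Stop at the first failure and release the remaining claims so that
      // the next pass retries them in their original order.
      failed = true;
      row.claimedAt = null;
    }
  }
  return sent;
}
```

Note what happens if the relay dies after `publish` succeeds but before the row is marked sent: the next pass publishes it again. The outbox guarantees nothing is lost, not that nothing is repeated.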
Idempotency, and why exactly-once is a story we tell ourselves
Plan for at-least-once delivery from any broker you run in production. Retries, consumer rebalances, relay restarts after a publish that succeeded but was not acknowledged: all of these produce duplicates. Kafka's transactional producer gives you exactly-once semantics within Kafka, but the moment your consumer writes to an external database or calls a warehouse API, that guarantee stops at the boundary. What you actually want is effectively-once: at-least-once delivery plus idempotent consumers.
The practical rule is that every event carries a stable identifier, and every consumer records which identifiers it has already applied before acting. A processed-events table with a unique constraint on the event ID, written in the same transaction as the side effect, turns a duplicate delivery into a no-op.
Be careful with operations that are not naturally idempotent. "Decrement stock by 5" applied twice is a bug. "Set stock for order 1234 to reserved" applied twice is safe. Where you can, model events as absolute state transitions keyed by a business identifier rather than relative deltas. Where you cannot, lean on the dedupe table.
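A minimal sketch of the dedupe rule, assuming an in-memory `Set` standing in for the processed-events table and a `Map` standing in for the reservation state. The `InventoryConsumer` class and its method names are hypothetical; in a real consumer the two writes would share one database transaction, with the unique constraint doing the job of the `has` check.

```typescript
interface StockReserved {
  id: string; // stable event identifier, assigned once by the producer
  sku: string;
  location: string;
  qty: number;
  orderId: string;
}

class InventoryConsumer {
  private processed = new Set<string>();        // stand-in for the processed-events table
  private reserved = new Map<string, number>(); // reserved quantity per sku@location

  // Returns true if the event was applied, false if it was a duplicate.
  handle(event: StockReserved): boolean {
    if (this.processed.has(event.id)) return false; // duplicate delivery becomes a no-op

    const key = `${event.sku}@${event.location}`;
    // In a real database these two writes must commit in the same transaction.
    this.reserved.set(key, (this.reserved.get(key) ?? 0) + event.qty);
    this.processed.add(event.id);
    return true;
  }

  reservedFor(sku: string, location: string): number {
    return this.reserved.get(`${sku}@${location}`) ?? 0;
  }
}
```

The event here is a relative delta, so the dedupe table is the only thing standing between a redelivery and a double reservation. That is exactly the case where it cannot be skipped.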
Event schemas and versioning
Your events are a public API the moment a second team consumes them. Treat them with the same discipline. Use a schema registry with a concrete format such as Avro or Protobuf, enforce backward compatibility on the registry, and make additive change the default: new fields are optional with defaults, old fields are never repurposed.
- Never change the meaning of an existing field. If semantics change, add a new field or a new event type.
- Keep events fat enough to be useful but avoid dumping entire aggregates; downstream coupling grows with every field you expose.
- Version the event type name (StockAdjusted.v2) only for breaking changes, and run both versions in parallel until consumers migrate.
- Include occurredAt and a producer identifier on every event so consumers can order and audit without guessing.
The failure mode here is silent: a producer adds a required field, an old consumer deserialises garbage or drops the message, and stock quietly drifts. Compatibility checks in the registry are what stop a schema change from becoming an inventory incident three days later.
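On the consumer side, additive change looks roughly like the sketch below. It is written against plain JSON-style objects rather than Avro or Protobuf so that it stays self-contained; `reasonCode` is a hypothetical field added after the first release, and `"UNSPECIFIED"` is an assumed default. With a registry-backed format the deserialiser does this work for you; the point is only to show what "optional with a default" means for an old payload.

```typescript
interface StockAdjusted {
  sku: string;
  location: string;
  delta: number;
  occurredAt: string; // ISO-8601 timestamp
  producer: string;
  reasonCode: string; // hypothetical field added later; defaulted when absent
}

function decodeStockAdjusted(raw: unknown): StockAdjusted {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("event must be an object");
  }
  const fields = raw as Record<string, unknown>;

  // Original fields are required: fail loudly rather than deserialise garbage.
  const requireString = (name: string): string => {
    const value = fields[name];
    if (typeof value !== "string") throw new Error(`missing or invalid field: ${name}`);
    return value;
  };
  const delta = fields["delta"];
  if (typeof delta !== "number") throw new Error("missing or invalid field: delta");

  // Fields added later are optional with a default, so old payloads still decode.
  const reason = fields["reasonCode"];

  // Unknown fields are ignored, so newer producers do not break this consumer.
  return {
    sku: requireString("sku"),
    location: requireString("location"),
    delta,
    occurredAt: requireString("occurredAt"),
    producer: requireString("producer"),
    reasonCode: typeof reason === "string" ? reason : "UNSPECIFIED",
  };
}
```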
Eventual consistency across channels
Once you accept events, you accept that channels are eventually consistent. The e-commerce read model, the POS cache and the warehouse system will disagree for windows measured in milliseconds to seconds, occasionally minutes during a backlog. The engineering job is to make those windows bounded and safe rather than pretending they do not exist.
The most useful concept is the reservation. Instead of a single quantity, model available-to-sell as on-hand minus reserved. When an order comes in, you reserve against a location; the physical decrement happens later when the warehouse confirms the pick. This gives channels a safe number to sell against without waiting for the physical world to catch up, and it makes overselling a deliberate policy decision (oversell buffers) rather than an accident.
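The arithmetic is simple enough to sketch. The function names and the `oversellAllowance` parameter below are my own illustrative choices: the allowance is assumed to be the number of units the business has explicitly agreed to sell beyond on-hand stock, defaulting to zero.

```typescript
interface StockPosition {
  onHand: number;   // physically on the shelf, as last confirmed
  reserved: number; // promised to orders but not yet picked
}

// Available-to-sell is on-hand minus reserved, plus any deliberate oversell allowance.
function availableToSell(p: StockPosition, oversellAllowance = 0): number {
  return Math.max(0, p.onHand - p.reserved + oversellAllowance);
}

// Returns the new position, or null if the reservation cannot be honoured.
function reserve(
  p: StockPosition,
  qty: number,
  oversellAllowance = 0,
): StockPosition | null {
  if (qty <= 0 || qty > availableToSell(p, oversellAllowance)) return null;
  return { ...p, reserved: p.reserved + qty };
}

// The warehouse confirms the pick: stock leaves the shelf and the reservation is released.
function confirmPick(p: StockPosition, qty: number): StockPosition {
  if (qty <= 0 || qty > p.reserved) {
    throw new Error("pick quantity must be positive and covered by a reservation");
  }
  return { onHand: p.onHand - qty, reserved: p.reserved - qty };
}
```

Confirming a pick lowers on-hand and reserved by the same amount, so available-to-sell does not move. Channels saw the stock disappear at reservation time, which is why they never need to wait for the warehouse.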
Partitioning for ordering
Order matters per SKU-location, not globally. Partition your topics by a key like SKU or SKU-plus-location so that all events for one item land on one partition and are processed in order by one consumer. Global ordering is expensive and you almost never need it. Getting the partition key right is the difference between a consumer that can scale out and one that is forced to be single-threaded.
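The mechanics behind that are just a stable hash of the key, modulo the partition count. The sketch below uses an FNV-1a-style hash over UTF-16 code units purely for illustration; it is not the partitioner any particular broker uses (the Kafka Java client, for instance, hashes keys with murmur2), and `keyFor` and `partitionFor` are hypothetical helper names.

```typescript
// Build the partition key from SKU plus location so ordering is per SKU-location.
function keyFor(sku: string, location: string): string {
  return `${sku}|${location}`;
}

// Illustrative stable hash (FNV-1a style, 32-bit) mapped onto a partition index.
function partitionFor(key: string, partitionCount: number): number {
  if (!Number.isInteger(partitionCount) || partitionCount <= 0) {
    throw new RangeError("partitionCount must be a positive integer");
  }
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % partitionCount; // force unsigned before the modulo
}
```

The same key always maps to the same partition as long as the partition count is unchanged. Changing the count remaps keys, so repartitioning a live topic needs a plan for events already in flight.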
Reconciliation is a first-class feature
Even with the outbox and idempotency, streams drift. Events get skipped during a bad deploy, a consumer bug misapplies a batch, a warehouse does a manual stock count that never emitted events. You need a reconciliation process that periodically compares the event-derived state against the authoritative physical count and emits correction events for the deltas.
Treat reconciliation output as ordinary events flowing through the same pipeline, not as a side-channel database patch. That keeps every consumer consistent and preserves the audit trail. Run it on a cadence that matches the cost of being wrong: nightly for slow-moving lines, more frequently for high-velocity SKUs. The reconciliation job is also your best early-warning system; a growing delta is a symptom of a pipeline bug before customers ever notice.
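The core of a reconciliation pass can be sketched as a pure function from two snapshots to a list of correction events. The `StockCorrected` event type and the `sku@location` key format are assumptions for this example. Only locations that were actually counted are reconciled, and each correction carries the counted quantity as an absolute value so that applying it twice is harmless.

```typescript
interface StockCorrected {
  type: "StockCorrected"; // hypothetical event type for this sketch
  key: string;            // "sku@location"
  expected: number;       // quantity derived from the event stream
  counted: number;        // authoritative physical count (absolute, safe to reapply)
  delta: number;          // counted minus expected, kept for monitoring and audit
}

function reconcile(
  derived: Map<string, number>,
  counted: Map<string, number>,
): StockCorrected[] {
  const corrections: StockCorrected[] = [];
  // Walk the physical count; anything not counted is left alone.
  counted.forEach((countedQty, key) => {
    const expected = derived.get(key) ?? 0;
    if (expected !== countedQty) {
      corrections.push({
        type: "StockCorrected",
        key,
        expected,
        counted: countedQty,
        delta: countedQty - expected,
      });
    }
  });
  return corrections;
}

// Total absolute drift: a number worth graphing and alerting on.
function totalDrift(corrections: StockCorrected[]): number {
  return corrections.reduce((sum, c) => sum + Math.abs(c.delta), 0);
}
```

The corrections would then be written through the outbox like any other event, and `totalDrift` tracked over time gives you the early-warning signal.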
Choosing between Kafka, Service Bus and plain queues
There is no universal right answer, but there are clear fits. Reach for the tool that matches your ordering, replay and operational needs rather than the one with the best conference talks.
- Kafka (or MSK) when you need partitioned ordering, high throughput, event replay and multiple independent consumer groups reading the same stream. This is the natural fit for an inventory event backbone.
- Azure Service Bus or similar when you want managed queues and topics with sessions for ordering, dead-letter queues out of the box and lower operational overhead, and your volumes are moderate.
- A plain queue (SQS, RabbitMQ) when the work is task-oriented, order across items does not matter, and you value simplicity. Perfect for downstream jobs like reindexing search or sending notifications.
A pattern that ages well is Kafka as the ordered event backbone, with lightweight queues hanging off consumers for fan-out work that does not need ordering or replay. Do not run Kafka because it is fashionable; if a single team consumes a modest event stream, managed Service Bus will cost you far less in operational time.
Failure modes to plan for
- Consumer lag during peak: back-pressure and scaling out consumers within the group, with alerting on lag rather than on error rate alone.
- Poison messages: a dead-letter queue plus a documented replay path, so one malformed event does not stall a partition.
- Relay outages: monitor outbox depth; a rising unsent count means events are being written but not published.
- Duplicate storms after a rebalance: survivable only with an idempotent consumer, which is why the dedupe table is not optional.
- Schema drift: registry compatibility checks in CI, so breaking changes are caught before deployment rather than discovered in production.
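As one example from the list above, poison-message handling can be sketched as a bounded retry followed by a dead-letter write. Everything here is a stand-in: the `Envelope` and `DeadLetter` shapes, the in-memory dead-letter array and the default of three attempts are assumptions, and a managed broker may provide the dead-letter queue for you.

```typescript
interface Envelope<T> {
  id: string;
  payload: T;
}

interface DeadLetter<T> {
  event: Envelope<T>;
  error: string;    // last failure message, kept for the replay runbook
  attempts: number;
}

// Applies each event with bounded retries; failures are parked, not retried forever.
function processWithDeadLetter<T>(
  events: Envelope<T>[],
  handler: (event: Envelope<T>) => void,
  deadLetters: DeadLetter<T>[],
  maxAttempts = 3,
): number {
  let applied = 0;
  for (const event of events) {
    let succeeded = false;
    let lastError = "";
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
      try {
        handler(event); // must be idempotent, since it may run more than once
        succeeded = true;
      } catch (err) {
        lastError = err instanceof Error ? err.message : String(err);
      }
    }
    if (succeeded) {
      applied++;
    } else {
      // Park the event and move on so the rest of the partition keeps flowing.
      deadLetters.push({ event, error: lastError, attempts: maxAttempts });
    }
  }
  return applied;
}
```

Skipping a poisoned event means later events for the same SKU are applied without it. The replay path has to account for that, which is one more reason to prefer absolute state transitions and to keep reconciliation running.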
None of this is exotic. It is the same handful of patterns applied consistently. The teams that succeed with event-driven inventory are not the ones with the cleverest architecture; they are the ones who took idempotency, the outbox and reconciliation seriously from day one, and who treated eventual consistency as a design constraint to be bounded rather than a bug to be hidden.