Keeping one slow provider from stalling everyone

Closed a webhook-vs-event-store race, killed head-of-line blocking with per-provider lane isolation, and added a targeted cache that cut analytic-store query load 85–95% — on a service that has delivered 53M+ SMS.

Context

A cross-product Go service is the SMS backbone behind Dobare and its sister products — it has delivered 53M+ messages to 8M+ recipients, with bursts peaking around 8,000 messages a minute. At that volume, reliability problems that would be invisible in a toy system become daily operational pain. Three were hurting the most: a data race on inbound status, one slow provider stalling everyone, and an analytic store buckling under read load.

Constraints

Multiple SMS providers, none of them reliable in the same way. Each has its own latency profile, failure modes, and webhook quirks.
Status is eventually consistent by nature. Delivery receipts arrive asynchronously, out of order, and sometimes more than once.
It’s shared infrastructure. Several products consume it, so a regression here is a multi-product incident.

Approach

The race. Inbound delivery webhooks and the event store were competing: a webhook could land and try to update a message’s status before, or concurrently with, the write that first recorded the message. I reordered and guarded the write path so the event store is the authority and late or duplicate webhooks resolve deterministically instead of racing.
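
One way to get that kind of deterministic resolution is sketched below in Go. It is an illustration under assumptions, not the service's actual code: the Store type, the three-state Status ordering, and the "park early receipts until the record exists" rule are all hypothetical. The idea is that only the authoritative write may create a message, a receipt that arrives first is held and replayed, and status only ever moves forward, so duplicates and out-of-order receipts become no-ops.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical delivery state. The ordering is an assumption:
// a higher value is further along and is never overwritten by a lower one.
type Status int

const (
	Queued Status = iota
	Sent
	Delivered
)

// Store is an in-memory stand-in for the event store.
type Store struct {
	mu      sync.Mutex
	status  map[string]Status   // messages the store has recorded
	pending map[string][]Status // receipts that arrived before the record
}

func NewStore() *Store {
	return &Store{status: map[string]Status{}, pending: map[string][]Status{}}
}

// advance moves a message forward only; duplicate and stale receipts are
// no-ops. The caller must hold s.mu.
func (s *Store) advance(id string, st Status) {
	if st > s.status[id] {
		s.status[id] = st
	}
}

// RecordMessage is the authoritative write. It creates the record, then
// replays any receipts that arrived before it.
func (s *Store) RecordMessage(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.status[id]; ok {
		return // already recorded
	}
	s.status[id] = Queued
	for _, st := range s.pending[id] {
		s.advance(id, st)
	}
	delete(s.pending, id)
}

// OnWebhook handles a provider receipt. It never creates a message record.
func (s *Store) OnWebhook(id string, st Status) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.status[id]; !ok {
		s.pending[id] = append(s.pending[id], st) // park until recorded
		return
	}
	s.advance(id, st)
}

// Lookup returns the current status and whether the message is recorded.
func (s *Store) Lookup(id string) (Status, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.status[id]
	return st, ok
}

func main() {
	s := NewStore()
	s.OnWebhook("m1", Delivered) // receipt beats the write
	s.RecordMessage("m1")
	s.OnWebhook("m1", Sent)      // late and out of order
	s.OnWebhook("m1", Delivered) // duplicate
	st, _ := s.Lookup("m1")
	fmt.Println(st == Delivered) // prints "true"
}
```

A real system would also need a policy for failure states and an expiry for parked receipts whose message never appears; both are left out here.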

Head-of-line blocking. Providers shared processing capacity, so when one degraded, its backlog throttled traffic for every other provider. I isolated each provider onto its own lane, with its own Kafka topics, dead-letter queue, and retry policy with exponential backoff, so a slow upstream contains its own damage.

Read load. A large share of analytic-store traffic was a small set of hot, repeated query patterns. I put a targeted cache layer in front of those specific patterns rather than caching indiscriminately.

            ┌────────────┐
   inbound ▶│ dispatcher │
            └─────┬──────┘
       ┌──────────┼──────────┐
       ▼          ▼          ▼
    [lane A]   [lane B]   [lane C]   topic·DLQ·retry
       ▼          ▼          ▼
    provider   provider   provider
       A        B (slow)     C
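
The lane layout in the diagram can be sketched in Go as follows. This is a simplified model under assumptions, not the production code: buffered channels stand in for Kafka topics, the topic names ("sms.send.<provider>"), the RetryPolicy fields, and the Lane type are hypothetical, and the DLQ is just a slice. What it shows is the structural point: each lane has its own queue, worker, backoff, and DLQ, so retries against a failing provider only ever consume that lane's time.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errDown = errors.New("provider timed out")

// RetryPolicy is a hypothetical per-provider retry configuration.
type RetryPolicy struct {
	Base, Max   time.Duration
	MaxAttempts int
}

// Backoff returns the wait after failed attempt n (1-based): Base doubled
// n-1 times, capped at Max.
func (p RetryPolicy) Backoff(n int) time.Duration {
	d := p.Base
	for i := 1; i < n && d < p.Max; i++ {
		d *= 2
	}
	if d > p.Max {
		d = p.Max
	}
	return d
}

// Lane owns everything one provider needs, shared with no other provider.
type Lane struct {
	Topic, DLQTopic string // assumed naming scheme
	Retry           RetryPolicy
	Queue           chan string // stand-in for the lane's own topic
	Sleep           func(time.Duration)
	Delivered, DLQ  []string
}

func NewLane(provider string, retry RetryPolicy) *Lane {
	topic := "sms.send." + provider
	return &Lane{Topic: topic, DLQTopic: topic + ".dlq", Retry: retry,
		Queue: make(chan string, 16), Sleep: time.Sleep}
}

// Run drains only this lane's queue until it is closed. Backoff sleeps
// happen here, so they never delay messages bound for another provider.
func (l *Lane) Run(send func(string) error) {
	for msg := range l.Queue {
		delivered := false
		for attempt := 1; attempt <= l.Retry.MaxAttempts; attempt++ {
			if send(msg) == nil {
				delivered = true
				break
			}
			if attempt < l.Retry.MaxAttempts {
				l.Sleep(l.Retry.Backoff(attempt))
			}
		}
		if delivered {
			l.Delivered = append(l.Delivered, msg)
		} else {
			l.DLQ = append(l.DLQ, msg) // retries exhausted: dead-letter it
		}
	}
}

func main() {
	policy := RetryPolicy{Base: 10 * time.Millisecond, Max: 40 * time.Millisecond, MaxAttempts: 3}
	lanes := map[string]*Lane{"a": NewLane("a", policy), "b": NewLane("b", policy)}
	senders := map[string]func(string) error{
		"a": func(string) error { return nil },     // healthy provider
		"b": func(string) error { return errDown }, // degraded provider
	}
	var wg sync.WaitGroup
	for name, lane := range lanes {
		wg.Add(1)
		go func(l *Lane, send func(string) error) {
			defer wg.Done()
			l.Run(send)
		}(lane, senders[name])
	}
	// Dispatcher: route each message to its provider's lane.
	for i := 0; i < 3; i++ {
		lanes["a"].Queue <- fmt.Sprintf("a-%d", i)
		lanes["b"].Queue <- fmt.Sprintf("b-%d", i)
	}
	for _, l := range lanes {
		close(l.Queue)
	}
	wg.Wait()
	fmt.Println(len(lanes["a"].Delivered), len(lanes["b"].DLQ)) // prints "3 3"
}
```

Provider B spends its time sleeping and retrying while provider A's three messages go straight through. With a shared worker pool, those same sleeps would have been taken from capacity A also depends on.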

Decision — provider isolation over a shared pool with priorities. Priority queues would have been less code, but they only reorder contention; they don’t remove it. One persistently slow provider would still degrade the rest. Physical lane isolation per provider — topics, DLQs, retries — makes one provider’s failure structurally unable to block the others.

Decision — cache the hot patterns, not the store. A blanket cache invites staleness and cache-stampede problems across the whole query surface. Scoping the cache to the few hottest patterns captured almost all of the benefit for a fraction of the correctness risk.
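
A pattern-scoped cache of this kind might look like the Go sketch below. It rests on assumptions rather than on the real implementation: the Query shape, the HotPatterns allowlist with a TTL per pattern, and the pattern names are all hypothetical, and an in-process map stands in for whatever cache backend is actually used. Queries whose pattern is not on the allowlist bypass the cache entirely, so they can never be served stale.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Query is a hypothetical analytic query: a named pattern plus its
// canonicalised arguments. The pair is the cache key.
type Query struct {
	Pattern string
	Args    string
}

// HotPatterns is the allowlist: pattern name -> how long a result stays fresh.
type HotPatterns map[string]time.Duration

type entry struct {
	value   string
	expires time.Time
}

// TargetedCache caches only allowlisted patterns; everything else goes
// straight to the store.
type TargetedCache struct {
	mu    sync.Mutex
	hot   HotPatterns
	data  map[Query]entry
	fetch func(Query) string // stand-in for the real analytic-store call
}

func NewTargetedCache(hot HotPatterns, fetch func(Query) string) *TargetedCache {
	return &TargetedCache{hot: hot, data: map[Query]entry{}, fetch: fetch}
}

func (c *TargetedCache) Get(q Query) string {
	ttl, ok := c.hot[q.Pattern] // read-only after construction
	if !ok {
		return c.fetch(q) // not a hot pattern: never cached
	}
	// Holding the lock across the fetch serialises misses. That is a crude
	// stampede guard, at the cost of blocking other hot lookups meanwhile.
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, found := c.data[q]; found && time.Now().Before(e.expires) {
		return e.value
	}
	v := c.fetch(q)
	c.data[q] = entry{value: v, expires: time.Now().Add(ttl)}
	return v
}

func main() {
	calls := 0
	store := func(q Query) string { calls++; return "result for " + q.Pattern }
	c := NewTargetedCache(HotPatterns{"delivery_rate_by_provider": 30 * time.Second}, store)

	hot := Query{Pattern: "delivery_rate_by_provider", Args: "last_24h"}
	cold := Query{Pattern: "ad_hoc_export", Args: "2024-01"}
	for i := 0; i < 10; i++ {
		c.Get(hot)
		c.Get(cold)
	}
	fmt.Println(calls) // prints "11": one fetch for hot, ten for cold
}
```

The correctness surface is the allowlist: each entry is an explicit statement that a given pattern tolerates results up to its TTL old, which is much easier to review than a staleness budget for every query the store accepts.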

Outcome

Analytic-store query load dropped 85–95%, the status race was closed, and the service stopped stalling when a single provider degraded. The per-provider lane model also made provider behavior observable in isolation, which turned “messaging is slow” into “provider X is slow”, a far more actionable signal for whoever is on call.

What I’d revisit

The targeted cache was tuned to the query patterns as they were. I’d add automated detection of new hot patterns so the cache list maintains itself, instead of relying on someone noticing the next read-load creep in a dashboard.