Making cluster spend visible enough to cut

Built internal tooling that attributes cluster spend to the product lines actually consuming it. The visibility alone drove a ~30% reduction in cluster resource consumption.

Context

Cluster spend was a single opaque line item. We knew what the infrastructure cost in total, but not which product lines were responsible for it — which meant nobody could make an informed tradeoff between a feature’s value and its compute footprint. Infrastructure was being treated as a fixed background cost rather than a variable a team could own and improve.

Constraints

Shared cluster, mixed workloads. Product lines co-tenant the same nodes, so cost isn’t visible at the billing boundary the way it is with separate accounts.
The data already existed — it just wasn’t connected. Resource metrics and workload labels were there; nothing tied them back to product ownership.
The goal was behavior change, not a report. A dashboard nobody acts on saves nothing.

Approach

I built an internal tool that attributes spend per product line by joining Kubernetes resource metrics (from Prometheus) against workload labels, mapping consumption onto the domains and products that own it. The breakdown was surfaced in Grafana so each team could see, and therefore own, its footprint, with the numbers framed in terms teams already cared about rather than raw CPU-seconds.

  prometheus metrics + workload labels
           │
   [attribution tool]
   resource usage → product owner
           │
   grafana · per-team dashboards
           │
   teams own their footprint
   → optimizations follow
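
A minimal sketch of the attribution step, not the production tool. It assumes usage has already been pulled from Prometheus into per-pod samples and that ownership lives in a pod label named "product". The UsageSample shape, the label name, the "unattributed" bucket, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """One pod's resource usage over a reporting window (hypothetical shape)."""
    namespace: str
    pod: str
    cpu_core_hours: float


def attribute_usage(samples, pod_labels, owner_label="product", unowned="unattributed"):
    """Sum usage per owning product line.

    pod_labels maps (namespace, pod) -> {label name: value}. Pods with no
    owner label are grouped under `unowned` so the gap stays visible.
    """
    totals = defaultdict(float)
    for sample in samples:
        labels = pod_labels.get((sample.namespace, sample.pod), {})
        totals[labels.get(owner_label, unowned)] += sample.cpu_core_hours
    return dict(totals)


def share_of_total(totals):
    """Express each owner's usage as a fraction of the cluster total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {owner: 0.0 for owner in totals}
    return {owner: used / grand_total for owner, used in totals.items()}


# Illustrative data only; names and numbers are made up.
samples = [
    UsageSample("shop", "checkout-7d9f", 10.0),
    UsageSample("shop", "checkout-b21c", 5.0),
    UsageSample("search", "indexer-0", 20.0),
    UsageSample("tools", "cron-cleanup", 5.0),  # carries no owner label
]
pod_labels = {
    ("shop", "checkout-7d9f"): {"product": "checkout"},
    ("shop", "checkout-b21c"): {"product": "checkout"},
    ("search", "indexer-0"): {"product": "search"},
    ("tools", "cron-cleanup"): {"app": "cleanup"},
}

usage_by_owner = attribute_usage(samples, pod_labels)
shares = share_of_total(usage_by_owner)
```

Keeping unlabeled workloads in their own bucket, rather than dropping them, is a design choice in this sketch: it shows how much of the cluster still has no owner.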

Decision — attribution before optimization. The instinct is to jump straight to right-sizing requests and limits. But un-owned cost doesn’t get optimized; it gets ignored. Making spend visible per owner was the lever — it converted a platform problem into something each team had a reason to act on, and the optimizations followed on their own.

Decision — reuse existing metrics, don’t add an agent. I deliberately built this on the Prometheus data already being collected rather than deploying a new cost-metering layer. Less to run, nothing new to secure, and the attribution was good enough to drive decisions — which was the entire point.
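
To illustrate what reusing the existing data can look like, here is a hedged sketch that builds a PromQL query joining container CPU usage onto pod labels, plus the URL for Prometheus's instant-query endpoint. It assumes the cluster exposes cAdvisor's container_cpu_usage_seconds_total and kube-state-metrics' kube_pod_labels, and that ownership is exported as label_product. The label name and the host are hypothetical, and the actual tool may have used different series.

```python
from urllib.parse import urlencode


def build_attribution_query(owner_label="label_product", window="5m"):
    """Return PromQL that sums CPU usage rate per owner label.

    Assumes kube_pod_labels exposes `owner_label`; kube-state-metrics only
    exports pod labels it has been configured to allow.
    """
    return (
        f"sum by ({owner_label}) ("
        # container!="" skips the pod-level aggregate series to avoid double counting
        f'rate(container_cpu_usage_seconds_total{{container!=""}}[{window}]) '
        # many containers per pod on the left, one label series per pod on the right
        f"* on (namespace, pod) group_left({owner_label}) "
        "kube_pod_labels)"
    )


def build_query_url(base_url, query):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


query = build_attribution_query()
# Hypothetical host; substitute the real Prometheus address.
url = build_query_url("http://prometheus.example:9090/", query)
```

Pods without the owner label fall into a group with an empty label value, which a dashboard can present as unattributed usage.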

Outcome

Cluster resource consumption fell by roughly 30%, and visibility alone drove it. Beyond the number, visibility changed how infrastructure decisions got made: cost became a variable teams reasoned about alongside features, instead of a black box owned by no one. That’s the same instinct that runs through the rest of my platform work: tie infrastructure to the business.

What I’d revisit

Attribution was point-in-time reporting. The natural next step is a budget-and-alert loop, so a product line that drifts past its expected footprint gets flagged at the moment it happens rather than discovered in the next review.
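
One possible shape for that loop, as a sketch under stated assumptions rather than something that was built: compare each product line's attributed footprint against an expected budget and flag anything past a tolerance. The function name, the 10% default tolerance, and the sample figures are hypothetical.

```python
def find_budget_drift(footprint, budgets, tolerance=0.10):
    """Return {owner: fractional overrun} for owners past budget + tolerance.

    footprint and budgets map owner -> usage in the same unit. Owners with no
    budget, or a non-positive one, are skipped rather than guessed at.
    """
    flagged = {}
    for owner, used in footprint.items():
        budget = budgets.get(owner)
        if budget is None or budget <= 0:
            continue
        overrun = (used - budget) / budget
        if overrun > tolerance:
            flagged[owner] = overrun
    return flagged


# Illustrative numbers only.
footprint = {"checkout": 130.0, "search": 100.0, "internal-tools": 40.0}
budgets = {"checkout": 100.0, "search": 100.0}

drift = find_budget_drift(footprint, budgets)
for owner, overrun in sorted(drift.items()):
    print(f"{owner} is {overrun:.0%} over its expected footprint")
```

Run on a schedule against the attribution output, a check like this could feed an alert channel so drift is flagged when it happens instead of in the next review.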