Feedback Requested: Production Monitoring with Prometheus

Just to follow up: I’ve tested the suggested settings, and the results illustrate the “rate then sum” issue quite clearly.

This is the traffic volume observed on one of my services, from the istio-prometheus instance (the one that keeps temporary metrics):

At about 13:04, I restarted the pods, which shows up on that graph: there’s a constant ~28 rps that drops to 0 at that point.

And now, the same thing as observed from my master Prometheus:

The “drop-to-zero” shows up there as a spike to 1390 rps, which doesn’t correspond to anything real; it’s just an artifact of summing the counters before taking the rate.