Feedback Requested: Production Monitoring with Prometheus

All,

There have been a lot of questions about how best to set up monitoring of Istio with Prometheus, especially with the transition to v2 telemetry. This came up most recently in the Istio Community meeting.

To that end, I’ve taken an initial stab at a “Best Practices” guide, which is now available on our docs site: https://istio.io/docs/ops/best-practices/observability/#using-prometheus-for-production-scale-monitoring.

In drafting the doc, I leaned heavily on the experience of v2 early adopters within the community, particularly @Stono.

I’m hoping that the other Prometheus experts in the community can now take a look and provide corrections, improvements, and any additional tips as appropriate.

And, if you want other related guides or have valuable information of a similar nature to share, please let me know and I’m happy to work on adding that information as well.

Thanks in advance for your consideration and help,
Doug.


Hi Doug,

Thanks for raising this. I’ve read both Istio’s article and the earlier one: https://karlstoney.com/2020/02/25/federated-prometheus-to-reduce-metric-cardinality/

I think there’s an important difference between them, and I wonder if the other one isn’t more “correct” regarding the ordering of operations: computing the rate before summing. It relates to this Prometheus rule of thumb: https://www.robustperception.io/rate-then-sum-never-sum-then-rate. I guess we could see the bad effects of this when, for instance, a single pod that exposes metrics is restarted: data would be summed, then rated, which would generate invalid rates and spikes, because that pod’s counter would suddenly be reset and the aggregate would drop.
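
To make the ordering concrete, here is roughly what the two variants look like in PromQL against the standard istio_requests_total counter (the labels are chosen just for illustration):

```
# rate then sum: each pod's counter is rated on its own series first,
# so a restart only produces a brief dip for that pod.
sum by (destination_service) (rate(istio_requests_total[5m]))

# sum then rate: raw counters are aggregated first; when one pod
# restarts, the aggregate suddenly drops, rate() misreads the drop as
# a counter reset, and we get an artificial spike. (Written here as a
# subquery; in the federation setup the sum happens in a recording
# rule and the rate is applied later to the recorded series.)
rate(sum by (destination_service) (istio_requests_total)[5m:])
```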

Of course, doing sum-then-rate simplifies a lot of things, since all the Grafana dashboards and Kiali would continue to work as-is, but it comes at the price of this “bug” on counter resets.

But I’d also like to get some advice from the Prometheus experts :slight_smile: I hope I’m wrong!

@jotak thanks for your insightful response. I struggled a bit in deciding how to provide useful aggregation that was still generic enough to cover most use cases, and I think this is the source of the difference you note. I think the approach of optimizing for the queries you want to run (docs) is the best one, and it solves the rate-then-sum issue, since you can control the ordering at recording time.
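
For example (a rough sketch, not the literal rules from the guide), the Istio-local Prometheus could record the rate before aggregating, so the series federated up to the master is already safe to sum:

```
groups:
- name: istio.workload.istio_requests_total
  interval: 10s
  rules:
  # rate is computed per raw series first, then aggregated to the
  # workload level; the recorded series never sums raw counters.
  - record: workload:istio_requests_total:rate1m
    expr: |
      sum by (source_workload, source_workload_namespace,
              destination_workload, destination_workload_namespace,
              response_code) (
        rate(istio_requests_total{reporter="destination"}[1m])
      )
```

The tradeoff is that the recorded series is a rate (a gauge) rather than a counter, so dashboards pointed at the federated Prometheus would sum it instead of rating it again, which is why stock dashboards may need some tweaking.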

But I’m also curious to hear from experts ;).

Just to follow up: I’ve tested the suggested settings and can quite clearly illustrate the “sum then rate” issue:

This is the request volume observed for one of my services, from the istio-prometheus instance (the one that keeps the short-lived metrics):

At about 13:04, I restarted the pods, which shows up on that graph. There’s a constant ~28 rps that drops to 0 at that point.

And now, the same thing observed from my master Prometheus:

The “drop to zero” shows up there as a spike to 1390 rps, which isn’t connected to anything real; it’s just an artificial spike caused by summing before rating.

The guide says to change the default Prometheus configuration to add some aggregation rules, but currently there’s no way to configure the “default” Prometheus installation that comes when you install Istio with istioctl. See prometheus configuration.
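
For reference, the change itself is small: a rule_files entry in prometheus.yml plus the rules file it points at. The problem is that there’s no istioctl option to inject it, so the only way I can see to get it in today is editing the generated ConfigMap by hand (the mount path below is my assumption and may differ per version):

```
# prometheus.yml inside the addon's ConfigMap
rule_files:
- /etc/prometheus/recording_rules.yml   # assumed mount path of the rules file
```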

Wouldn’t scraping from the default Prometheus mean having to configure mTLS between both Prometheus instances (if not set to permissive)? Adding how to configure federation to use mTLS would be great in that case.
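
I imagine the federation job on the master would end up looking something like this (the cert mount path, target service name, and the match[] pattern are all guesses on my part):

```
scrape_configs:
- job_name: istio-federation
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"workload:.*"}'   # only pull the pre-aggregated series
  scheme: https
  tls_config:
    # assumed: client certs provisioned by Istio and mounted into the pod
    ca_file: /etc/istio-certs/root-cert.pem
    cert_file: /etc/istio-certs/cert-chain.pem
    key_file: /etc/istio-certs/key.pem
    # Istio workload certs carry SPIFFE identities rather than hostnames
    insecure_skip_verify: true
  static_configs:
  - targets:
    - istio-prometheus.istio-system.svc:9090   # assumed service name
```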