Feedback Requested: Production Monitoring with Prometheus

All,

There have been a lot of questions about how best to set up monitoring of Istio with Prometheus, especially with the transition to v2 telemetry. This came up most recently in the Istio Community meeting.

To that end, I’ve taken an initial stab at a “Best Practices” guide, which is now available on our docs site: https://istio.io/docs/ops/best-practices/observability/#using-prometheus-for-production-scale-monitoring.

In drafting the doc, I leaned heavily on the experience of v2 early adopters within the community, particularly @Stono.

I’m hoping that the other Prometheus experts in the community can now take a look and provide corrections, improvements, and any additional tips as appropriate.

And if you want other related guides, or have valuable information of a similar nature to share, please let me know and I’ll be happy to work on adding that information as well.

Thanks in advance for your consideration and help,
Doug.


Hi Doug,

Thanks for raising this. I’ve read both Istio’s article and that previous one: https://karlstoney.com/2020/02/25/federated-prometheus-to-reduce-metric-cardinality/

I think there’s an important difference between them, and I wonder if the other one isn’t more “correct” regarding the ordering of operations: computing the rate before summing. It relates to this Prometheus rule: https://www.robustperception.io/rate-then-sum-never-sum-then-rate . I guess we could see bad effects of this when, for instance, a single pod that exposes metrics is restarted: data would be summed, then rated, which would generate invalid rates and spikes, because a counter would suddenly be reset.
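To illustrate the difference I have in mind (the recording rule name in the second query is hypothetical, and the selector is just an example):

```
# rate then sum: each pod's counter is rated individually, so a reset on
# one pod barely affects the aggregate.
sum(rate(istio_requests_total{destination_service="my-svc.default.svc.cluster.local"}[5m]))

# sum then rate: if a recording rule sums the raw counters and the federated
# Prometheus rates the aggregated series afterwards, then a single pod's
# counter reset makes the summed series drop, rate() treats the drop as a
# reset of the whole aggregate, and we get a fake spike.
rate(istio:requests_total:sum[5m])
```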

Of course, doing sum then rate simplifies a lot of things, as all the Grafana dashboards and Kiali would continue to run fine, but it comes at the price of having this “bug” at counter reset.

But I’d also like to have some advice from Prometheus experts :slight_smile: I hope I’m wrong

@jotak thanks for your insightful response. I struggled a bit in deciding how to provide useful aggregation that was still generic enough to cover most use cases, and I think that’s the source of the difference you note. I think optimizing for the queries you want to run (docs) is the best approach, and it solves the rate-then-sum issue, as you can control it at recording time.
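For example, a recording rule along these lines would apply the rate per pod before aggregating (the rule name, labels, and interval here are illustrative placeholders, not necessarily what the guide uses):

```
groups:
- name: istio-aggregation
  rules:
  # Rate each pod's counter first, then sum by the labels you query on,
  # so a counter reset on a single pod no longer distorts the aggregate.
  - record: istio:istio_requests_total:by_destination:rate5m
    expr: |
      sum by (destination_service, destination_workload_namespace, response_code) (
        rate(istio_requests_total[5m])
      )
```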

But I’m also curious to hear from experts ;).

Just to follow up: I’ve tested the suggested settings, and they quite clearly illustrate the “rate then sum” issue:

This is the request volume observed on one of my services, from the istio-prometheus instance (the one that keeps the temporary metrics):

At about 13:04, I restarted the pods, which shows up on that graph. There’s a constant ~28 rps that drops to 0 at that point.

And now, the same thing observed from my master Prometheus:

The “drop-to-zero” shows up there as a spike to 1390 rps, which isn’t connected to anything real: it’s a fake spike caused by summing before rating. When a pod restarts, its counter resets to 0, so the summed series drops; rate() then treats that drop as a reset of the whole aggregate and counts the entire post-drop value as new increase, hence the spike.

The guide says to change the default Prometheus configuration to add some aggregation rules, but currently there’s no way of configuring the “default” Prometheus installation that comes when you install Istio with istioctl. See prometheus configuration.

Wouldn’t scraping from the default Prometheus mean having to configure mTLS between both Prometheus instances (if the mesh isn’t set to permissive mode)? Adding instructions on how to configure federation to use mTLS would be great in that case.
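For context, I imagine the federation job would need something like the following (just a sketch: the cert mount paths and the target address are assumptions on my side):

```
scrape_configs:
- job_name: 'istio-federation'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{__name__=~"istio_.*"}'
  scheme: https
  tls_config:
    ca_file: /etc/prom-certs/root-cert.pem     # assumed mount path for Istio-issued certs
    cert_file: /etc/prom-certs/cert-chain.pem  # assumed mount path
    key_file: /etc/prom-certs/key.pem          # assumed mount path
    # Istio workload certs carry SPIFFE identities rather than DNS names,
    # so hostname verification has to be skipped.
    insecure_skip_verify: true
  static_configs:
  - targets: ['prometheus.istio-system.svc:9090']  # assumed service name/port
```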

@jotak do you think it would be helpful for the guide to rate (over something like 5m) and then sum, to avoid the issue you are highlighting here? This would require updating the dashboards, etc., too. But maybe that’s worth it? I wasn’t sure what the right thing to do was along these lines.

@ypal configuring the packaged Prometheus does take manual work, but it is possible: you have to edit the ConfigMap that the default install includes.
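Something like this should work, assuming the default install, where the bundled ConfigMap and Deployment are both named `prometheus` in `istio-system` (worth double-checking the names in your cluster); the recording rules would go in as an extra key referenced from `rule_files` in `prometheus.yml`:

```
# Edit the bundled Prometheus configuration
kubectl -n istio-system edit configmap prometheus

# Restart Prometheus so it picks up the changes
kubectl -n istio-system rollout restart deployment prometheus
```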

We are currently thinking about shipping the recording rules out of the box, but we need to make sure we get them right for the general case before we include them.

I’ll look into mTLS protection for Prometheus. We’ve not been injecting sidecars into Prometheus to date (other than to provide certs for communication with protected pods). We’d need special config for Prometheus.

Thanks for your reply.

I didn’t think of manually editing the Prometheus config because it will get overwritten the next time I update the manifest using istioctl. It would be great to have these metrics aggregated by default (or maybe behind a configuration option).

The problem is that doing rate before sum leaves much less flexibility at query time; for instance, we couldn’t adjust the rate interval at query time. I don’t see any perfect solution; either way, the user will have to deal with a trade-off. Also, Kiali is not ready today to work with pre-aggregated metrics, so it would be broken.

I plan to write a blog post specifically about Kiali and this kind of setup (you can see I’ve already written some feedback here: https://github.com/kiali/kiali/issues/2518#issuecomment-606617619 ).

In my opinion, it is fine to present both scenarios to users: the one with preserved metric names but the sum-then-rate issue, and the one with modified metric names, which would break consumers such as Kiali. That way users are aware of the trade-off that the choice implies.

Also, it’s worth noting that the previous Mixer-based telemetry, I believe, already had this “sum-then-rate” issue. I don’t know the implementation details in Mixer, but I remember having seen the same kind of spikes sometimes when pods restart; if that’s true, this wouldn’t be a regression in that regard.