Prometheus alerting on Istio Components

naveensrinivasan · May 6, 2019, 4:56pm

We are using Prometheus Operator with additional scrape configs based on https://github.com/istio/istio/blob/28f2fbbbb0bb4910130b3362cc18c780e1fac87b/install/kubernetes/helm/istio/charts/prometheus/templates/configmap.yaml#L98

We would like to know whether anyone has written additional PromQL queries or alerts for the Istio components. Are there any repos with such Prometheus alerts that we can look at?

Thanks

douglas-reid · May 6, 2019, 6:58pm

A while back, we started down the path of a proposal for adding a set of standard alerts, but not much progress was made (as we got distracted with other items). If you have specific behavior that you would like to alert on, we’d love to have feedback on that to build into our ongoing work on enabling Prom Operator integrations (the new installer should already support ServiceMonitor generation).

naveensrinivasan · May 7, 2019, 2:58pm

We are looking to the Istio authors to provide some guidelines on the metrics.

Is there any documentation on the Prometheus metrics and their expected ranges? That would be helpful for us. How do we measure the health of Istio?

Thanks

douglas-reid · May 7, 2019, 6:19pm

Unfortunately, we don’t have any good documentation on all of the metrics exposed by the various Istio components. I believe, however, that someone is actively working on such documentation.

As for measuring the health of Istio components, the best reference that exists today (at least, as published by the Istio team) is the set of component dashboards that we ship with the Grafana addon, as well as the performance dashboards.

One thing that might be useful is to raise this question on the Slack channels (maybe #policies-telemetry). I know that a few people that run Istio in production have developed their own alerts/dashboards based on the exposed metrics and can provide some tailored advice based on their experience.

I’ll update here when we have docs on the metrics available.

asadique · May 9, 2019, 3:54pm

We are also interested in this alerting capability. Any pointers to resources would be greatly appreciated.

asadique · May 9, 2019, 4:01pm

@douglas-reid What we are after is a dashboard that lets us troubleshoot a failure, i.e. see which part of Istio is causing requests to fail. Having such a feature would be really helpful from an ops perspective.

douglas-reid · May 10, 2019, 5:13pm

@asadique have you looked at the various dashboards that are packaged with the Grafana addon? Do they address some of your needs? That is certainly the intent of those dashboards, though they definitely could use an update and reassessment.

naveensrinivasan · May 10, 2019, 7:11pm

We are planning to use the dashboards as a starting point for writing some PromQL queries with alerting on top.
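As a rough illustration of that starting point, here is a minimal sketch that pulls the PromQL expressions out of a dashboard JSON document. It assumes the usual Grafana layout in which each panel carries a `targets` list whose entries hold the query under `expr`; the sample dashboard is a made-up stand-in, not one of the shipped Istio dashboards.

```python
def collect_exprs(node):
    """Recursively collect PromQL strings stored under targets[].expr."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "targets" and isinstance(value, list):
                for target in value:
                    if isinstance(target, dict) and isinstance(target.get("expr"), str):
                        found.append(target["expr"])
            else:
                found.extend(collect_exprs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_exprs(item))
    return found


# Hypothetical stand-in for a dashboard loaded with json.load(); older
# dashboards nest panels under "rows", newer ones use a top-level "panels".
SAMPLE_DASHBOARD = {
    "title": "Sample",
    "rows": [
        {"panels": [{"title": "Requests",
                     "targets": [{"expr": "sum(rate(istio_requests_total[1m]))"}]}]}
    ],
    "panels": [
        {"title": "Errors",
         "targets": [{"expr": 'sum(rate(istio_requests_total{response_code="503"}[1m]))'}]}
    ],
}

if __name__ == "__main__":
    for expr in collect_exprs(SAMPLE_DASHBOARD):
        print(expr)
```

Each extracted expression can then be reviewed by hand and given a threshold before it becomes an alert; the dashboards themselves do not say what a healthy range is.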

4c74356b41 · May 12, 2019, 8:29pm

Hey, as far as I understand, porting those dashboards to a standalone Grafana is not an easy task. Would they work when Grafana is in a different namespace? I think not. Ideally I’d integrate them into prometheus-operator, but I don’t really think that’s doable, right? You’d need some additional steps on the Istio side for this to work?

asadique · May 13, 2019, 5:13pm

I did, but they are a bit overwhelming for our platform consumers (development teams) to use. I’m looking to see what the community is doing in this space.

asadique · May 13, 2019, 5:14pm

Hi Naveen, is any part of that effort open source by chance, or could it be shared?

douglas-reid · May 13, 2019, 5:37pm

I’m not sure why they are not usable. I’d love some feedback on any issues that are being faced.

The dashboards are just JSON files and should be directly importable. After that, they just need to be pointed at a Prometheus data source, which can be assigned in Grafana itself once the dashboards have been imported.

Namespaces should not matter.
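For anyone who prefers to script the import, the following is a minimal sketch of preparing a dashboard for Grafana's dashboard HTTP API (`POST /api/dashboards/db`). It assumes the dashboard stores its data source as a plain string under `datasource` keys; the data source name `my-prometheus` and the sample dashboard are hypothetical, and sending the request is left out.

```python
import json


def retarget_datasource(node, datasource_name):
    """Return a copy of node with every "datasource" field set to datasource_name."""
    if isinstance(node, dict):
        return {
            key: datasource_name if key == "datasource"
            else retarget_datasource(value, datasource_name)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [retarget_datasource(item, datasource_name) for item in node]
    return node


def build_import_payload(dashboard, datasource_name):
    """Wrap a dashboard in the body shape expected by POST /api/dashboards/db."""
    board = retarget_datasource(dashboard, datasource_name)
    board["id"] = None  # a null id asks Grafana to create a new dashboard
    return {"dashboard": board, "overwrite": True}


# Hypothetical stand-in for a dashboard file loaded with json.load().
SAMPLE_BOARD = {
    "id": 7,
    "title": "Sample",
    "panels": [
        {"title": "Requests", "datasource": "Prometheus",
         "targets": [{"expr": "up"}]}
    ],
}

if __name__ == "__main__":
    # "my-prometheus" is an assumed data source name in the target Grafana.
    print(json.dumps(build_import_payload(SAMPLE_BOARD, "my-prometheus"), indent=2))
```

The resulting body could be posted with any HTTP client and an API token. Importing through the Grafana UI achieves the same result without code.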

4c74356b41 · May 13, 2019, 7:14pm

Just by diffing the Istio YAML with grafana.enabled=true and =false, I can see that the difference is not only the presence of Grafana and the custom dashboards; some additional Istio resources are created as well. Also, I’m not sure what needs to happen on the Prometheus side for these dashboards to work. I’ve tried setting up Grafana with grafana.datasources.datasources.datasources.url=custom_prometheus_url (who invented this parameter name??) and Grafana just fails to load the console.

douglas-reid · May 13, 2019, 7:59pm

There is no need to use helm in this case. You can take the JSON files directly. They are not templated.

When we are developing and manually testing the dashboards, we often just import them directly into an already running version of Grafana.

The data source stuff is all so that a stock grafana deployment can pick up the Istio dashboards as part of the addon install. Nothing requires its use if you just want the dashboards themselves.

asadique · May 13, 2019, 10:41pm

Hi @douglas-reid, do you know whether the Istio components keep track of the service traffic that goes through them? For example, if I wanted to see why a specific service failed at the Mixer level, is it possible to get that kind of information from the metrics collected by the Istio components?

douglas-reid · May 13, 2019, 10:55pm

@asadique Mixer reports to Mixer itself, which I think is what you are asking. You should be able to monitor traffic to both the istio-policy and istio-telemetry services using the istio_requests_total metric.
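To make that concrete, here is a minimal sketch that builds a 5xx-ratio expression over `istio_requests_total` and wraps it in a Prometheus alerting rule. The `destination_service` and `response_code` label names, the service host, the 5% threshold, and the alert name are all assumptions for illustration and should be checked against the labels your Prometheus actually shows.

```python
import json


def error_ratio_expr(service_host, window="5m"):
    """PromQL for the share of 5xx responses among requests to service_host."""
    selector = f'destination_service="{service_host}"'
    errors = f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


def alert_rule(name, service_host, threshold=0.05, hold="10m"):
    """One alerting rule in the shape used by Prometheus rule files."""
    return {
        "alert": name,
        "expr": f"({error_ratio_expr(service_host)}) > {threshold}",
        "for": hold,
        "labels": {"severity": "warning"},
        "annotations": {"summary": f"High 5xx ratio for requests to {service_host}"},
    }


if __name__ == "__main__":
    # Assumed host name; adjust to the namespace Istio is installed in.
    host = "istio-telemetry.istio-system.svc.cluster.local"
    group = {"groups": [{"name": "istio-mixer",
                         "rules": [alert_rule("IstioTelemetryHigh5xx", host)]}]}
    # Printed as JSON for inspection; rule files are normally written as YAML.
    print(json.dumps(group, indent=2))
```

The same function can be called with the istio-policy host, though that rule will have no data unless policy checks are enabled.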

Galley and Pilot, however, don’t directly report to Mixer.

jshaughn · May 14, 2019, 1:40pm

Note that if you are using Kiali for observability you can investigate the istio-system namespace (or whatever namespace in which Istio is installed) visually.

asadique · May 14, 2019, 6:22pm

Thanks @douglas-reid. Just wondering, is the istio-policy check always invoked by Envoy for every service-to-service interaction, or is it conditional? I have the BookInfo app deployed to my cluster, but when checking Prometheus for metrics, I only see calls to the istio-telemetry service and none to the istio-policy service.

douglas-reid · May 14, 2019, 6:42pm

With Istio 1.1, istio-policy has been disabled by default. As a result, you will not see any calls to istio-policy unless you have explicitly enabled it. You should only enable it if you have specific policies you wish to enforce using the Mixer APIs.

asadique · May 14, 2019, 7:05pm

Thanks @douglas-reid, that makes sense now. This diagram: https://istio.io/docs/concepts/security/architecture.svg has arrows going to Pilot with an annotation that says Routing + Policy. My understanding is that Mixer is responsible for that feature. Is the diagram annotation misplaced, or are the responsibilities shared by Pilot and Mixer? I just want to make sure that I am not missing a metric for a security service call to Pilot, if one exists.
