Prometheus alerting on Istio Components

We are using Prometheus Operator with additional scrape configs based on https://github.com/istio/istio/blob/28f2fbbbb0bb4910130b3362cc18c780e1fac87b/install/kubernetes/helm/istio/charts/prometheus/templates/configmap.yaml#L98

We would like to know whether anyone has written additional PromQL queries or alerts for Istio components. Are there any repos with such Prometheus alerts that we can look at?

Thanks

A while back, we started on a proposal to add a set of standard alerts, but we did not make much progress (we got distracted by other items). If there is specific behavior that you would like to alert on, we’d love feedback on it to build into our ongoing work on enabling Prom Operator integrations (the new installer should currently support ServiceMonitor generation).

We are looking to the Istio authors to provide some guidelines on the metrics.

Is there any documentation on the Prometheus metrics and what their expected ranges are? That would be helpful for us. How do we measure the health status of Istio?

Thanks

Unfortunately, we don’t have any good documentation on all of the metrics exposed by the various Istio components. I believe, however, that someone is actively working on such documentation.

As for measuring the health of Istio components, the best reference that exists today (at least, as published by the Istio team) is the set of component dashboards that we ship with the Grafana addon, as well as the performance dashboards.

One thing that might be useful is to raise this question on the Slack channels (maybe #policies-telemetry). I know that a few people that run Istio in production have developed their own alerts/dashboards based on the exposed metrics and can provide some tailored advice based on their experience.

I’ll update here when we have docs on the metrics available.

We are also interested in this feature / alert capability. Any resource reference would be greatly appreciated.

@douglas-reid A dashboard that would let us troubleshoot a failure, that is, show which part of Istio is causing requests to fail, would be really helpful from an Ops perspective.

@asadique have you looked at the various dashboards that are packaged with the Grafana addon? Do they address some of your needs? That is certainly the intent of those dashboards, though they definitely could use an update and reassessment.

We are planning to use the dashboards as a starting point for writing some PromQL queries with alerting on top.
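As a rough illustration of what such a dashboard-derived alert might look like, the sketch below builds a 5xx error-ratio expression on istio_requests_total and wraps it in a Prometheus alerting-rule structure. The label names (reporter, destination_service, response_code), the 5% threshold, the 10m duration, and the alert and service names are assumptions for illustration only; check them against the labels your Istio version actually exports before using anything like this.

```python
import json


def error_ratio_expr(destination_service: str, window: str = "5m") -> str:
    """Build a PromQL expression for the fraction of 5xx responses.

    Assumes istio_requests_total carries the labels `reporter`,
    `destination_service` and `response_code` (verify for your version).
    """
    selector = f'reporter="destination",destination_service="{destination_service}"'
    errors = f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


def alert_rule(name: str, destination_service: str,
               threshold: float = 0.05, for_: str = "10m") -> dict:
    """Wrap the expression in an alerting-rule dict (threshold is a placeholder)."""
    return {
        "alert": name,
        "expr": f"{error_ratio_expr(destination_service)} > {threshold}",
        "for": for_,
        "labels": {"severity": "warning"},
        "annotations": {"summary": f"High 5xx ratio for {destination_service}"},
    }


if __name__ == "__main__":
    # Hypothetical alert name and service; printed as JSON for inspection.
    rule = alert_rule("IstioHigh5xxRatio", "reviews.default.svc.cluster.local")
    print(json.dumps({"groups": [{"name": "istio-example", "rules": [rule]}]}, indent=2))
```

The printed structure mirrors the layout of a Prometheus rule group, so it can be compared against whatever rules the Prometheus Operator is already loading.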

Hey, porting those dashboards to a standalone Grafana is not an easy task, as far as I understand? Would they work when Grafana is in a different namespace? I think not. Ideally I’d integrate them into prometheus-operator, but I don’t really think that’s doable, right? You’d need some additional steps on the Istio side for this to work?

I did, but they are a bit overwhelming for our platform consumers (development teams) to use. I am looking to see what the community is doing in this space.

Hi Naveen, is any part of that effort open source by chance, or can it be shared?

I’m not sure why they are not usable. I’d love some feedback on any issues that are being faced.

The dashboards are just JSON files and should be directly importable. After that, they only need to be pointed at a Prometheus data source, which can be assigned in Grafana itself once the dashboards have been imported.

Namespaces should not matter.

Just by diffing the Istio YAML generated with grafana.enabled=true and with grafana.enabled=false, I can see that the difference is not only the presence of Grafana and the custom dashboards; some additional Istio things are being created as well. Also, I’m not sure what needs to happen on the Prometheus side for these dashboards to work. I’ve tried setting up Grafana with grafana.datasources.datasources.datasources.url=custom_prometheus_url (who invented this parameter name??) and Grafana just fails to load the console.

There is no need to use helm in this case. You can take the JSON files directly. They are not templated.

When we are developing and manually testing the dashboards, we often just import them directly into an already running version of Grafana.

The data source stuff is all so that a stock grafana deployment can pick up the Istio dashboards as part of the addon install. Nothing requires its use if you just want the dashboards themselves.
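For anyone who wants to script the direct import rather than click through the UI, here is a minimal sketch against Grafana's HTTP API (POST /api/dashboards/db). The Grafana URL, the API token, and the dashboard file name are placeholders, not values from this thread, and the dashboard's data source may still need to be selected in Grafana afterwards.

```python
import json
import urllib.request


def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a dashboard JSON document in the body expected by /api/dashboards/db."""
    dash = dict(dashboard)   # shallow copy so the caller's dict is not modified
    dash["id"] = None        # let the target Grafana assign its own numeric id
    return {"dashboard": dash, "overwrite": overwrite}


def import_dashboard(grafana_url: str, api_token: str, dashboard: dict) -> dict:
    """POST the dashboard to a running Grafana and return its JSON reply."""
    body = json.dumps(build_import_payload(dashboard)).encode("utf-8")
    request = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder file name, URL and token; adjust to your environment.
    with open("istio-mesh-dashboard.json", encoding="utf-8") as handle:
        dashboard_json = json.load(handle)
    print(import_dashboard("http://localhost:3000", "REPLACE_ME", dashboard_json))
```

Nothing here depends on the namespace Grafana runs in; only the URL has to be reachable from wherever the script runs.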

Hi @douglas-reid, do you know whether the Istio components keep track of the service traffic that goes through them? For example, if I wanted to see why a specific service failed at the Mixer level, is it possible to get that kind of information from the metrics collected by the Istio components?

@asadique Mixer reports to Mixer itself, which I think is what you are asking. You should be able to monitor traffic to both the istio-policy and istio-telemetry services using the istio_requests_total metric.
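A hedged sketch of how that traffic could be pulled out of Prometheus programmatically: it builds an istio_requests_total query filtered by destination service and parses the instant-query response returned by /api/v1/query. The Prometheus URL, the destination_service and response_code label names, and the "istio-telemetry" prefix match are assumptions; adjust them to what your installation actually reports.

```python
import json
import urllib.parse
import urllib.request


def mixer_rate_query(service_prefix: str, window: str = "5m") -> str:
    """PromQL: request rate to services whose name starts with the prefix."""
    return (
        "sum by (destination_service, response_code) ("
        f'rate(istio_requests_total{{destination_service=~"{service_prefix}.*"}}[{window}])'
        ")"
    )


def parse_vector(response: dict) -> dict:
    """Turn an instant-query response into {(service, code): rate}."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    rates = {}
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        key = (labels.get("destination_service", ""), labels.get("response_code", ""))
        rates[key] = float(sample["value"][1])  # value is [timestamp, "number"]
    return rates


def query_prometheus(base_url: str, query: str) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/api/v1/query?{params}") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder URL; point it at your own Prometheus.
    reply = query_prometheus("http://localhost:9090", mixer_rate_query("istio-telemetry"))
    for (service, code), rate in sorted(parse_vector(reply).items()):
        print(f"{service} {code} {rate:.3f} req/s")
```

Running the same query with "istio-policy" as the prefix would show whether any policy checks are being recorded at all.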

Galley and Pilot, however, don’t directly report to Mixer.

Note that if you are using Kiali for observability you can investigate the istio-system namespace (or whatever namespace in which Istio is installed) visually.

Thanks @douglas-reid. Just wondering, is the istio-policy check always invoked by Envoy for any service-to-service interaction, or is it conditional? I have the BookInfo app deployed to my cluster, but when checking Prometheus for metrics, I only see calls to the istio-telemetry service and none to the istio-policy service.

With Istio 1.1, istio-policy has been disabled by default. As a result, you will not see any calls to istio-policy unless you have explicitly enabled it. You should only enable it if you have specific policies you wish to enforce using the Mixer APIs.

Thanks @douglas-reid, that makes sense now. This diagram: https://istio.io/docs/concepts/security/architecture.svg has arrows going to Pilot with an annotation that says Routing + Policy. My understanding is that Mixer is responsible for that feature. Is the diagram annotation misplaced, or are the responsibilities shared by Pilot and Mixer? I just want to make sure that I am not missing a metric for a security service call to Pilot, if one exists.