Monitoring Istio components?

Hey
I’m new to Istio, and as a cluster operator I’m trying to figure out what I should monitor before offering Istio to my users. I assume that Mixer and Pilot are the critical services and that I should monitor that they are running - but is there existing documentation on what to monitor?

A good place to start may be the existing Grafana dashboards. They provide examples of monitoring Istio components (and the queries that power that monitoring).

You can also look through the docs on the metrics exported by the various Istio components (for example, Pilot).
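
For reference, the bundled Prometheus configuration scrapes Pilot’s self-monitoring endpoint with a job roughly like the sketch below. This assumes the default `http-monitoring` port (15014) exposed by the `istio-pilot` service in `istio-system`, so double-check both against your own Istio release.

  # Sketch of a Prometheus scrape job for Pilot's self-monitoring endpoint.
  # Assumes the istio-pilot service in istio-system exposes an
  # "http-monitoring" port (15014 by default); verify against your install.
  scrape_configs:
    - job_name: 'pilot'
      kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names:
              - istio-system
      relabel_configs:
        # Keep only the istio-pilot service's http-monitoring endpoint.
        - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: istio-pilot;http-monitoring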

That’s what I ended up doing, but having such documentation would be really valuable.

@omerlh

I ended up doing something like this:

  - alert: IstioPilotAvailabilityDrop
    annotations:
      summary: 'Istio Pilot Availability Drop'
      description: 'Pilot pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Envoy sidecars might have outdated configuration'
    expr: >
      avg(avg_over_time(up{job="pilot"}[1m])) < 0.5
    for: 5m

  - alert: IstioMixerTelemetryAvailabilityDrop
    annotations:
      summary: 'Istio Mixer Telemetry Drop'
      description: 'Mixer pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio metrics will not work correctly'
    expr: >
      avg(avg_over_time(up{job="mixer", service="istio-telemetry", endpoint="http-monitoring"}[5m])) < 0.5
    for: 5m

  - alert: IstioGalleyAvailabilityDrop
    annotations:
      summary: 'Istio Galley Availability Drop'
      description: 'Galley pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio config ingestion and processing will not work'
    expr: >
      avg(avg_over_time(up{job="galley"}[5m])) < 0.5
    for: 5m

  - alert: IstioGatewayAvailabilityDrop
    annotations:
      summary: 'Istio Gateway Availability Drop'
      description: 'Gateway pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Inbound traffic will likely be affected'
    expr: >
      min(kube_deployment_status_replicas_available{deployment="istio-ingressgateway", namespace="istio-system"}) without (instance, pod) < 2
    for: 5m

  - alert: IstioPilotPushErrorsHigh
    annotations:
      summary: 'Number of Istio Pilot push errors is too high'
      description: 'Pilot has too many push errors during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Envoy sidecars might have outdated configuration'
    expr: >
      sum(irate(pilot_xds_push_errors{job="pilot"}[5m])) / sum(irate(pilot_xds_pushes{job="pilot"}[5m])) > 0.05
    for: 5m

  - alert: IstioMixerPrometheusDispatchesLow
    annotations:
      summary: 'Number of Mixer dispatches to Prometheus is too low'
      description: 'Mixer dispatches to Prometheus have dropped below normal levels during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio metrics might not be being exported properly'
    expr: >
      sum(irate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[5m])) < 180
    for: 5m

  - alert: IstioGlobalRequestRateHigh
    annotations:
      summary: 'Istio Global Request Rate High'
      description: 'Istio global request rate is unusually high during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The amount of traffic being generated inside the service mesh is higher than normal'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) > 1200
    for: 5m

  - alert: IstioGlobalRequestRateLow
    annotations:
      summary: 'Istio global request rate too low'
      description: 'Istio global request rate is unusually low during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The amount of traffic being generated inside the service mesh has dropped below usual levels'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) < 300
    for: 5m

  - alert: IstioGlobalHTTP5xxRateHigh
    annotations:
      summary: 'Istio Percentage of HTTP 5xx responses is too high'
      description: 'Istio global HTTP 5xx rate is too high in last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The rate of HTTP 5xx responses within the service mesh is unusually high'
    expr: >
       sum(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination"}[5m])) > 0.01
    for: 5m

  - alert: IstioGatewayOutgoingSuccessLow
    annotations:
      summary: 'Istio Gateway outgoing success rate is too low'
      description: 'Istio Gateway success rate to outbound destinations is too low in last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Inbound traffic may be affected'
    expr: >
      sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway",source_workload_namespace="istio-system", connection_security_policy!="mutual_tls",response_code!~"5.*"}[5m])) /  sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway",source_workload_namespace="istio-system", connection_security_policy!="mutual_tls"}[5m])) < 0.995
    for: 5m
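
To actually load rules like these, Prometheus expects them inside a rule group (or a PrometheusRule resource if you run the Prometheus Operator). A minimal wrapper is sketched below; the file name and group name are placeholders, and the numeric thresholds above (request rates, replica counts) are tuned to my cluster, so adjust them to your own baseline.

  # istio-alerts.rules.yaml (hypothetical file name) -- wrap the alerts above
  # in a standard Prometheus rule group and reference it from prometheus.yml
  # via the rule_files setting.
  groups:
    - name: istio.alerts
      rules:
        - alert: IstioPilotAvailabilityDrop
          annotations:
            summary: 'Istio Pilot Availability Drop'
            description: 'Pilot pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Envoy sidecars might have outdated configuration'
          expr: >
            avg(avg_over_time(up{job="pilot"}[1m])) < 0.5
          for: 5m
        # ...the remaining alerts from the list above slot in here unchanged.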

@crhuber

In my Istio 1.4.5 environment the pilot_xds_push_errors metric isn’t available. Does it only get created when there are errors, or do you know if this metric was removed?

@deepak_deore it looks like that metric was removed in this PR: https://github.com/istio/istio/pull/15405.

Thanks @douglas-reid. Not sure if there is any other equivalent metric available.
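
One possible substitute, though treat this as an assumption and verify it against your own /metrics output: `pilot_total_xds_rejects` counts proxies rejecting configuration pushed by Pilot and can drive a roughly similar alert. A sketch:

  # Hypothetical replacement for IstioPilotPushErrorsHigh; confirm that
  # pilot_total_xds_rejects is exposed by your Istio version before relying on it.
  - alert: IstioPilotXdsRejectsHigh
    annotations:
      summary: 'Istio Pilot xDS rejects are too high'
      description: 'Proxies are rejecting configuration pushed by Pilot (current value: {{ $value }}). Envoy sidecars might have outdated configuration'
    expr: >
      sum(irate(pilot_total_xds_rejects[5m])) > 0
    for: 5m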

Hi,
I’m new to Istio. I was trying to configure Prometheus to alert on the metrics above, and I was also curious about the “repair action” to take when these alerts start firing. Can anyone help me understand what needs to be done when these alerts fire and how to resolve them?
Thanks.