Monitoring Istio components?

Hey
I’m new to Istio, and as a cluster operator I’m trying to figure out what I should monitor before providing Istio to my users. I assume that Mixer and Pilot are the critical services and that I should monitor that they are running - but is there any existing documentation on what I should monitor?

A good place to start may be the existing Grafana dashboards. They provide examples of monitoring Istio components (and the queries that power that monitoring).

You can also look through the docs on the metrics exported by the various Istio components (for example, Pilot).

That’s what I ended up doing, but having such documentation would be really valuable.

@omerlh

I ended up doing something like this:

  - alert: IstioPilotAvailabilityDrop
    annotations:
      summary: 'Istio Pilot Availability Drop'
      description: 'Pilot pods have dropped during the last 5m (current availability: *{{ $value | humanizePercentage }}*). Envoy sidecars might have outdated configuration'
    expr: >
      avg(avg_over_time(up{job="pilot"}[5m])) < 0.5
    for: 5m

  - alert: IstioMixerTelemetryAvailabilityDrop
    annotations:
      summary: 'Istio Mixer Telemetry Drop'
      description: 'Mixer pods have dropped during the last 5m (current availability: *{{ $value | humanizePercentage }}*). Istio metrics will not work correctly'
    expr: >
      avg(avg_over_time(up{job="mixer", service="istio-telemetry", endpoint="http-monitoring"}[5m])) < 0.5
    for: 5m

  - alert: IstioGalleyAvailabilityDrop
    annotations:
      summary: 'Istio Galley Availability Drop'
      description: 'Galley pods have dropped during the last 5m (current availability: *{{ $value | humanizePercentage }}*). Istio config ingestion and processing will not work'
    expr: >
      avg(avg_over_time(up{job="galley"}[5m])) < 0.5
    for: 5m

  - alert: IstioGatewayAvailabilityDrop
    annotations:
      summary: 'Istio Gateway Availability Drop'
      description: 'Gateway pods have dropped during the last 5m (available replicas: *{{ printf "%.0f" $value }}*). Inbound traffic will likely be affected'
    expr: >
      min(kube_deployment_status_replicas_available{deployment="istio-ingressgateway", namespace="istio-system"}) without (instance, pod) < 2
    for: 5m

  - alert: IstioPilotPushErrorsHigh
    annotations:
      summary: 'Number of Istio Pilot push errors is too high'
      description: 'Pilot has had too many push errors during the last 5m (error ratio: *{{ $value | humanizePercentage }}*). Envoy sidecars might have outdated configuration'
    expr: >
      sum(irate(pilot_xds_push_errors{job="pilot"}[5m])) / sum(irate(pilot_xds_pushes{job="pilot"}[5m])) > 0.05
    for: 5m

  - alert: IstioMixerPrometheusDispatchesLow
    annotations:
      summary: 'Number of Mixer dispatches to Prometheus is too low'
      description: 'Mixer dispatches to Prometheus have dropped below normal levels during the last 5m (current value: *{{ printf "%.2f" $value }}* dispatches/s). Istio metrics might not be being exported properly'
    expr: >
      sum(irate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[5m])) < 180
    for: 5m

  - alert: IstioGlobalRequestRateHigh
    annotations:
      summary: 'Istio Global Request Rate High'
      description: 'Istio global request rate has been unusually high during the last 5m (current value: *{{ printf "%.2f" $value }}* req/s). The amount of traffic being generated inside the service mesh is higher than normal'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) > 1200
    for: 5m

  - alert: IstioGlobalRequestRateLow
    annotations:
      summary: 'Istio global request rate too low'
      description: 'Istio global request rate has been unusually low during the last 5m (current value: *{{ printf "%.2f" $value }}* req/s). The amount of traffic being generated inside the service mesh has dropped below usual levels'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) < 300
    for: 5m

  - alert: IstioGlobalHTTP5xxRateHigh
    annotations:
      summary: 'Istio Percentage of HTTP 5xx responses is too high'
      description: 'Istio global HTTP 5xx rate has been too high during the last 5m (current value: *{{ $value | humanizePercentage }}*). The rate of HTTP 5xx errors within the service mesh is unusually high'
    expr: >
      sum(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination"}[5m])) > 0.01
    for: 5m

  - alert: IstioGatewayOutgoingSuccessLow
    annotations:
      summary: 'Istio Gateway outgoing success rate is too low'
      description: 'Istio Gateway success rate to outbound destinations has been too low during the last 5m (current value: *{{ $value | humanizePercentage }}*). Inbound traffic may be affected'
    expr: >
      sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway", source_workload_namespace="istio-system", connection_security_policy!="mutual_tls", response_code!~"5.*"}[5m]))
      /
      sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway", source_workload_namespace="istio-system", connection_security_policy!="mutual_tls"}[5m]))
      < 0.995
    for: 5m
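
In case it helps anyone: these rules don’t stand on their own — they need to live inside a rule group in a Prometheus rules file (or, if you run the Prometheus Operator, inside a `PrometheusRule` resource). A minimal sketch of the latter, assuming the operator is set up with a `ruleSelector` matching the label below — the metadata names and the label value are placeholders you’d adapt to your own setup:

```yaml
# Hypothetical wrapper for the alert list above.
# metadata.name/namespace and the `prometheus` label are assumptions;
# the label must match your Prometheus ruleSelector for rules to be picked up.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-alerts
  namespace: istio-system
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: istio.alerts
      rules:
        - alert: IstioPilotAvailabilityDrop
          annotations:
            summary: 'Istio Pilot Availability Drop'
            description: 'Pilot pods have dropped during the last 5m. Envoy sidecars might have outdated configuration'
          expr: >
            avg(avg_over_time(up{job="pilot"}[5m])) < 0.5
          for: 5m
        # ... the remaining alerts from the list above go here,
        # indented the same way under `rules:` ...
```

With plain Prometheus (no operator), the same `groups:` block goes in a file referenced from `rule_files:` in `prometheus.yml`, and you can sanity-check it with `promtool check rules <file>`.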