Prometheus alerting on Istio Components

@asadique unfortunately, Policy is a bit overloaded. There are some auth policies that are enforced in the proxy (without the need to go to Mixer). Those are configured and pushed to the proxies by Pilot, but Pilot does not currently do any enforcement itself (it is not called in the request path). You are not missing any metrics.
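
For example (a minimal sketch using the pre-1.5 authentication.istio.io API; the namespace is illustrative), a peer authentication Policy like this is translated by Pilot and pushed to the sidecars, which enforce it without calling Mixer:

  apiVersion: authentication.istio.io/v1alpha1
  kind: Policy
  metadata:
    name: default
    namespace: foo        # illustrative namespace
  spec:
    peers:
    - mtls: {}            # require mutual TLS for inbound traffic to workloads in this namespace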

Some more advanced/configurable policy enforcement is also available via Mixer.

Does that make sense?

Thank you @douglas-reid

We ended up doing something like this

  - alert: IstioPilotAvailabilityDrop
    annotations:
      summary: 'Istio Pilot Availability Drop'
      description: 'Pilot pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Envoy sidecars might have outdated configuration'
    expr: >
      avg(avg_over_time(up{job="pilot"}[1m])) < 0.5
    for: 5m

  - alert: IstioMixerTelemetryAvailabilityDrop
    annotations:
      summary: 'Istio Mixer Telemetry Drop'
      description: 'Mixer pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio metrics will not work correctly'
    expr: >
      avg(avg_over_time(up{job="mixer", service="istio-telemetry", endpoint="http-monitoring"}[5m])) < 0.5
    for: 5m

  - alert: IstioGalleyAvailabilityDrop
    annotations:
      summary: 'Istio Galley Availability Drop'
      description: 'Galley pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio config ingestion and processing will not work'
    expr: >
      avg(avg_over_time(up{job="galley"}[5m])) < 0.5
    for: 5m

  - alert: IstioGatewayAvailabilityDrop
    annotations:
      summary: 'Istio Gateway Availability Drop'
      description: 'Gateway pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Inbound traffic will likely be affected'
    expr: >
      min(kube_deployment_status_replicas_available{deployment="istio-ingressgateway", namespace="istio-system"}) without (instance, pod) < 2
    for: 5m

  - alert: IstioPilotPushErrorsHigh
    annotations:
      summary: 'Number of Istio Pilot push errors is too high'
      description: 'Pilot has too many push errors during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Envoy sidecars might have outdated configuration'
    expr: >
      sum(irate(pilot_xds_push_errors{job="pilot"}[5m])) / sum(irate(pilot_xds_pushes{job="pilot"}[5m])) > 0.05
    for: 5m

  - alert: IstioMixerPrometheusDispatchesLow
    annotations:
      summary: 'Number of Mixer dispatches to Prometheus is too low'
      description: 'Mixer dispatches to Prometheus have dropped below normal levels during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio metrics might not be exported properly'
    expr: >
      sum(irate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[5m])) < 180
    for: 5m

  - alert: IstioGlobalRequestRateHigh
    annotations:
      summary: 'Istio Global Request Rate High'
      description: 'Istio global request rate is unusually high during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The amount of traffic being generated inside the service mesh is higher than normal'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) > 1200
    for: 5m

  - alert: IstioGlobalRequestRateLow
    annotations:
      summary: 'Istio global request rate too low'
      description: 'Istio global request rate is unusually low during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The amount of traffic being generated inside the service mesh has dropped below usual levels'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) < 300
    for: 5m

  - alert: IstioGlobalHTTP5xxRateHigh
    annotations:
      summary: 'Istio Percentage of HTTP 5xx responses is too high'
      description: 'Istio global HTTP 5xx rate is too high in the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The rate of HTTP 5xx errors within the service mesh is unusually high'
    expr: >
       sum(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination"}[5m])) > 0.01
    for: 5m

  - alert: IstioGatewayOutgoingSuccessLow
    annotations:
      summary: 'Istio Gateway outgoing success rate is too low'
      description: 'Istio Gateway success rate to outbound destinations is too low in the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Inbound traffic may be affected'
    expr: >
      sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway",source_workload_namespace="istio-system", connection_security_policy!="mutual_tls",response_code!~"5.*"}[5m])) /  sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway",source_workload_namespace="istio-system", connection_security_policy!="mutual_tls"}[5m])) < 0.995
    for: 5m


@crhuber Thanks for sharing.

Hi @crhuber

Are these configs added to the Istio Prometheus configmap.yaml?
If possible, can you share more details on how to set up the alerts and Alertmanager, with a complete configuration for Istio?
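
One common way to load rules like the ones above (a sketch only; the ConfigMap name, rules key, and mount path are assumptions that depend on how Prometheus was installed) is to add a rules file to the Prometheus ConfigMap and reference it from prometheus.yml via rule_files:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: prometheus            # name of the bundled Prometheus ConfigMap (assumption)
    namespace: istio-system
  data:
    istio.rules.yml: |
      groups:
      - name: istio-components
        rules:
        - alert: IstioPilotAvailabilityDrop
          expr: avg(avg_over_time(up{job="pilot"}[1m])) < 0.5
          for: 5m
    prometheus.yml: |
      rule_files:
      - /etc/prometheus/istio.rules.yml   # path where the ConfigMap is mounted (assumption)
      # ... keep the existing scrape_configs and other settings from the shipped configmap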

Hi,

Is there any way to add Alertmanager on Istio v1.3.3?

Maybe via the customizable install with Helm?

Thanks in advance.
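
One approach is to deploy Alertmanager separately and point the bundled Prometheus at it (a sketch; the Alertmanager service name and namespace below are assumptions) by adding an alerting section to prometheus.yml:

  alerting:
    alertmanagers:
    - static_configs:
      - targets:
        - alertmanager.monitoring.svc.cluster.local:9093   # assumed Alertmanager service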

Hi,

This is what I could find out:

Regards

@crhuber I don’t see the pilot_xds_push_errors metric present in 1.4.2 and 1.4.5.

Do you see this metric present in your env?
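
One way to check which pilot_xds counters a given Prometheus is actually scraping (a sketch; the job="pilot" label matches the rules above but may differ in your setup):

  count by (__name__) ({__name__=~"pilot_xds.*", job="pilot"})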

Where did you put these? I tried with the Prometheus configmap, adding them to both the alerting_rules.yml and alerts sections, but they are not visible.

For reference: Alerting rules | Prometheus

For Istio 1.10, I suggest viewing:

It includes:
‘ingress gateway traffic missing’
‘5xx rate too high’
The workload request latency P99 > 160ms
ProxyContainerCPUUsageHigh
ProxyContainerMemoryUsageHigh
IngressMemoryUsageIncreaseRateHigh
IstiodContainerCPUUsageHigh
IstiodMemoryUsageHigh
IstiodMemoryUsageIncreaseRateHigh
‘istiod push errors is too high’
‘istiod rejects rate is too high’
IstiodContainerNotReady
Ingress200RateLow
and so on.
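
As an illustration, two of the alerts listed above could be sketched like this (thresholds and label selectors are assumptions; pilot_total_xds_rejects is istiod’s xDS reject counter, and the CPU query assumes cAdvisor and kube-state-metrics v2 metric names are available):

  - alert: IstiodRejectsRateHigh
    annotations:
      summary: 'istiod rejects rate is too high'
      description: 'Proxies are rejecting xDS config pushed by istiod (current value: {{ $value }})'
    expr: >
      sum(irate(pilot_total_xds_rejects[5m])) > 0
    for: 5m

  - alert: IstiodContainerCPUUsageHigh
    annotations:
      summary: 'istiod container CPU usage is too high'
      description: 'istiod CPU usage is above 80% of its limit (current value: {{ $value }})'
    expr: >
      sum(rate(container_cpu_usage_seconds_total{namespace="istio-system", container="discovery"}[5m])) by (pod)
      /
      sum(kube_pod_container_resource_limits{namespace="istio-system", container="discovery", resource="cpu"}) by (pod)
      > 0.8
    for: 10m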