Hi,
I’m new to Istio. I am configuring Prometheus to alert on the metrics below, and I would also like to know what “repair action” to take when these alerts start firing. Can anyone help me work out what needs to be done when they fire and how to resolve the underlying problem?
- name: istio.alert.rules
  rules:
  - alert: IstioPilotAvailabilityDrop
    annotations:
      summary: 'Istio Pilot Availability Drop'
      description: 'Pilot pods have dropped during the last 5m. Envoy sidecars might have outdated configuration'
    expr: >
      avg(avg_over_time(up{job="pilot"}[5m])) < 0.5
    for: 5m
  - alert: IstioMixerTelemetryAvailabilityDrop
    annotations:
      summary: 'Istio Mixer Telemetry Drop'
      description: 'Mixer pods have dropped during the last 5m. Istio metrics will not work correctly'
    expr: >
      avg(avg_over_time(up{job="mixer", service="istio-telemetry", endpoint="http-monitoring"}[5m])) < 0.5
    for: 5m
  - alert: IstioGalleyAvailabilityDrop
    annotations:
      summary: 'Istio Galley Availability Drop'
      description: 'Galley pods have dropped during the last 5m. Istio config ingestion and processing will not work'
    expr: >
      avg(avg_over_time(up{job="galley"}[5m])) < 0.5
    for: 5m
  - alert: IstioPilotPushErrorsHigh
    annotations:
      summary: 'Number of Istio Pilot push errors is too high'
      description: 'Pilot has too many push errors during the last 5m. Envoy sidecars might have outdated configuration'
    expr: >
      sum(irate(pilot_xds_push_errors{job="pilot"}[5m])) / sum(irate(pilot_xds_pushes{job="pilot"}[5m])) > 0.05
    for: 5m
  - alert: IstioMixerPrometheusDispatchesLow
    annotations:
      summary: 'Number of Mixer dispatches to Prometheus is too low'
      description: 'Mixer dispatches to Prometheus have dropped below normal levels during the last 5m. Istio metrics might not be exported properly'
    expr: >
      sum(irate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[5m])) < 1
    for: 5m
  - alert: IstioGlobalHTTP5xxRateHigh
    annotations:
      summary: 'Istio Percentage of HTTP 5xx responses is too high'
      description: 'Istio global HTTP 5xx rate is too high in last 5m. The rate of HTTP 5xx errors within the service mesh is unusually high'
    expr: >
      sum(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination"}[5m])) > 0.01
    for: 5m
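To make the question more concrete, here is a rough sketch of how I imagine a repair action could be attached to one of these rules. The `severity` label and the `runbook_url` annotation are naming conventions I am assuming, not something Prometheus requires, and the URL is only a placeholder.

```yaml
- alert: IstioPilotAvailabilityDrop
  # Assumed convention: a severity label that Alertmanager routing could match on.
  labels:
    severity: critical
  annotations:
    summary: 'Istio Pilot Availability Drop'
    description: 'Pilot pods have dropped during the last 5m. Envoy sidecars might have outdated configuration'
    # Hypothetical placeholder URL; it would point to the documented repair steps.
    runbook_url: 'https://example.com/runbooks/istio-pilot-availability-drop'
  expr: >
    avg(avg_over_time(up{job="pilot"}[5m])) < 0.5
  for: 5m
```

What I am missing is what the runbook behind such a link should actually say for each of these alerts.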
Thanks.