Issue with HTTP communication from Prometheus to Alertmanager

Bug description

Since upgrading from Istio 1.7 to 1.8, we've been seeing issues specifically with the communication from our Prometheus to our Alertmanager (pushing alerts via HTTP).

Error message:

level=error ts=2021-05-31T07:16:45.210Z caller=notifier.go:527 component=notifier alertmanager=http://100.96.9.151:9093/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 404 Not Found"

Both Prometheus and Alertmanager are installed via the prometheus-operator, so two headless Kubernetes Services exist, called prometheus-operated and alertmanager-operated. These Services are hard-coded into the prometheus-operator and cannot be changed.
In addition, each "installation" of those two components has its own Service, which is not headless and is under our full control (alertmanager-main and prometheus-k8s).

One thing we tried that did work was to scale the prometheus-operator down to 0 and add the appProtocol field with the value http to the ports of the -operated Kubernetes Services. However, the change is overwritten unless the operator stays scaled to zero.

Setting appProtocol on the Services we control ourselves does not help.
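
A rough sketch of what the appProtocol change described above might look like on the operator-managed Service. The namespace, the selector label, and the port name web are assumptions about a typical prometheus-operator install and should be checked against the real Service; only the Service name, port 9093, and the appProtocol value come from this report.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-operated
  namespace: monitoring        # assumed namespace
spec:
  clusterIP: None              # headless, as created by the operator
  selector:
    app: alertmanager          # assumed label; copy the selector from the real Service
  ports:
  - name: web                  # assumed port name
    port: 9093
    targetPort: 9093
    appProtocol: http          # explicit protocol hint for Istio
```

Because the operator reconciles this Service, a manual edit like this only persists while the operator is scaled to zero, as noted above.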

Any clue how to fix it?

Version (include the output of istioctl version --remote and kubectl version --short and helm version --short if you used Helm)

Istio version:

➜ istioctl version --remote
client version: 1.10.0
control plane version: 1.8.5
data plane version: 1.8.5 (78 proxies)

Kubernetes version:

➜ kubectl version --short
Client Version: v1.19.7
Server Version: v1.19.10

How was Istio installed?

via the istio-operator also in version 1.8.5

Environment where the bug was observed (cloud vendor, OS, etc)

Running on a kOps cluster on AWS.

We are seeing the same issue.
@guusvw let me know if you managed to find an acceptable solution.

Alertmanager no longer listening on localhost seems to be causing this issue; see https://github.com/prometheus-operator/prometheus-operator/pull/4038

You can set alertmanager.alertmanagerSpec.listenLocal=True (the Helm value, which ends up as spec.listenLocal in the Alertmanager CR) to make it listen on localhost.

Istio 1.9 and earlier require the app to listen on localhost; 1.10 and later do not.
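
As a minimal sketch, the listenLocal setting mentioned above could be applied directly on the Alertmanager custom resource like this. The resource name and namespace are placeholders, not taken from this thread.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main                   # hypothetical name
  namespace: monitoring        # assumed namespace
spec:
  listenLocal: true            # bind Alertmanager to the loopback interface only
```

When the stack is deployed through the Helm chart, the same field is set through the alertmanager.alertmanagerSpec.listenLocal value rather than by editing the CR.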

Same here.
When enabling listenLocal=true, Prometheus POSTs to the alerts endpoint return 503:
level=error ts=2021-08-17T12:11:54.179Z caller=notifier.go:527 component=notifier alertmanager=http://172.31.36.94:9093/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 503 Service Unavailable"

along with the message: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"

When disabling listenLocal, alerts return 404:

"POST /api/v2/alerts HTTP/1.1" 404 NR route_not_found - "-" 0 0 0 - "-" "Prometheus/2.24.0" "d673b25b-ead0-429a-8f59-d881c1804ab5" "172.31.36.35:9093" "-" - - 172.31.36.35:9093 172.31.36.94:34952 - -

level=error ts=2021-08-17T11:45:22.465Z caller=notifier.go:527 component=notifier alertmanager=http://172.31.36.35:9093/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 404 Not Found"

The only workaround is to disable Istio sidecar injection in the monitoring namespace.
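
One common way to turn off injection for a whole namespace is the istio-injection label, sketched below. This assumes the namespace is literally named monitoring and that injection is controlled by the namespace label rather than per-pod annotations.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring             # assumed namespace name
  labels:
    istio-injection: disabled  # skip automatic sidecar injection here
```

Pods that already have a sidecar keep it until they are recreated, so the Prometheus and Alertmanager pods need a restart after the label changes.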

Please help!

If you run the prometheus operator with strict mTLS in your monitoring namespace, extra work is needed; see GitHub issue #145 in prometheus-community/helm-charts: "[kube-prometheus-stack] tlsConfig support for servicemonitors and default alertingEndpoints to work with istio strict mtls".

With Istio 1.10 and 1.11, Alertmanager breaks. One workaround I have found is to make Alertmanager listen on localhost only and use a Sidecar resource to route inbound traffic to localhost:9093:

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: alertmanager
spec:
  # Apply only to the Alertmanager pods
  workloadSelector:
    labels:
      alertmanager: prom-kube-prometheus-stack-alertmanager
  ingress:
  - port:
      number: 9093
      protocol: TCP
      name: tcp
    # Forward inbound traffic on 9093 to the loopback listener
    defaultEndpoint: 127.0.0.1:9093

I am still trying to figure out how to make this work on Istio 1.10+ without this workaround, with Alertmanager listening on all IPs instead of localhost only.

It worked after removing the duplicate Service. I had two Services for Alertmanager, alertmanager-operated and prom-kube-prometheus-stack-alertmanager. I removed alertmanager-operated and now everything works. I had to delete this Service 3-4 times, since the operator appeared to be recreating it.

It looks like a conflict arose from having two Services for the same port 9093.

Thanks! Will check it out.
Should listenLocal be true for Prometheus as well, or for Alertmanager only?

Still, I hope Istio will solve this properly without all these workarounds.

I set listenLocal to true only for Alertmanager.

Really, the prometheus operator Helm chart should ship with all these options.