Since upgrading from Istio 1.7 to 1.8, we have been seeing issues, specifically with the communication from our Prometheus to our Alertmanager (pushing alerts via HTTP).
level=error ts=2021-05-31T07:16:45.210Z caller=notifier.go:527 component=notifier alertmanager=http://100.96.9.151:9093/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 404 Not Found"
Both Prometheus and Alertmanager are installed via the prometheus-operator, so two headless Kubernetes Services exist: prometheus-operated and alertmanager-operated. These Services are hard-coded into the prometheus-operator and cannot be changed.
In addition, each “installation” of those two components has its own Service, which is not headless and is fully under our control (alertmanager-main and prometheus-k8s).
One thing we tried that worked was to scale the prometheus-operator down to 0 and add appProtocol with the value http to the -operated Kubernetes Services. However, the change is overwritten unless the operator stays scaled to zero.
If we set appProtocol on the Services we control instead, it does not help.
Any clue how to fix it?
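For anyone wanting to reproduce the appProtocol workaround, here is a minimal Python sketch that builds a JSON Patch setting appProtocol on one Service port. The helper name and the port index 0 are assumptions for illustration; check which entry under spec.ports is the 9093 port in your cluster. The operator will still revert the change while it is running.

```python
import json


def app_protocol_patch(port_index: int = 0, protocol: str = "http") -> str:
    """Build a JSON Patch (RFC 6902) that sets appProtocol on one Service port.

    port_index is the position of the port in spec.ports (assumed to be 0 here).
    """
    ops = [
        {
            "op": "add",  # "add" on an object member also replaces an existing value
            "path": f"/spec/ports/{port_index}/appProtocol",
            "value": protocol,
        }
    ]
    return json.dumps(ops)


if __name__ == "__main__":
    print(app_protocol_patch())
```

The printed patch could then be passed to something like `kubectl patch service alertmanager-operated --type=json -p '<patch>'` in the relevant namespace.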
Version (include the output of istioctl version --remote and kubectl version --short and helm version --short if you used Helm)
➜ istioctl version --remote
client version: 1.10.0
control plane version: 1.8.5
data plane version: 1.8.5 (78 proxies)
➜ kubectl version --short
Client Version: v1.19.7
Server Version: v1.19.10
How was Istio installed?
via the istio-operator also in version 1.8.5
Environment where the bug was observed (cloud vendor, OS, etc)
When I set listenLocal=true, Prometheus's POST to the alerts endpoint returns 503:
level=error ts=2021-08-17T12:11:54.179Z caller=notifier.go:527 component=notifier alertmanager=http://172.31.36.94:9093/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 503 Service Unavailable"
and it shows "upstream connect error or disconnect/reset before headers. reset reason: connection failure".
I am still trying to figure out how to make it work without this setting on Istio 1.10+, with Alertmanager listening on all IPs instead of localhost only.
It worked after removing the duplicate Service. I had two Services for Alertmanager, alertmanager-operated and prom-kube-prometheus-stack-alertmanager. I removed alertmanager-operated and now everything is working. I had to delete this Service 3-4 times, because the operator appeared to be recreating it.
It looks like a conflict arose from having two Services for the same port, 9093.
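To check for this kind of overlap, here is a rough Python sketch that groups Services by selector and port number and reports any combination claimed by more than one Service. The input shape, the selector label, and the extra 9094 port are hypothetical simplifications, not actual cluster output. Treating "same pods" as "identical selector" is also an assumption, since two different selectors can still match the same pods.

```python
from collections import defaultdict


def find_port_overlaps(services):
    """Return {(selector, port): [service names]} for entries claimed by 2+ Services."""
    claims = defaultdict(list)
    for svc in services:
        # Normalise the selector so identical label sets compare equal.
        selector = tuple(sorted(svc["selector"].items()))
        for port in svc["ports"]:
            claims[(selector, port)].append(svc["name"])
    return {key: names for key, names in claims.items() if len(names) > 1}


# Hypothetical, simplified view of the two Alertmanager Services.
services = [
    {
        "name": "alertmanager-operated",
        "selector": {"app": "alertmanager"},
        "ports": [9093, 9094],
    },
    {
        "name": "prom-kube-prometheus-stack-alertmanager",
        "selector": {"app": "alertmanager"},
        "ports": [9093],
    },
]

if __name__ == "__main__":
    for (selector, port), names in find_port_overlaps(services).items():
        print(port, names)
```

In practice the input could be derived from `kubectl get services -o json`, mapping each item's spec.selector and spec.ports[].port into this shape.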