Prometheus consistently fails to scrape some targets within an Istio mesh

Hi all,

I’ve already asked in the Prometheus community, but I think it’s worth asking here as well to get another perspective on a behaviour we’re currently observing.

We’ve got a GKE Kubernetes cluster (v1.16.15-gke.7800) where Istio 1.8.3 has been installed and is managing pods from the default namespace.

Istio is installed with the default “PERMISSIVE” mTLS mode, meaning (as far as I understand) that every Envoy sidecar accepts plain HTTP traffic as well as mTLS.
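
For reference, this is roughly the mesh-wide PeerAuthentication that corresponds to that default PERMISSIVE behaviour (a sketch, not taken from our actual cluster config):

    # Mesh-wide default: sidecars accept both mTLS and plain-text traffic
    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system
    spec:
      mtls:
        mode: PERMISSIVE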

Everything is deployed in the default namespace, and every pod BUT prometheus/alertmanager/grafana is managed by Istio (i.e. the monitoring stack is out of the mesh; we did this by using the neverInjectSelector key in the istio-sidecar-injector ConfigMap).
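
This is roughly what that looks like in the injector ConfigMap (the label selector below is illustrative; ours matches the prometheus/alertmanager/grafana pods):

    # Excerpt of the istio-sidecar-injector ConfigMap (sketch)
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: istio-sidecar-injector
      namespace: istio-system
    data:
      config: |
        policy: enabled
        neverInjectSelector:
          - matchExpressions:
              - key: app
                operator: In
                values: ["prometheus", "alertmanager", "grafana"]
        # template: ... (rest of the injector config unchanged)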

Prometheus can successfully scrape almost all of its targets (defined via ServiceMonitors), but there are a few that it consistently fails to scrape.
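
For context, the failing target comes from a ServiceMonitor along these lines (the labels and port name here are illustrative, not the exact manifest; the name matches the "default/divolte/0" scrape pool seen in the logs below):

    # Sketch of the ServiceMonitor behind the failing scrape pool
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: divolte
      namespace: default
    spec:
      selector:
        matchLabels:
          app: divolte
      endpoints:
        - port: metrics      # the 7070 port served by the jmx-exporter sidecar
          path: /metrics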

For example, in the Prometheus logs I can see:

level=debug ts=2021-02-19T11:15:55.595Z caller=scrape.go:927 component="scrape manager" scrape_pool=default/divolte/0 target=http://10.172.22.36:7070/metrics msg="Scrape failed" err="server returned HTTP status 503 Service Unavailable"

But if I exec into the Prometheus pod, I can successfully reach the target that it’s failing to scrape:

   /prometheus $ wget -SqO /dev/null http://10.172.22.36:7070/metrics
      HTTP/1.1 200 OK
      date: Fri, 19 Feb 2021 11:27:57 GMT
      content-type: text/plain; version=0.0.4; charset=utf-8
      content-length: 75758
      x-envoy-upstream-service-time: 57
      server: istio-envoy
      connection: close
      x-envoy-decorator-operation: divolte-srv.default.svc.cluster.local:7070/*

What am I missing? The 503 seems to be the actual response from the target, which means Prometheus does reach it while scraping, yet it still gets the error.

What I can’t understand is what the differences are, in terms of “networking path” and “involved pieces”, between the scrape (which fails) and the “internal wget” (which succeeds), nor how to debug this further.

Here are the relevant logs from the main container and the Envoy/Istio proxy.

Main container

❯ k logs divolte-dpy-594d8cb676-vgd9l prometheus-jmx-exporter
DEBUG: Environment variables set/received...
Service port (metrics): 7070
Destination host: localhost
Destination port: 5555
Rules to appy: divolte
Local JMX: 7071

CONFIG FILE not found, enabling PREPARE_CONFIG feature
Preparing configuration based on environment variables
Configuration preparation completed, final cofiguration dump:
############
---
hostPort: localhost:5555
username:
password:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
########
Starting Service..

Istio-proxy

❯ k logs divolte-dpy-594d8cb676-vgd9l istio-proxy -f

2021-02-22T07:41:15.450702Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T07:41:15.451182Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T07:41:15.894626Z info xdsproxy Envoy ADS stream established
2021-02-22T07:41:15.894837Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:15012
2021-02-22T08:11:25.679886Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T08:11:25.680655Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T08:11:25.936956Z info xdsproxy Envoy ADS stream established
2021-02-22T08:11:25.937120Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:15012
2021-02-22T08:39:56.813543Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T08:39:56.814249Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T08:39:57.183354Z info xdsproxy Envoy ADS stream established
2021-02-22T08:39:57.183653Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:150

Hi, I’ve managed to enable the istio-proxy access logs, and this is what I can see.
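
For reference, this is roughly the overlay used to turn the Envoy access log on (a sketch; exact values may differ from what we actually applied):

    # Enable Envoy access logging mesh-wide via the IstioOperator meshConfig
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    spec:
      meshConfig:
        accessLogFile: /dev/stdout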

This is the access log entry when I run wget from inside the Prometheus container, which succeeds:

[2021-02-23T10:58:55.066Z] "GET /metrics HTTP/1.1" 200 - "-" 0 75771 51 50 "-" "Wget" "4dae0790-1a6a-4750-bc33-4617a6fbaf16" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42380 10.172.22.36:7070 10.172.23.247:38210 - default

This is the access log entry when the Prometheus scrape fails:

[2021-02-23T10:58:55.536Z] "GET /metrics HTTP/1.1" 503 UC "-" 0 95 53 - "-" "Prometheus/2.11.0" "2c97c597-6a32-44ed-a2fb-c1d37a2644b3" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42646 10.172.22.36:7070 10.172.23.247:33758 - default

Any clues? Thank you

Hi everyone,

we finally found the problem: it was due to a bug in the old version (0.11) of jmx-exporter that we were using. It’s explained here: https://groups.google.com/u/1/g/prometheus-users/c/3e0jUuWmiAM

Upgrading jmx-exporter to the latest version (0.15) fixed everything :slight_smile:
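
In case it helps anyone, the change itself was just bumping the exporter sidecar image in the Deployment, roughly like this (the image name below is a placeholder for our registry, adjust to whatever you actually pull from):

    # Hypothetical excerpt of the divolte Deployment: only the sidecar tag changed
    containers:
      - name: prometheus-jmx-exporter
        image: our-registry.example.com/jmx-prometheus-exporter:0.15.0   # was 0.11
        ports:
          - name: metrics
            containerPort: 7070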