"http://127.0.0.1:15000/stats?usedonly" response takes a long time

Hi.

I’m currently testing Istio 1.6.3.
It appears to be working fine, but only certain Pods frequently emit the following log from the istio-proxy container:

istio-proxy 2020-06-25T00:24:28.709648Z    warn    Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Accessing the endpoint directly from the istio-proxy container can indeed take well over one second:

time curl "http://127.0.0.1:15000/stats?usedonly&filter=%5E%28server.state%7Clistener_manager.workers_started%29"

real	0m5.600s
user	0m0.000s
sys	0m0.011s

To investigate, I removed the filter from the query above and fetched the stats again:

curl "http://127.0.0.1:15000/stats?usedonly"

More than 1,000 records like the following were returned:

reporter=.=source;.;source_workload=.=foo-main-service;.;source_workload_namespace=.=foo;.;source_principal=.=unknown;.;source_app=.=foo-main-service;.;source_version=.=unknown;.;source_canonical_service=.=foo-main-service;.;source_canonical_revision=.=latest;.;destination_workload=.=bar-master-service;.;destination_workload_namespace=.=bar;.;destination_principal=.=unknown;.;destination_app=.=bar-master-service;.;destination_version=.=unknown;.;destination_service=.=bar-master-service.bar.svc.cluster.local;.;destination_service_name=.=bar-master-service;.;destination_service_namespace=.=bar;.;destination_canonical_service=.=bar-master-service;.;destination_canonical_revision=.=latest;.;request_protocol=.=grpc;.;response_code=.=200;.;grpc_response_status=.=0;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_response_bytes: P0(nan,330.0) P25(nan,545.636) P50(nan,860.983) P75(nan,2904.91) P90(nan,2961.97) P95(nan,2980.98) P99(nan,2996.2) P99.5(nan,2998.1) P99.9(nan,2999.62) P100(nan,3000.0)

I suspect this is the cause, but what is the solution?
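
For reference, a quick way to count how many stat entries the admin endpoint is serving (a one-liner sketch, run from inside the istio-proxy container):

# count the stat entries Envoy returns on the admin port
curl -s "http://127.0.0.1:15000/stats?usedonly" | wc -l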

I solved this problem by setting spec.values.telemetry.v2.prometheus.enabled=false in the IstioOperator configuration:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-operator
spec:
...
  values:
...
    telemetry:
      v2:
        prometheus:
          enabled: false
...
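
For anyone wondering how to apply it, something like the following works (just a sketch; istio-operator.yaml stands for whatever file holds the IstioOperator spec above):

# re-apply the IstioOperator spec with the telemetry override
istioctl install -f istio-operator.yaml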

@11110 I am also facing the same issue; some of my pods show this error. I can try the solution you provided, but will that disable metrics?
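
One way to check whether the standard Istio metrics are still being emitted after the change is to scrape the sidecar's Envoy Prometheus endpoint directly (a rough sketch, assuming the default port 15090 and the istio_requests_total metric):

# count istio_requests_total series exposed by the sidecar; 0 would suggest telemetry v2 stats are off
kubectl exec <pod> -c istio-proxy -- curl -s http://localhost:15090/stats/prometheus | grep -c istio_requests_total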

We’re seeing this same issue. It appears to be correlated with timeouts from external traffic.

We are running Istio 1.6.8 in Azure AKS clusters in multiple Azure regions and are seeing these across all of our clusters.

Our Kubernetes events show readiness failures from all of our ingress gateways:

$ kubectl -n istio-networking get events
LAST SEEN   TYPE      REASON      OBJECT                              MESSAGE
3m11s       Warning   Unhealthy   pod/root-gateway-6f795d6ddd-2tdb5   Readiness probe failed: Get http://10.100.3.138:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m48s       Warning   Unhealthy   pod/root-gateway-6f795d6ddd-5gdgk   Readiness probe failed: Get http://10.100.2.0:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m11s       Warning   Unhealthy   pod/root-gateway-6f795d6ddd-g5scm   Readiness probe failed: Get http://10.100.10.150:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m55s       Warning   Unhealthy   pod/root-gateway-6f795d6ddd-q4bl7   Readiness probe failed: Get http://10.100.0.141:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m10s       Warning   Unhealthy   pod/root-gateway-6f795d6ddd-rd7hg   Readiness probe failed: Get http://10.100.4.245:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
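
For anyone debugging the same thing, the probe configuration itself can be inspected on the gateway deployment with something like this (a sketch; adjust the namespace and deployment name for your install):

# show the readiness probe settings (path, period, timeout) used by the gateway
kubectl -n istio-networking get deploy root-gateway -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'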

Looking at the logs for one root-gateway pod, it toggles between ready and unready states fairly frequently. Sometimes it flips back and forth within a second or two; other times it takes tens of seconds.

2020-11-25T14:12:56.807999Z     info    Envoy proxy is ready
2020-11-25T14:13:39.806982Z     warn    Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:13:40.807928Z     info    Envoy proxy is ready
2020-11-25T14:14:55.576062Z     warning envoy config    [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 13,
2020-11-25T14:16:07.807146Z     warn    Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:09.807107Z     warn    Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:10.807974Z     info    Envoy proxy is ready
2020-11-25T14:16:49.806974Z     warn    Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:50.808131Z     info    Envoy proxy is ready
2020-11-25T14:26:33.451799Z     info    Subchannel Connectivity change to CONNECTING
2020-11-25T14:26:33.451892Z     info    transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-11-25T14:26:33.451910Z     info    pickfirstBalancer: HandleSubConnStateChange: 0xc000dce640, {CONNECTING <nil>}
2020-11-25T14:26:33.451939Z     info    Channel Connectivity change to CONNECTING
2020-11-25T14:26:33.451933Z     info    Subchannel picks a new address "istiod.istio-system.svc:15012" to connect
2020-11-25T14:26:33.460759Z     info    Subchannel Connectivity change to READY
2020-11-25T14:26:33.460791Z     info    pickfirstBalancer: HandleSubConnStateChange: 0xc000dce640, {READY <nil>}
2020-11-25T14:26:33.460812Z     info    Channel Connectivity change to READY
2020-11-25T14:35:09.806868Z     warn    Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:35:11.081282Z     info    Envoy proxy is ready

We tried the change above (setting spec.values.telemetry.v2.prometheus.enabled to false), but it didn’t seem to help much.

These errors are not constant, but when they occur they can generate thousands of failures over a 5-minute period (the count comes from a graph of log entries containing “Envoy proxy is NOT ready” and “Client.Timeout exceeded while awaiting headers” over the last 12 hours, binned in 5-minute increments).
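
If the logs aren’t aggregated anywhere, a rough way to get a similar count straight from one gateway pod is something like this (a sketch; the namespace, deployment name, and time window are just examples taken from above):

# count readiness failures logged by one gateway pod over the last 12 hours
kubectl -n istio-networking logs deploy/root-gateway --since=12h | grep -c "Envoy proxy is NOT ready"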

We’re now starting a test that disables all of the telemetry values, something like:

spec:
  values:
    telemetry:
      enabled: false
      v2:
        enabled: false
        prometheus:
          enabled: false

It’ll be several hours before we know how that goes. Very curious to know whether anyone else has seen this and has come up with another fix in the meantime. I haven’t been able to find much about it beyond this thread.
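
As far as we understand, in 1.6 the telemetry v2 stats filters are installed as EnvoyFilter resources in the control-plane namespace, so one rough sanity check that the change took effect is to see whether they are still listed (a sketch):

# list EnvoyFilters; the telemetry v2 stats/metadata-exchange filters should disappear when disabled
kubectl -n istio-system get envoyfilter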

Disabling telemetry did not work for us.

However, we did upgrade from 1.6.8 to 1.6.14, which seems to have solved our issue.

In particular, 1.6.10 included this change, which has stopped our ingress gateways from repeatedly toggling between ready and unready states.

We believe this is what made the difference for us, though we also haven’t been seeing the errors in the Envoy logs, so it’s possible (even likely) that the underlying root cause was also addressed somewhere between 1.6.8 and 1.6.14.

[Edit: Turns out this didn’t solve our issue. We continue investigating.]