We’re seeing this same issue. It appears to be correlated with timeouts from external traffic.
We are running Istio 1.6.8 on AKS clusters in multiple Azure regions and are seeing these failures across all of our clusters.
Our Kubernetes events show readiness probe failures from all of our ingress gateway pods:
$ kubectl -n istio-networking get events
LAST SEEN TYPE REASON OBJECT MESSAGE
3m11s Warning Unhealthy pod/root-gateway-6f795d6ddd-2tdb5 Readiness probe failed: Get http://10.100.3.138:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m48s Warning Unhealthy pod/root-gateway-6f795d6ddd-5gdgk Readiness probe failed: Get http://10.100.2.0:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m11s Warning Unhealthy pod/root-gateway-6f795d6ddd-g5scm Readiness probe failed: Get http://10.100.10.150:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m55s Warning Unhealthy pod/root-gateway-6f795d6ddd-q4bl7 Readiness probe failed: Get http://10.100.0.141:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m10s Warning Unhealthy pod/root-gateway-6f795d6ddd-rd7hg Readiness probe failed: Get http://10.100.4.245:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
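For what it's worth, the failing probe is the stock one pointed at the gateway's 15021 health port; the timeout and failure thresholds can be read straight off the deployment (deployment name here is ours, adjust as needed):

$ kubectl -n istio-networking get deploy root-gateway -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'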
Looking at the logs for one root-gateway pod, it toggles between ready and not-ready states fairly frequently. Sometimes it switches back and forth within a second or two; other times it takes tens of seconds.
2020-11-25T14:12:56.807999Z info Envoy proxy is ready
2020-11-25T14:13:39.806982Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:13:40.807928Z info Envoy proxy is ready
2020-11-25T14:14:55.576062Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 13,
2020-11-25T14:16:07.807146Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:09.807107Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:10.807974Z info Envoy proxy is ready
2020-11-25T14:16:49.806974Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:50.808131Z info Envoy proxy is ready
2020-11-25T14:26:33.451799Z info Subchannel Connectivity change to CONNECTING
2020-11-25T14:26:33.451892Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-11-25T14:26:33.451910Z info pickfirstBalancer: HandleSubConnStateChange: 0xc000dce640, {CONNECTING <nil>}
2020-11-25T14:26:33.451939Z info Channel Connectivity change to CONNECTING
2020-11-25T14:26:33.451933Z info Subchannel picks a new address "istiod.istio-system.svc:15012" to connect
2020-11-25T14:26:33.460759Z info Subchannel Connectivity change to READY
2020-11-25T14:26:33.460791Z info pickfirstBalancer: HandleSubConnStateChange: 0xc000dce640, {READY <nil>}
2020-11-25T14:26:33.460812Z info Channel Connectivity change to READY
2020-11-25T14:35:09.806868Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:35:11.081282Z info Envoy proxy is ready
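When a pod is flapping like this, hitting the same Envoy admin endpoint the agent polls, from inside the istio-proxy container, shows whether the admin interface itself is slow to answer (pod name below is just one of ours, and this assumes curl is available in the proxy image, which it is in the stock proxyv2 image):

$ kubectl -n istio-networking exec root-gateway-6f795d6ddd-2tdb5 -c istio-proxy -- \
    curl -s -o /dev/null -w '%{time_total}\n' \
    'http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)'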
We tried the change suggested above, disabling spec.values.telemetry.v2.prometheus.enabled, but it didn't seem to help much.
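For reference, the overlay we applied for that first attempt was just the single flag, nested per the 1.6 values layout in our IstioOperator spec:

spec:
  values:
    telemetry:
      v2:
        prometheus:
          enabled: false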
These errors are not constant, but when they occur they can generate thousands of failures over a 5-minute period (the attached graph shows log entries containing “Envoy proxy is NOT ready” and “Client.Timeout exceeded while awaiting headers” over the last 12 hours, binned in 5-minute increments).
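A quick way to get roughly the same count straight from a gateway pod, without going through a logging backend (deployment name is ours), is:

$ kubectl -n istio-networking logs deploy/root-gateway --since=12h | grep -c 'Envoy proxy is NOT ready'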
We’re starting a test now to disable all of the telemetry values, something like:
spec:
  values:
    telemetry:
      enabled: false
      v2:
        enabled: false
        prometheus:
          enabled: false
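If anyone wants to try the same thing, the change can be rolled out by re-running the install with the updated IstioOperator file (filename here is just a placeholder for your own config):

$ istioctl install -f gateway-operator.yaml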
It’ll be several hours before we know how that goes. Very curious whether anyone else has seen this and found another workaround in the meantime; I haven’t been able to find much about it beyond this thread.