I’m verifying Istio 1.6.3 now.
It appears to be working fine, but only certain Pods frequently emit the following log line from the istio-proxy container.
istio-proxy 2020-06-25T00:24:28.709648Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Hitting the stats endpoint directly from inside the istio-proxy container confirms that the request can indeed take well over 1s:
time curl "http://127.0.0.1:15000/stats?usedonly&filter=%5E%28server.state%7Clistener_manager.workers_started%29"
real 0m5.600s
user 0m0.000s
sys 0m0.011s
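If the 1s deadline implied by the probe error is accurate (an assumption on my part), the timeout is easy to reproduce by capping curl’s total time:
time curl --max-time 1 "http://127.0.0.1:15000/stats?usedonly&filter=%5E%28server.state%7Clistener_manager.workers_started%29"
# curl exits with code 28 (operation timed out) whenever the admin endpoint takes longer than 1s to respond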
As a further test, I removed the filter from the query above and hit the endpoint again.
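For reference, the unfiltered request is simply:
time curl "http://127.0.0.1:15000/stats?usedonly"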
We’re seeing this same issue. It appears to be correlated with timeouts experienced by external traffic.
We are running Istio 1.6.8 on AKS clusters in multiple Azure regions and are seeing these errors across all of them.
Our Kubernetes events show readiness failures from all of our ingress gateways:
$ kubectl -n istio-networking get events
LAST SEEN TYPE REASON OBJECT MESSAGE
3m11s Warning Unhealthy pod/root-gateway-6f795d6ddd-2tdb5 Readiness probe failed: Get http://10.100.3.138:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m48s Warning Unhealthy pod/root-gateway-6f795d6ddd-5gdgk Readiness probe failed: Get http://10.100.2.0:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m11s Warning Unhealthy pod/root-gateway-6f795d6ddd-g5scm Readiness probe failed: Get http://10.100.10.150:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m55s Warning Unhealthy pod/root-gateway-6f795d6ddd-q4bl7 Readiness probe failed: Get http://10.100.0.141:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m10s Warning Unhealthy pod/root-gateway-6f795d6ddd-rd7hg Readiness probe failed: Get http://10.100.4.245:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
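The kubelet probe can also be reproduced manually from inside one of those pods; for example, against the first pod from the events above (assuming curl is available in the proxy image, and substituting your own pod name):
kubectl -n istio-networking exec root-gateway-6f795d6ddd-2tdb5 -- \
  curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' \
  http://127.0.0.1:15021/healthz/ready
While the pod is flapping, the reported total time should intermittently exceed the probe’s timeout.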
Looking at the logs for one root-gateway pod, it looks like it toggles between ready and unready states fairly frequently. Sometimes it switches back and forth within a second or two; other times it stays in one state for tens of seconds.
2020-11-25T14:12:56.807999Z info Envoy proxy is ready
2020-11-25T14:13:39.806982Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:13:40.807928Z info Envoy proxy is ready
2020-11-25T14:14:55.576062Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 13,
2020-11-25T14:16:07.807146Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:09.807107Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:10.807974Z info Envoy proxy is ready
2020-11-25T14:16:49.806974Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:16:50.808131Z info Envoy proxy is ready
2020-11-25T14:26:33.451799Z info Subchannel Connectivity change to CONNECTING
2020-11-25T14:26:33.451892Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-11-25T14:26:33.451910Z info pickfirstBalancer: HandleSubConnStateChange: 0xc000dce640, {CONNECTING <nil>}
2020-11-25T14:26:33.451939Z info Channel Connectivity change to CONNECTING
2020-11-25T14:26:33.451933Z info Subchannel picks a new address "istiod.istio-system.svc:15012" to connect
2020-11-25T14:26:33.460759Z info Subchannel Connectivity change to READY
2020-11-25T14:26:33.460791Z info pickfirstBalancer: HandleSubConnStateChange: 0xc000dce640, {READY <nil>}
2020-11-25T14:26:33.460812Z info Channel Connectivity change to READY
2020-11-25T14:35:09.806868Z warn Envoy proxy is NOT ready: failed to get readiness stats: Get "http://127.0.0.1:15000/stats?usedonly&filter=^(server.state|listener_manager.workers_started)": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-11-25T14:35:11.081282Z info Envoy proxy is ready
We tried making the change suggested above to disable spec.values.telemetry.v2.prometheus.enabled, but it didn’t seem to help much.
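For anyone wanting to try the same thing, this is roughly the istioctl equivalent (a sketch; how you apply it depends on how you manage your installation):
istioctl install --set values.telemetry.v2.prometheus.enabled=false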
These errors are not constant, but when we see them they can generate thousands of failures over a 5-minute period. (This is based on a graph of log entries containing “Envoy proxy is NOT ready” and “Client.Timeout exceeded while awaiting headers” over the last 12 hours, binned in 5-minute increments.)
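If you want to pull similar counts yourself, something like the following works (assumptions: the gateway pods carry an app=root-gateway label, and pod log retention covers the window):
for pod in $(kubectl -n istio-networking get pods -l app=root-gateway -o name); do
  printf '%s: ' "$pod"
  kubectl -n istio-networking logs --since=12h "$pod" | grep -c 'Envoy proxy is NOT ready'
done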
It’ll be several hours before we know how that goes. Very curious whether anyone else has seen this and come up with another fix in the meantime; I haven’t been able to find much about it beyond this thread.
However, we did upgrade from 1.6.8 to 1.6.14, which seems to have solved our issue.
In particular, 1.6.10 included this change, which has stopped our ingress gateways from repeatedly toggling between ready and unready states.
We believe this is what made the difference for us, though we haven’t seen the errors in the Envoy logs since either, so it’s possible (even likely) that the underlying root cause was also addressed somewhere between 1.6.8 and 1.6.14.
[Edit: Turns out this didn’t solve our issue. We continue investigating.]