gRPC connection TRANSIENT_FAILURE

Hi there,

We are using Istio as part of a gRPC microservices architecture.
We found that every 5 minutes, a gRPC connection that a client service has established to an upstream service gets closed, as shown below:

INFO: 2019/03/14 14:51:42 transport: loopyWriter.run returning. connection error: desc = “transport is closing”
INFO: 2019/03/14 14:51:42 base.baseBalancer: handle SubConn state change: 0xc0001607b0, TRANSIENT_FAILURE
INFO: 2019/03/14 14:51:44 base.baseBalancer: handle SubConn state change: 0xc0001607b0, CONNECTING
INFO: 2019/03/14 14:51:44 roundrobinPicker: newPicker called with readySCs: map
INFO: 2019/03/14 14:51:44 base.baseBalancer: handle SubConn state change: 0xc0001607b0, READY
INFO: 2019/03/14 14:51:44 roundrobinPicker: newPicker called with readySCs: map[{xxxxxxxx:50000 0 }:0xc0001607b0]
INFO: 2019/03/14 14:56:42 transport: loopyWriter.run returning. connection error: desc = “transport is closing”
INFO: 2019/03/14 14:56:42 pickfirstBalancer: HandleSubConnStateChange: 0xc000160b40, TRANSIENT_FAILURE
INFO: 2019/03/14 14:56:42 transport: loopyWriter.run returning. connection error: desc = “transport is closing”
INFO: 2019/03/14 14:56:42 base.baseBalancer: handle SubConn state change: 0xc0001607b0, TRANSIENT_FAILURE
INFO: 2019/03/14 14:56:43 pickfirstBalancer: HandleSubConnStateChange: 0xc000160b40, CONNECTING
INFO: 2019/03/14 14:56:43 pickfirstBalancer: HandleSubConnStateChange: 0xc000160b40, READY
INFO: 2019/03/14 14:56:44 base.baseBalancer: handle SubConn state change: 0xc0001607b0, CONNECTING
INFO: 2019/03/14 14:56:44 roundrobinPicker: newPicker called with readySCs: map[]
INFO: 2019/03/14 14:56:44 base.baseBalancer: handle SubConn state change: 0xc0001607b0, READY
INFO: 2019/03/14 14:56:44 roundrobinPicker: newPicker called with readySCs: map[{xxxxxxxx:50000 0 }:0xc0001607b0]
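For reference, those INFO lines come from grpc-go's internal client logging. A minimal sketch of a client set up the same way (hypothetical target address standing in for the xxxxxxxx:50000 above; grpc-go API as available around the 1.x releases we use) would be:

package main

import (
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/grpclog"
)

func main() {
	// Send gRPC-internal INFO logs (SubConn state changes, picker updates) to stderr;
	// these are the "base.baseBalancer" / "roundrobinPicker" lines quoted above.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))

	// "upstream.example:50000" is a placeholder for the real upstream service.
	conn, err := grpc.Dial(
		"upstream.example:50000",
		grpc.WithInsecure(),                  // plaintext from the app; the sidecar handles mTLS
		grpc.WithBalancerName("round_robin"), // round-robin balancer, as in the logs above
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// ... create service stubs on conn and issue RPCs as usual ...
}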

We noticed that, at the same time, the sidecar container (istio-proxy) goes through a kind of proxy reconfiguration that removes all listeners, closes the existing connections, and creates them again:

[2019-03-14 14:56:41.965][17][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:102] gRPC config stream closed: 13,
[2019-03-14 14:56:42.013][17][warning][misc] [external/envoy/source/common/protobuf/utility.cc:129] Using deprecated option ‘envoy.api.v2.Listener.use_original_dst’. This configuration will be removed from Envoy soon. Please see https://github.com/envoyproxy/envoy/blob/0286b82280d33d39b5fec7dbc6be34d2fc943c7d/DEPRECATED.md for details.

[2019-03-14 14:56:42.038][17][debug][main] [external/envoy/source/server/server.cc:143] flushing stats
[2019-03-14 14:56:42.185][17][debug][upstream] [external/envoy/source/common/upstream/original_dst_cluster.cc:163] Cleaning up stale original dst hosts.
[2019-03-14 14:56:42.185][17][debug][upstream] [external/envoy/source/common/upstream/original_dst_cluster.cc:163] Cleaning up stale
… removing listener …
[2019-03-14 14:56:42.412][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] removing listener: name=172.30.193.179_5432, hash=2680903410418478357, address=172.30.193.179:5432
[2019-03-14 14:56:42.412][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] removing listener: name=172.30.198.57_15011, hash=5351683613128278624, address=172.30.198.57:15011
… listener removal complete …
[2019-03-14 14:56:42.413][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] listener removal complete: name=172.30.243.191_443, hash=5038682578597857908, address=172.30.243.191:443
[2019-03-14 14:56:42.413][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] listener removal complete: name=172.30.168.120_14267, hash=235074651536250116, address=172.30.168.120:14267
[2019-03-14 14:56:42.413][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C80] closing data_to_write=0 type=1
[2019-03-14 14:56:42.413][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C80] closing socket: 1
[2019-03-14 14:56:42.413][23][debug][main] [external/envoy/source/server/connection_handler_impl.cc:68] [C80] adding to cleanup list
[2019-03-14 14:56:42.413][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C81] closing data_to_write=0 type=1
[2019-03-14 14:56:42.413][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C81] closing socket: 1
[2019-03-14 14:56:42.413][23][debug][main] [external/envoy/source/server/connection_handler_impl.cc:68] [C81] adding to cleanup list
[2019-03-14 14:56:42.413][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] listener removal complete: name=172.30.0.1_443, hash=16711819288184588908, address=172.30.0.1:443
[2019-03-14 14:56:42.413][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] listener removal complete: name=0.0.0.0_15014, hash=8249298906339225086, address=0.0.0.0:15014

[2019-03-14 14:56:42.415][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] listener removal complete: name=10.0.0.19_3128, hash=15935048846062630651, address=10.0.0.19:3128
[2019-03-14 14:56:42.415][17][debug][config] [external/envoy/source/server/listener_manager_impl.cc:617] listener removal complete: name=172.30.168.120_14268, hash=2359658510866416823, address=172.30.168.120:14268
[2019-03-14 14:56:42.801][17][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1307] starting async DNS resolution for zipkin.istio-system
[2019-03-14 14:56:42.802][17][debug][upstream] [external/envoy/source/common/network/dns_impl.cc:158] Setting DNS resolution timer for 5000 milliseconds
[2019-03-14 14:56:42.802][17][debug][upstream] [external/envoy/source/common/network/dns_impl.cc:158] Setting DNS resolution timer for 5000 milliseconds
[2019-03-14 14:56:42.802][17][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1307] starting async DNS resolution for istio-pilot.istio-system
[2019-03-14 14:56:42.802][17][debug][upstream] [external/envoy/source/common/network/dns_impl.cc:158] Setting DNS resolution timer for 5000 milliseconds

[2019-03-14 14:56:42.809][17][debug][upstream] [external/envoy/source/common/network/dns_impl.cc:158] Setting DNS resolution timer for 5000 milliseconds
[2019-03-14 14:56:42.810][17][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1314] async DNS resolution complete for zipkin.istio-system
[2019-03-14 14:56:42.810][17][debug][upstream] [external/envoy/source/common/network/dns_impl.cc:158] Setting DNS resolution timer for 5000 milliseconds
[2019-03-14 14:56:42.810][17][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1314] async DNS resolution complete for istio-pilot.istio-system
[2019-03-14 14:56:43.847][23][debug][main] [external/envoy/source/server/connection_handler_impl.cc:257] [C82] new connection
[2019-03-14 14:56:43.848][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:721] [C82] setting stream-level initial window size to 268435456
[2019-03-14 14:56:43.848][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:743] [C82] updating connection-level initial window size to 268435456
[2019-03-14 14:56:43.879][17][debug][admin] [external/envoy/source/server/http/admin.cc:1021] [C1][S8552626438515267892] request complete: path: /listeners
[2019-03-14 14:56:43.879][17][debug][admin] [external/envoy/source/server/http/admin.cc:1021] [C1][S14631235835349617749] request complete: path: /stats?usedonly
[2019-03-14 14:56:44.037][23][debug][main] [external/envoy/source/server/connection_handler_impl.cc:257] [C83] new connection
[2019-03-14 14:56:44.037][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:721] [C83] setting stream-level initial window size to 268435456
[2019-03-14 14:56:44.037][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:743] [C83] updating connection-level initial window size to 268435456

Is that the expected behavior?

We have tested the same apps without Istio and the connection stays established indefinitely:

INFO: 2019/03/14 16:03:18 ccResolverWrapper: sending new addresses to cc: [{xxxxxxxxx:50000 0 }]
INFO: 2019/03/14 16:03:18 base.baseBalancer: got new resolved addresses: [{xxxxxxxxx:50000 0 }]
INFO: 2019/03/14 16:03:18 base.baseBalancer: handle SubConn state change: 0xc00011a7d0, CONNECTING
INFO: 2019/03/14 16:03:19 base.baseBalancer: handle SubConn state change: 0xc00011a7d0, READY

How can we prevent the connection from being closed every ‘n’ minutes?
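In the meantime, the only client-side mitigation that comes to mind is to make RPCs wait-for-ready, so calls queue through the reconnect instead of failing; it does not prevent the close itself. A rough sketch (placeholder target, and the standard gRPC health service standing in for our real stubs):

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder target standing in for the real upstream behind the sidecar.
	conn, err := grpc.Dial("upstream.example:50000", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// WaitForReady makes the call queue while the channel is CONNECTING or
	// TRANSIENT_FAILURE (up to the deadline) instead of failing immediately,
	// so it rides out the reconnect window seen every 5 minutes.
	_, err = healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	log.Println("call finished, err =", err)
}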

Thanks,

David

Hi,

Just to add some more information: what we see is that, every 5 minutes, the connection between istio-pilot and the sidecar container is closed due to a gRPC context cancellation.

A summary of the istio-pilot logs is:

2019-03-20T15:29:22.319792Z info ads ADS: “127.0.0.1:38278” router~10.129.2.110~istio-ingressgateway-97f4b5d5d-vsngh.istio-system~istio-system.svc.cluster.local-47280 terminated rpc error: code = Canceled desc = context canceled
2019-03-20T15:29:22.319831Z info ads ADS: “127.0.0.1:38278” router~10.129.3.222~istio-egressgateway-58748db45f-5jnzv.istio-system~istio-system.svc.cluster.local-47279 terminated rpc error: code = Canceled desc = context canceled
2019-03-20T15:29:22.333196Z info ads ADS: “127.0.0.1:38234” sidecar~10.129.0.84~test-app-0.grpc-istio~grpc-istio.svc.cluster.local-47291 terminated rpc error: code = Canceled desc = context canceled

2019-03-20T15:34:22.545998Z info ads ADS: “127.0.0.1:38234” sidecar~10.129.0.84~test-app-0.grpc-istio~grpc-istio.svc.cluster.local-47300 terminated rpc error: code = Canceled desc = context canceled
2019-03-20T15:34:22.624420Z info ads ADS: “127.0.0.1:38278” router~10.129.2.110~istio-ingressgateway-97f4b5d5d-vsngh.istio-system~istio-system.svc.cluster.local-47303 terminated rpc error: code = Canceled desc = context canceled
2019-03-20T15:34:22.775363Z info ads ADS: “127.0.0.1:38278” router~10.129.3.222~istio-egressgateway-58748db45f-5jnzv.istio-system~istio-system.svc.cluster.local-47309 terminated rpc error: code = Canceled desc = context canceled

2019-03-20T15:39:22.849114Z info ads ADS: “127.0.0.1:38278” router~10.129.2.110~istio-ingressgateway-97f4b5d5d-vsngh.istio-system~istio-system.svc.cluster.local-47321 terminated rpc error: code = Canceled desc = context canceled
2019-03-20T15:39:22.953582Z info ads ADS: “127.0.0.1:38278” router~10.129.3.222~istio-egressgateway-58748db45f-5jnzv.istio-system~istio-system.svc.cluster.local-47324 terminated rpc error: code = Canceled desc = context canceled
2019-03-20T15:39:22.958638Z info ads ADS: “127.0.0.1:38234” sidecar~10.129.0.84~test-app-0.grpc-istio~grpc-istio.svc.cluster.local-47323 terminated rpc error: code = Canceled desc = context canceled

The consequence is that every istio-proxy gets reconfigured (I guess it is trying to fetch the latest proxy configuration…), and in this process the connections that the application has established through the proxy get closed, generating a “TRANSIENT_FAILURE” state on the client side…
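To double-check this on the proxy side, the sidecar's Envoy admin interface can be polled for listener churn. A rough sketch (assuming the default admin port 15000, made reachable locally e.g. with kubectl port-forward; the stat names are the standard Envoy listener_manager counters):

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumes "kubectl port-forward <pod> 15000:15000" is running against the pod,
	// so the istio-proxy (Envoy) admin endpoint is reachable on localhost.
	resp, err := http.Get("http://localhost:15000/stats")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The listener_manager.* counters (listener_added, listener_removed,
	// listener_modified, ...) should jump every ~5 minutes if Envoy is really
	// tearing all listeners down and recreating them on each push.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "listener_manager.") {
			fmt.Println(sc.Text())
		}
	}
}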

My question then is: is this behavior normal? I mean, why is the connection context between the sidecar container and istio-pilot canceled every 5 minutes?

Thanks a lot,

David

Thanks for the detailed debugging… this is not normal… could you file an issue on GitHub?
Also, which version of Istio is this? Is this behavior present in 1.1 as well?

Hi, I saw similar issues [1] [2] on GitHub already… but they seem to have been fixed. I added some comments on them as well.

I have tested on the latest 1.0.x and also on 1.1.0-rc2.

Thanks,

David

[1] https://github.com/istio/istio/issues/6981
[2] https://github.com/istio/istio/issues/10858

Just updated to Istio 1.1.2 and don't see this error anymore.