Ingress gateway connection refused on rolling restart

Hi, I was successfully using Istio 1.3.3 and have tried upgrading to 1.4.4.
I am finding now that if I curl my application URL during a rolling restart of the ingress gateway deployment, there is a period of approximately 2-3 minutes where every request returns 'connection refused'.

The new ingress gateway pod starts up, and in the logs of the new pod I see:

[Envoy (Epoch 0)] [2020-02-14 19:29:27.176][17][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 14, no healthy upstream
[Envoy (Epoch 0)] [2020-02-14 19:29:27.176][17][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:54] Unable to establish new stream
2020-02-14T19:29:33.191249Z  info  Envoy proxy is ready

But no requests appear in the logs of the new pod for 2-3 minutes afterwards, during which I see connection refused errors in my shell. The old pod appears to have drained and terminated correctly:

2020-02-14T19:29:47.798373Z  info  Agent draining Proxy
2020-02-14T19:29:47.798427Z  info  Received new config, creating new Envoy epoch 1
2020-02-14T19:29:47.798436Z  info  waiting for epoch 0 to go live before performing a hot restart
2020-02-14T19:29:47.798563Z  info  watchFileEvents has successfully terminated
2020-02-14T19:29:47.798622Z  info  Watcher has successfully terminated
2020-02-14T19:29:47.798632Z  info  Status server has successfully terminated
2020-02-14T19:29:47.798691Z  error accept tcp [::]:15020: use of closed network connection
2020-02-14T19:29:47.799585Z  info  Graceful termination period is 5s, starting…
2020-02-14T19:29:47.799619Z  info  Epoch 1 starting
2020-02-14T19:29:47.799699Z  info  Envoy command: [-c /var/lib/istio/envoy/envoy_bootstrap_drain.json --restart-epoch 1 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster XXX-ingressgateway --service-node router~xx.xxx.x.XXX~XXXX.svc.cluster.local --max-obj-name-len 189 --local-address-ip-version v4 --log-format [Envoy (Epoch 1)] [%Y-%m-%d %T.%e][%t][%l][%n] %v -l warning --component-log-level misc:error]
[Envoy (Epoch 0)] [2020-02-14 19:29:47.880][19][warning][main] [external/envoy/source/server/server.cc:633] shutting down admin due to child startup
[Envoy (Epoch 0)] [2020-02-14 19:29:47.880][19][warning][main] [external/envoy/source/server/server.cc:639] terminating parent process
[Envoy (Epoch 1)] [2020-02-14 19:29:47.880][36][warning][main] [external/envoy/source/server/server.cc:354] No admin address given, so no admin HTTP server started.
2020-02-14T19:29:52.799747Z  info  Graceful termination period complete, terminating remaining proxies.
2020-02-14T19:29:52.799788Z  warn  Aborting epoch 0…
2020-02-14T19:29:52.799797Z  warn  Aborting epoch 1…
2020-02-14T19:29:52.799802Z  warn  Aborted all epochs
2020-02-14T19:29:52.799807Z  info  Agent has successfully terminated

I am using the default drain settings:

- --drainDuration
- '45s' #drainDuration
- --parentShutdownDuration
- '1m0s' #parentShutdownDuration
- --connectTimeout
- '10s' #connectTimeout
- --serviceCluster
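
For context, those args sit directly on the proxy container of the ingress gateway Deployment, so that is where I would change them if I wanted to experiment. The fragment below is just a sketch to show where they live; the container name is what I believe the default install uses, and the serviceCluster value is redacted as in the logs above, so treat the names as assumptions rather than something copied verbatim from my cluster:

# Fragment of the ingress gateway Deployment spec (names assumed).
# I am still running the default durations shown above.
spec:
  template:
    spec:
      containers:
      - name: istio-proxy
        args:
        - proxy
        - router
        - --drainDuration
        - '45s' #drainDuration
        - --parentShutdownDuration
        - '1m0s' #parentShutdownDuration
        - --connectTimeout
        - '10s' #connectTimeout
        - --serviceCluster
        - XXX-ingressgateway # redacted, matches the Envoy command in the logs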

I am using the default readiness probes:

readinessProbe:
  failureThreshold: 30
  httpGet:
    path: /healthz/ready
    port: 15020
    scheme: HTTP
  initialDelaySeconds: 1
  periodSeconds: 2
  successThreshold: 1
  timeoutSeconds: 1
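
One workaround I am considering (only a sketch using plain Kubernetes rollout settings, nothing Istio-specific, and I have not confirmed it addresses the root cause) is to make the rollout more conservative so the old gateway pod stays in rotation until the new one has been passing its readiness probe for a while. These are the fields I would add or change on the existing ingress gateway Deployment; the Deployment name and the 30-second value are assumptions for illustration:

# Hypothetical rollout tuning for the ingress gateway Deployment.
# minReadySeconds delays the new pod counting as available until it has
# been ready for 30s; maxUnavailable: 0 keeps the old pod running until
# the replacement is available.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway # name assumed from the default install
  namespace: istio-system
spec:
  minReadySeconds: 30
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  # ...rest of the existing spec unchanged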

I am quite new to Istio. Could you please point me in the right direction? I saw none of these connection refused errors during a rolling restart with 1.3.3.

Thanks


Hi there @tricky

I’m seeing the same problem intermittently during rolling updates.

Did you find a solution for this?

Regards,
Bobby