Application HTTP Keep Alive issues with Istio

We have a Node.js (8.x) application that connects to an external service on :443 with a 60s keep-alive timeout. We frequently receive "socket hang up" errors when interacting with that external service. I've been digging through Envoy and Istio GitHub issues and have tried the following, but the errors do not go away; if anything, some of these make them more common:

  1. No ServiceEntry (passthrough): least frequent ECONNRESET
  2. ServiceEntry w/ DestinationRule, http/connectionPool/idleTimeout: 60s: frequent ECONNRESET
  3. ServiceEntry w/ DestinationRule, http/connectionPool/maxRequestsPerConnection: 1: frequent ECONNRESET (seems to happen roughly every hour)
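For concreteness, variants 2 and 3 above map to a DestinationRule along these lines (the name and host are placeholders, not our real config):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-api
spec:
  host: api.example.com               # placeholder for the real FQDN
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 60s              # variant 2
        # maxRequestsPerConnection: 1 # variant 3, used instead of idleTimeout
```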

I have not tried tweaking the TCP keep-alive settings, but it is unclear to me whether TCP keepalive is even enabled (I don't see any packets flowing in tcpdump while the connection is idle). We don't have any mesh-wide tcpKeepalive settings as far as I can tell, but I have not found a way to verify the settings other than:

istioctl proxy-config cluster <pod> --fqdn '<fqdn>' -ojson

which only shows this (for variant 3):

[
    {
        "name": "outbound|443||<fqdn>",
        "type": "ORIGINAL_DST",
        "connectTimeout": "10s",
        "lbPolicy": "CLUSTER_PROVIDED",
        "maxRequestsPerConnection": 1,
        "circuitBreakers": {
            "thresholds": [
                {
                    "maxConnections": 4294967295,
                    "maxPendingRequests": 4294967295,
                    "maxRequests": 4294967295,
                    "maxRetries": 4294967295
                }
            ]
        },
        "metadata": {
            "filterMetadata": {
                "istio": {
                    "config": "/apis/networking/v1alpha3/namespaces/<ns>/destination-rule/<drName>"
                }
            }
        }
    }
]
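For completeness, per-destination TCP keepalive can apparently be configured on the DestinationRule as well; a sketch with purely illustrative values (not something I have verified in our mesh):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-api-keepalive
spec:
  host: api.example.com      # placeholder
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 30s          # idle time before the first probe
          interval: 10s      # gap between probes
          probes: 3          # unanswered probes before the socket is dropped
```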

The question is: do I turn off application-level keep-alive in Node.js and leave connection reuse to the underlying Envoy pool, or is it supposed to "just work"?

I realize that we can create a VirtualService with retryOn settings enabled, but is that required for every external service? (We already have those settings on our internal services.)
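For reference, such a retry policy for an external host would look roughly like this (host and values are placeholders):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: external-api-retries
spec:
  hosts:
  - api.example.com
  http:
  - route:
    - destination:
        host: api.example.com
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: reset,connect-failure   # retry on resets and failed connects
```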

Did you find an answer to this question? Because I observe the same problem on my service mesh.

I could not get hold of anyone from the community and was unable to pin down the source of the increased ECONNRESETs, but it was an issue with the mesh. Our ultimate solution was to exclude outbound 443 traffic from the mesh (via the traffic.sidecar.istio.io/excludeOutboundPorts annotation), and then all the problems went away (well, Node.js keep-alive is still broken in some cases, but that is better than adding more abstraction to the mix).
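For anyone else landing here, the exclusion goes on the pod template as an annotation; roughly (port list adjusted to your setup):

```yaml
# Deployment pod template snippet: bypass the sidecar for outbound :443
template:
  metadata:
    annotations:
      traffic.sidecar.istio.io/excludeOutboundPorts: "443"
```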

Using the PassthroughCluster (no ServiceEntry), there are no Istio-level controls over the connection, since it is handled as a plain TCP proxy. We did not want to create a ServiceEntry for each endpoint and let Istio originate the TLS for us, although that would probably alleviate the issue; with that setup you could also remove keep-alive from the application, since Istio would manage the connection pool.
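If someone does want to go the ServiceEntry route, the TLS-origination pattern from the Istio egress docs looks roughly like this (host is a placeholder, and the exact port wiring varies by Istio version):

```yaml
# Sketch: the app speaks plain HTTP on port 80; the sidecar originates TLS.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-api
spec:
  hosts:
  - api.example.com     # placeholder
  ports:
  - number: 80
    name: http
    protocol: HTTP
    targetPort: 443     # sidecar forwards to 443 (newer Istio releases)
  resolution: DNS
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-api-tls
spec:
  host: api.example.com
  trafficPolicy:
    tls:
      mode: SIMPLE      # sidecar performs the TLS handshake upstream
```

On older releases (such as the 1.4 we were running) the port redirect is done with a VirtualService instead of targetPort, per the egress TLS origination docs of that version.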

Our hope is that upgrading Istio from 1.4 to the latest release will fix the underlying issues, whatever they were, but we have not tried that yet.