Application HTTP Keep Alive issues with Istio

We have a Node.js (8.x) application that connects to an external service on :443 with a 60s keep-alive timeout. We frequently receive "socket hang up" errors when interacting with that external service. I've been digging through Envoy and Istio GitHub issues and have tried the following, but the errors do not go away; if anything, some of these make them more common:

  1. No ServiceEntry (passthrough): least frequent ECONNRESET
  2. ServiceEntry w/ DestinationRule, http/connectionPool/idleTimeout: 60s: frequent ECONNRESET
  3. ServiceEntry w/ DestinationRule, http/connectionPool/maxRequestsPerConnection: 1: frequent ECONNRESET (seems to happen roughly every hour)
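For concreteness, variants 2 and 3 above map to a DestinationRule along these lines (the name and host are placeholders, not our real config):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-api
spec:
  host: api.example.com               # placeholder for the real FQDN
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 60s              # variant 2
        # maxRequestsPerConnection: 1 # variant 3, used instead of idleTimeout
```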

I have not tried tweaking the TCP keep-alive settings, but it is unclear to me whether TCP keepalive is even enabled (I don't see any packets flowing in tcpdump while the connection is idle). We don't have any mesh-wide tcpKeepalive settings as far as I can tell, but I have not found a way to verify the settings other than:

istioctl proxy-config cluster <pod> --fqdn '<fqdn>' -ojson

which only shows this (for variant 3):

[
    {
        "name": "outbound|443||<fqdn>",
        "type": "ORIGINAL_DST",
        "connectTimeout": "10s",
        "lbPolicy": "CLUSTER_PROVIDED",
        "maxRequestsPerConnection": 1,
        "circuitBreakers": {
            "thresholds": [
                {
                    "maxConnections": 4294967295,
                    "maxPendingRequests": 4294967295,
                    "maxRequests": 4294967295,
                    "maxRetries": 4294967295
                }
            ]
        },
        "metadata": {
            "filterMetadata": {
                "istio": {
                    "config": "/apis/networking/v1alpha3/namespaces/<ns>/destination-rule/<drName>"
                }
            }
        }
    }
]
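For completeness, per-destination TCP keepalive can apparently be configured on the DestinationRule as well; a sketch with purely illustrative values (not something I have verified in our mesh):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-api-keepalive
spec:
  host: api.example.com      # placeholder
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 30s          # idle time before the first probe
          interval: 10s      # gap between probes
          probes: 3          # unanswered probes before the socket is dropped
```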

The question is: do I turn off application-level keep-alive in Node.js and leave connection reuse to the underlying Envoy pool, or is it supposed to "just work"?

I realize that we can create a VirtualService with retryOn settings enabled, but is that required for every external service? (We already have those settings on our internal services.)
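For reference, such a retry policy for an external host would look roughly like this (host and values are placeholders):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: external-api-retries
spec:
  hosts:
  - api.example.com
  http:
  - route:
    - destination:
        host: api.example.com
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: reset,connect-failure   # retry on resets and failed connects
```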

Did you find an answer to this question? Because I observe the same problem on my service mesh.

I could not get hold of anyone from the community and was unable to pin down the source of the increased ECONNRESETs, but it was an issue with the mesh. Our ultimate solution was to exclude outbound 443 traffic from the mesh (via the traffic.sidecar.istio.io/excludeOutboundPorts annotation), and then all the problems went away (well, Node.js keep-alive is still broken in some cases, but that is better than adding more abstraction to the mix).
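For anyone else landing here, the exclusion goes on the pod template as an annotation; roughly (port list adjusted to your setup):

```yaml
# Deployment pod template snippet: bypass the sidecar for outbound :443
template:
  metadata:
    annotations:
      traffic.sidecar.istio.io/excludeOutboundPorts: "443"
```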

Using the PassthroughCluster (no ServiceEntry), there are no Istio-level controls over the connection, since it is handled as a plain TCP proxy. We did not want to create a ServiceEntry for each endpoint and let Istio originate the TLS for us, although that would probably alleviate the issue; with that setup you could also remove keep-alive from the application, since Istio would manage the connection pool.
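If someone does want to go the ServiceEntry route, the TLS-origination pattern from the Istio egress docs looks roughly like this (host is a placeholder, and the exact port wiring varies by Istio version):

```yaml
# Sketch: the app speaks plain HTTP on port 80; the sidecar originates TLS.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-api
spec:
  hosts:
  - api.example.com     # placeholder
  ports:
  - number: 80
    name: http
    protocol: HTTP
    targetPort: 443     # sidecar forwards to 443 (newer Istio releases)
  resolution: DNS
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-api-tls
spec:
  host: api.example.com
  trafficPolicy:
    tls:
      mode: SIMPLE      # sidecar performs the TLS handshake upstream
```

On older releases (such as the 1.4 we were running) the port redirect is done with a VirtualService instead of targetPort, per the egress TLS origination docs of that version.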

Our hope is that upgrading Istio from 1.4 to the latest release will fix the underlying issues, whatever they were, but we have not tried that yet.