Connection hangs and unexpected 15s Idle connection timeout at gateway - Istio 1.6.12

We are using the Istio ingress gateway in front of a Docker registry (Docker/Distribution) that serves large blobs of data over long-running connections.

When using the Istio ingress gateway we have hit two issues.

  1. We often get connection hangs, which sometimes resume after a while.
  2. When this happens we have noticed what appears to be a 15s idle connection timeout at the gateway. We can recreate the 15s timeout with a simple nc -v <IP> 443.

We have not configured any timeouts on the gateway, so we would not expect the default to be 15s; the Envoy and Istio documentation indicate that the defaults should be much higher than this. We also took a config_dump of the Istio gateway and found that stream_idle_timeout was set to 0s (i.e. disabled).
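(If anyone wants to check the same thing: the dump can be pulled straight from the gateway's Envoy admin endpoint, e.g. something like kubectl -n istio-system exec <ingressgateway-pod> -c istio-proxy -- curl -s localhost:15000/config_dump, assuming the standard admin port 15000, and then grepped for stream_idle_timeout / idle_timeout; the pod name here is a placeholder.)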

We see neither of these issues when we switch back to an Nginx ingress but leave the rest of the microservices in the mesh. This, coupled with the client-side symptom being an EOF, makes me think it points to a gateway configuration issue.

We have a reliable recreate of the hangs and the EOF (we believe the EOF is caused by a hang of more than 15s): doing a skopeo copy from one location to another within the same registry.
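(Roughly, with placeholder names: skopeo copy docker://<registry-host>/project-a/image:tag docker://<registry-host>/project-b/image:tag, where both repositories sit behind the same gateway.)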

We have both the Istio and Nginx Ingress set up at the same time and have tried the following combinations:

Combinations that fail:

  • Both the pull and push going via the Istio ingress gateway
  • Pull via Istio and push via Nginx

Combinations that work reliably:

  • Pull via Nginx and push via Istio
  • Pull via Istio, push to local disk (a possible key difference is that skopeo does this one layer at a time instead of 5 layers in parallel)

Config I think will be relevant:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  creationTimestamp: "2020-05-29T14:59:57Z"
  generation: 2
  name: front-door
  namespace: istio-system

spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - <redacted>
    port:
      name: https
      number: 8443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      privateKey: /etc/istio/ingressgateway-certs/tls.key
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt

VirtualService

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  annotations:
    meta.helm.sh/release-namespace: default
  generation: 3
  labels:
    app.kubernetes.io/managed-by: Helm
  name: registry-front-door
  namespace: istio-system
spec:
  gateways:
  - front-door
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /v2
    route:
    - destination:
        host: registry-v2.default.svc.cluster.local
        port:
          number: 8080
      headers:
        request:
          set:
            ingress-type: front-door
            x-envoy-force-trace: "true"
        response:
          remove:
          - x-envoy-upstream-service-time
          - x-envoy-force-trace
          - x-server-node
          - server
          set:
            cache-control: no-cache, no-store
            docker-distribution-api-version: registry/2.0
            expires: "0"
            pragma: no-cache
            strict-transport-security: max-age=31536000; includeSubDomains
            x-registry-supports-signatures: "1"
            x-xss-protection: 1; mode=block

Destination Rule

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
  name: default
  namespace: istio-system
spec:
  host: '*.local'
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
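
For completeness, here is a sketch of the kind of explicit keepalive / idle-timeout override that could be applied to the registry host (illustrative values and a made-up name, not part of our current config):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: registry-v2-timeouts    # hypothetical name, for illustration only
  namespace: istio-system
spec:
  host: registry-v2.default.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL        # re-state mTLS, since a host-specific rule takes precedence over the '*.local' default
    connectionPool:
      tcp:
        tcpKeepalive:           # send TCP keepalives on idle connections to the registry
          time: 60s
          interval: 10s
          probes: 3
      http:
        idleTimeout: 300s       # idle timeout for connections in the upstream pool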

Hi there,

Timeouts are fairly complicated in a service mesh. I don’t recall the default behavior in 1.6. In 1.7 we don’t have an HTTP request timeout (it is disabled), nor a stream idle timeout (same as what you have). The connect timeout at the TCP layer is 10s by default. You mentioned the Istio configuration at the gateway; is it possible you are getting a timeout from the gateway to your registry service? Is that communication using HTTP or TCP?

You may want to read up on this - https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts#how-do-i-configure-timeouts - and check all the timeouts in the proxy config of your gateway and of the registry service’s proxy (e.g. using istioctl dashboard envoy). It would also be good to recreate the issue on 1.7 if possible.
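(For example, something like istioctl dashboard envoy <ingressgateway-pod>.istio-system for the gateway and istioctl proxy-config route <registry-pod>.default -o json for the registry sidecar should show the routes and whatever timeouts are actually applied; the pod names here are placeholders.)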

Finally, are you seeing this with large requests only?

HTH,

Lin

Hi Lin

I looked through the timeout settings in the Envoy config dump and, based on the links you sent me, I couldn’t see anything configured that would cause the 15s timeout.

On whether it’s only on large requests: it doesn’t have to be particularly large, we can recreate it with 1 MB objects, but it’s true we don’t see this on requests to our management API, which serves much smaller amounts of data. I suspect this is more to do with the length of the connection than the size of the request, though.

We have now also moved up to Istio 1.7.4 and are seeing the same behaviour. Whilst understanding the 15s timeout is definitely something we need to do, I am more concerned by the root cause, which is why the connections are hanging in the first place. On Istio 1.7.4 we can still consistently recreate the EOF when doing a skopeo copy of an image with 30 1 MB layers, whilst it consistently works using the Nginx ingress.

Thanks Jack

Hello, I am having this same issue with my cluster. I am using Azure AKS with simple apps behind the ingress (so no Docker registry), but it’s exactly the same behaviour, on Istio 1.18.1.
When I check the load balancer status in Azure I can see that the health status is Degraded, with only 40% availability, and this message:
Some of your load balancing endpoints may be unavailable. Please see the metrics blade for availability information and troubleshooting steps for recommended solutions.
This is causing unpredictable connection issues for client apps. Any ideas on how to fix this? Would switching to the Nginx ingress be a solution?


Hello obeyda, did you find what was causing the issue? We might also have something similar with Azure AKS. It seems like a connection hang.

Hello, it was actually a problem with probes in AKS. I had to delete the additional ports in Istio’s ingress gateway and keep only the 443 and 80 ports; after that I had 100% availability in the load balancer’s health.
So the action was just to delete the additional (unused) ports in the ingress gateway’s definition, roughly as sketched below.
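For reference, a rough sketch of what the trimmed gateway service can look like as an IstioOperator overlay (the port names and targetPorts here are the usual defaults and may need adjusting for your install):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        service:
          ports:               # expose only the ports actually in use
          - name: http2
            port: 80
            targetPort: 8080
          - name: https
            port: 443
            targetPort: 8443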

Thank you for the suggestion, but this didn’t help us. We are still facing random java.net.SocketTimeoutException: Read timed out errors on our clients. The issue goes away if we use Nginx instead of Istio.

Had a similar issue on AWS: Global Accelerator would report instances as unhealthy and CloudFront would intermittently return 504 errors. Removing all ports except 80/443 from the network load balancer seems to have resolved the issues.

Interestingly, this only happened with the network load balancer. The classic load balancer (the default on AWS EKS) seemed to have no issues with CloudFront, although it isn’t supported by Global Accelerator. Now to understand the exact reason these ports appear to have this impact…