Random TCP connection reset

Hello,
we are experiencing intermittently TCP connection reset on our ingress gateway.
We were able to estimate, with a script that makes continuous HTTP GET, it happens 1 in 3500 requests.
We are using Google Cloud and here is, from my understanding, the packet flow:

Client → Cloudflare → GCP Network Load Balancer (via anycast IP) → GKE Node VM → Ingress Gateway → Pod IP

The error Cloudflare reported is: context: while reading h2 header cause: connection reset
Bypassing Cloudflare the situation doesn’t change and the error on my client is: connection refused

We dumped the traffic on the ingress gateway (IP: 10.88.1.230) and we found a lot of TCP Retransmission at the exact time my request got refused. (10.13.x.x is the subnet for the Node VMs)

We read that some users experienced a similar issue on the AWS NLB due to the TCP idle timeout, but it looks like on GCP is 10 minutes. General tips for using Compute Engine  |  Compute Engine Documentation  |  Google Cloud

On the Node VM the value is 5 mins:

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
300

istio config:

apiVersion: v1
items:
- apiVersion: networking.istio.io/v1beta1
  kind: Gateway
  metadata:
    name: gateway
    namespace: istio-system
  spec:
    selector:
      istio: ingressgateway
    servers:
    - hosts:
      - '*'
      port:
        name: http
        number: 80
        protocol: HTTP
      tls:
        httpsRedirect: true
    - hosts:
      - '*'
      port:
        name: https
        number: 443
        protocol: HTTPS
      tls:
        credentialName: ingress-cert
        mode: SIMPLE
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
spec:
  gateways:
  - mesh
  - istio-system/gateway
  hosts:
  - redacted.ns.svc.cluster.local
  - example.com
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: redacted.ns.svc.cluster.local
        port:
          number: 80
        subset: v1
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
spec:
  host: redacted.ns.svc.cluster.local
  subsets:
  - labels:
      version: v1
    name: v1
    trafficPolicy:
      loadBalancer:
        simple: LEAST_REQUEST
      tls:
        mode: DISABLE

version:

GKE v1.22.12-gke.2300 and v1.24.5-gke.600
Istio 1.14.5 and 1.15.3

Do you have any idea on what is happing or any clue on how to debug it?
I can provide the tcpdump pcap file if needed for further debugging.

Thank you

I don’t actually see a RST in that capture. Can you find the stream that actually has it? If not, maybe try filter a specific stream to see how it behaves. In wireshark you can right-click one of the packets and ‘follow tcp stream’. That may shed some more light on who tore down the session first.

Is there any chance you have a NAT in between the client and your k8s cluster, and you’re exhausting its port space?

Thanks for the reply.

Here’s a single TCP stream:

It looks like is 10.13.0.25 (the VM Node) that resets the connection.

We have a NAT for outbound connection towards Internet, since our GKE nodes don’t have an external IP. It doesn’t look there are issues on the port space, here the metrics for the NAT:


The GCP Network Load Balancer uses DSR:

Responses from the backend VMs go directly to the clients, not back through the load balancer. The industry term for this is direct server return (DSR)

*Times in all screenshot are in UTC. Connection reset time: 17:34:37

Interesting! As you said, look at packet 55089: the remote end (from istio’s perspective) is what closed the connection. This was after a fair amount of data (say 600 bytes) was sent. Is that enough for one of your requests to make it through?

Also interesting is that the FIN comes about 150 milliseconds after the last packet in the stream. That is substantially longer than what appears to be the RTT between the other packets.

Is it possible that there’s a short timeout on these requests, and every once in a while the upstream from istio (or istio itself) takes too long to process it? It’s also interesting that istio has responded with some data (packet 54995). Could that be part of the data stream? Or is that just TLS handshake still going on?

It might be informative to see if you can correlate this request with the logs emitted by istio. I think istio can be told to log out the downstream port and IP, which should be sufficient to find it. Does it think the request finished, and it’s just the client that thinks it didn’t?

This was after a fair amount of data (say 600 bytes) was sent. Is that enough for one of your requests to make it through?

Yes, I think so. My HTTP response is very tiny, looking at the packet the HTTP header x-envoy-peer-metadata is taking a bit of space since it’s quite long.

Is it possible that there’s a short timeout on these requests, and every once in a while the upstream from istio (or istio itself) takes too long to process it?

Do you mean the pod who actually send the response? I’d say no, it’s a very fast and tiny response.
Resources usage is generously below the limit both for the container itself and istio-proxy.

Does it think the request finished, and it’s just the client that thinks it didn’t?

Maybe. Or maybe does the client get error the when tries to establish a new connection?
Here is the access log of the Ingress GW (downstream 10.13.0.35:12600):

jsonPayload: {
  authority: "example.com"
  bytes_received: 0
  bytes_sent: 10
  connection_termination_details: null
  downstream_local_address: "10.88.1.230:8443"
  downstream_remote_address: "10.13.0.35:12600"
  duration: 8
  method: "GET"
  path: "/"
  protocol: "HTTP/1.1"
  request_id: "a2ccbb88-bff6-4b92-86c7-ac494531aa70"
  requested_server_name: null
  response_code: 200
  response_code_details: "via_upstream"
  response_flags: "-"
  route_name: null
  start_time: "2022-11-25T17:34:37.285Z"
  upstream_cluster: "outbound|80|v1|redacted.ns.svc.cluster.local"
  upstream_host: "10.88.6.232:80"
  upstream_local_address: "10.88.1.230:35168"
  upstream_service_time: "8"
  upstream_transport_failure_reason: null
  user_agent: "Go-http-client/1.1"
  x_forwarded_for: "10.13.0.35"
}

What is interesting is that I tried to disable TLS (no HTTPS) and I can’t find the RST packet in dump, even though the situation doesn’t change on the client. (connection refused)

Thank you

EDIT: I just add one more thought, it looks like the connection refused always happen after ~9/10 minutes from the start of my script, which correspond to what Google says is their TCP idle timeout

EDIT2: I’ve also tried to setup a web server that respond very slow (after 120 seconds) and the issue doesn’t happen after 10 minutes

That log means that istio thinks it sent the response out. If the connection had been reset before it finished, it would have returned a response_code of 0, and response_flags of “DC”. Now, obviously there’s a race condition here – I imagine it logs this out before the client ACKs it, so it’s possible the connection dropped before the client could see it. But, it doesn’t look like it’s istio holding on the request or anything, or the upstream not responding at all. The duration is also only 8ms, which seems short enough.

The short duration makes me wonder about the length of time between the FIN and that ACK and FIN in the one capture you posted with the RST in it. The time scale there is out of whack. 8ms vs a couple of hundred? That doesn’t seem right.

Is your client using connection pooling? I wonder – what’s the chance it’s trying to send a request on a closed pool? Does it recycle based on the number of requests served, and at a higher rate (faster server) it exhausts its limit around the 9-10 minute mark? That doesn’t quite align with the observation that the client seems to be the one sending the FIN, though. That tells me that the client closed the connection… which is weird, because if it did, then why would your client see a connection reset error? Maybe you are on to something about an idle timeout, and something in the middle in sending a FIN to both client and server, yet still forwarding some data. Though why it’s getting an idle timeout when traffic is still flowing seems a little weird to me.

Yes, indeed I was using a connection pool. Actually I found another error in my script. :frowning:

I was not properly closing the connection, so each time a new connection was created. That might explain why after a while I got connection refused, maybe too many connections were open by the same client?

Anyway, fixed my script and now I’m not able to reproduce the issue anymore, but I still see the same errors (HTTP 520) on my Cloudflare dashboard. (generated by real clients)

I think I will open a ticket to Google to get some clues from their side.

Thank again for your help

Good find! I wonder if the problem is some state in cloudflare or even in your node that limits the number of concurrent connections per source IP (some sort of collision where the mapping collides leading to invalid state, or just a limit of some form). If you can track that on a per-client basis, it may offer a clue. I know we ruled out NAT earlier, but who knows.

Anyway, good luck with your Google ticket!

@mirobertod please update this issue when/if you receive any info from Google.

Here a few updates:

With the Google support we changed externalTrafficPolicy from Cluster to Local. More info here: Configuring TCP/UDP load balancing  |  Google Kubernetes Engine (GKE)  |  Google Cloud

That solved part of the issue, we now get much less HTTP 520 errors but some of them are still in our log and I found out it is related with the istio-ingressgateway scale-down, specifically during the Envoy draining stage. It looks like some connection are not properly terminated.

I’ve tried to enable EXIT_ON_ZERO_ACTIVE_CONNECTIONS=true and increase terminationGracefulPeriod up to 120 but in my logs I still see:

{"level":"info","time":"2023-01-13T20:23:17.207638Z","msg":"There are still 107 active connections"}
{"level":"info","time":"2023-01-13T20:23:17.222444Z","msg":"There are still 221 active connections"}
{"level":"info","time":"2023-01-13T20:23:17.228375Z","msg":"There are still 268 active connections"}
{"level":"info","time":"2023-01-13T20:23:18.208419Z","msg":"There are still 107 active connections"}
{"level":"info","time":"2023-01-13T20:23:18.222474Z","msg":"There are still 221 active connections"}
{"level":"info","time":"2023-01-13T20:23:18.226469Z","msg":"There are still 267 active connections"}

And after a while, when the terminationGracefulPeriod timeout:

{"level":"warning","time":"2023-01-13T20:23:19.217011Z","scope":"envoy config","msg":"StreamSecrets gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.217056Z","scope":"envoy config","msg":"StreamSecrets gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.217091Z","scope":"envoy config","msg":"StreamAggregatedResources gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.238007Z","scope":"envoy config","msg":"StreamSecrets gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.239532Z","scope":"envoy config","msg":"StreamSecrets gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.238057Z","scope":"envoy config","msg":"StreamSecrets gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.238096Z","scope":"envoy config","msg":"StreamAggregatedResources gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.239570Z","scope":"envoy config","msg":"StreamSecrets gRPC config stream closed: 13, "}
{"level":"warning","time":"2023-01-13T20:23:19.239661Z","scope":"envoy config","msg":"StreamAggregatedResources gRPC config stream closed: 13, "}

Any clue about that?
Thanks!

PS: this GitHub issue may be related Ingress downtime during rollouts · Issue #26524 · istio/istio · GitHub