Hello,
we are experiencing intermittent TCP connection resets on our ingress gateway.
Using a script that issues continuous HTTP GET requests, we estimate that it happens in roughly 1 out of every 3500 requests.
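For reference, here is a minimal sketch of the kind of probe described above. It is not our exact script: the function names (`probe`, `fetch_with_curl`), the `FETCH` override and the 10-second timeout are assumptions made for illustration. It logs a UTC timestamp for every failed request so that failures can be lined up with a packet capture.

```shell
#!/bin/sh
# Hypothetical probe: names, timeout and output format are illustrative.

# One GET with curl. A non-zero exit covers transport errors (reset,
# refused, timeout) and, because of --fail, HTTP status codes >= 400.
fetch_with_curl() {
  curl --silent --fail --max-time 10 --output /dev/null "$1"
}

# FETCH names the command used for one request; override it to dry-run.
FETCH=${FETCH:-fetch_with_curl}

# probe URL COUNT
# Sends COUNT sequential requests, logs each failure with a UTC timestamp
# on stderr, and prints "COUNT FAILED" on stdout.
probe() {
  url=$1
  count=$2
  failed=0
  i=0
  while [ "$i" -lt "$count" ]; do
    i=$((i + 1))
    if ! "$FETCH" "$url"; then
      failed=$((failed + 1))
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) request $i failed" >&2
    fi
  done
  echo "$count $failed"
}
```

Usage would be something like `probe https://example.com/ 10000`. Each curl call opens a new connection, so this exercises connection setup rather than long-lived idle connections.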
We are using Google Cloud and here is, from my understanding, the packet flow:
Client → Cloudflare → GCP Network Load Balancer (via anycast IP) → GKE Node VM → Ingress Gateway → Pod IP
The error Cloudflare reported is: context: while reading h2 header cause: connection reset
Bypassing Cloudflare, the situation doesn’t change, and the error on my client is: connection refused
We dumped the traffic on the ingress gateway (IP: 10.88.1.230) and found many TCP retransmissions at the exact time my request was refused (10.13.x.x is the subnet for the Node VMs).
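To see which side sends the resets, the capture can be summarised by source address. This is a minimal sketch, assuming `tshark` is installed and the capture is saved as `dump.pcap` (a placeholder file name); the `rst_senders` helper and the `TSHARK` override are hypothetical names.

```shell
#!/bin/sh
# TSHARK names the tshark binary; overridable for dry runs.
TSHARK=${TSHARK:-tshark}

# rst_senders PCAP
# Prints "COUNT SOURCE_IP" for every address that sent a TCP RST,
# most frequent sender first.
rst_senders() {
  "$TSHARK" -r "$1" -Y 'tcp.flags.reset == 1' -T fields -e ip.src |
    sort | uniq -c | sort -rn | awk '{print $1, $2}'
}
```

Run as `rst_senders dump.pcap`. Since the capture was taken on the gateway, RSTs with a 10.13.x.x source arrived from the node side (the client, the load balancer path or the node itself), while RSTs from 10.88.1.230 were generated by the gateway. Swapping the filter for `tcp.analysis.retransmission` lists the senders of retransmitted segments instead.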
We read that some users experienced a similar issue on the AWS NLB due to the TCP idle timeout, but on GCP the idle timeout appears to be 10 minutes (see: General tips for using Compute Engine | Compute Engine Documentation | Google Cloud).
On the Node VM the TCP keepalive time is 5 minutes:
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
300
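As far as I know, `tcp_keepalive_time` only affects sockets that have `SO_KEEPALIVE` enabled, and the first probe is followed by further probes governed by `tcp_keepalive_intvl` and `tcp_keepalive_probes`. Below is a small sketch for printing all three values and the headroom against the 10-minute timeout mentioned above; `show_keepalive`, `keepalive_margin` and the `PROC_DIR` override are hypothetical names.

```shell
#!/bin/sh
# PROC_DIR is overridable so the helper can be pointed at a test directory.
PROC_DIR=${PROC_DIR:-/proc/sys/net/ipv4}

# Prints the three kernel keepalive settings as name=value lines.
show_keepalive() {
  for name in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
    printf '%s=%s\n' "$name" "$(cat "$PROC_DIR/$name")"
  done
}

# keepalive_margin KEEPALIVE_SECONDS IDLE_TIMEOUT_SECONDS
# Prints how many seconds before the idle timeout the first probe is sent;
# a negative value means the timeout would expire first.
keepalive_margin() {
  echo $(( $2 - $1 ))
}
```

With our values, `keepalive_margin 300 600` prints 300. On the node, `ss -tno state established` should show a `timer:(keepalive,...)` entry for the sockets that actually have keepalive enabled.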
Istio config:
apiVersion: v1
items:
- apiVersion: networking.istio.io/v1beta1
  kind: Gateway
  metadata:
    name: gateway
    namespace: istio-system
  spec:
    selector:
      istio: ingressgateway
    servers:
    - hosts:
      - '*'
      port:
        name: http
        number: 80
        protocol: HTTP
      tls:
        httpsRedirect: true
    - hosts:
      - '*'
      port:
        name: https
        number: 443
        protocol: HTTPS
      tls:
        credentialName: ingress-cert
        mode: SIMPLE
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
spec:
  gateways:
  - mesh
  - istio-system/gateway
  hosts:
  - redacted.ns.svc.cluster.local
  - example.com
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: redacted.ns.svc.cluster.local
        port:
          number: 80
        subset: v1
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
spec:
  host: redacted.ns.svc.cluster.local
  subsets:
  - labels:
      version: v1
    name: v1
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    tls:
      mode: DISABLE
Versions:
GKE v1.22.12-gke.2300 and v1.24.5-gke.600
Istio 1.14.5 and 1.15.3
Do you have any idea what is happening, or any clue on how to debug it?
I can provide the tcpdump pcap file if needed for further debugging.
Thank you