I have a weird networking issue in my EKS clusters (I can reproduce it in two different EKS clusters terraformed the same way). It started as a seemingly random networking issue affecting some pods; I then found a reliable way to reproduce it.
My cluster has two Ingress Gateways, one private and one public, both using an NLB. The private ingress is reachable only internally, while the public one is exposed to the public Internet. The mesh has PeerAuthentication set to strict. Sidecars are injected by default in all namespaces. The ingress pods run in the same istio-ingress namespace, and each ingress has only one pod running. The cluster has at least three nodes (one per AZ). Let’s call “NodeA” the node where the private ingress runs.
I deploy an httpbin service with a VirtualService using the private gateway (with a DNS name like httpbin.int.mydomain.com). httpbin is now running on NodeB.
I run two Debian pods. DebianA runs on NodeA, while DebianB runs on another node.
Here is what happens:
- curl httpbin using its Service from DebianA works fine.
- curl httpbin using its Service from DebianB works fine.
- curl httpbin using the VirtualService from DebianA fails with: curl: (35) Recv failure: Connection reset by peer
- curl httpbin using the VirtualService from DebianB works fine.
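The four cases above can be written down as a small reproduction matrix. The following Python sketch only builds the `kubectl exec` command lines and does not run them. The pod names (`debian-a`, `debian-b`), the Service URL (namespace `default`, port 8000) and the `https` scheme for the VirtualService host are assumptions for illustration, not values taken from the actual cluster.

```python
import shlex

# Hypothetical pod names: replace with the real Debian pods.
PODS = {"DebianA": "debian-a", "DebianB": "debian-b"}

# Hypothetical URLs: the Service namespace/port and the https scheme are assumptions.
TARGETS = {
    "Service": "http://httpbin.default.svc.cluster.local:8000/get",
    "VirtualService": "https://httpbin.int.mydomain.com/get",
}


def build_commands(pods=PODS, targets=TARGETS):
    """Return one `kubectl exec` command line per (pod, target) pair."""
    commands = {}
    for label, pod in pods.items():
        for kind, url in targets.items():
            argv = [
                "kubectl", "exec", pod, "--",
                "curl", "-sS", "-o", "/dev/null",
                "-w", "%{http_code}\\n",  # print only the HTTP status code
                url,
            ]
            commands[(label, kind)] = shlex.join(argv)
    return commands


if __name__ == "__main__":
    for (label, kind), cmd in build_commands().items():
        print(f"# {label} -> {kind}")
        print(cmd)
```

Running the printed commands one by one should make it easy to confirm that only the DebianA to VirtualService case fails.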
It seems that a pod running on the same node as the private ingress cannot reach it, whereas the ingress works without problems when I try from other nodes.
I also tried running httpbin on NodeA: it is always reachable. So it seems to be an “outbound” problem only.
I found that nothing is logged on the private ingress pod (even at debug level) when trying the “DebianA to httpbin” case, while in all other cases I see the traffic being logged correctly. So I suspect the traffic never reaches the ingress in the first place.
My Istio knowledge is limited, and so far I have no further ideas on how to investigate (and solve) this issue.
Any suggestion is welcome!