504 on ALB when the request lands on an EKS node other than the one where the Istio gateway is located

Bug Description

I’m using EKS (1.23) and ALB. ALB is terminating TLS with certs provided by ACM.

Using Terraform, I installed the following Helm charts in the EKS cluster:

  • istio-base
  • istiod
  • gateway

All of them are version 1.15.0.

Other things configured on cluster:

  • aws_security_group_rules, both ingress and egress, on EKS nodes for ports 15000-15090
  • required k8s namespaces
  • required k8s ingress configuring ALB via alb-controller
  • required ACM certificates for ALB
  • required Route53 DNS entries

All of this is fairly standard, so I do not think there is anything weird there. I have the same setup configured in multiple places without Istio.

I also added an httpbin Service and Deployment, plus the related Gateway and VirtualService.

In the Ingress I have 2 paths configured (besides the ssl-redirect directive for the ALB):

  • /healthz/ready points to status-port
  • / points to http2

The ingress gateway Service is of type NodePort, as required for this kind of setup.

(Important) There are 2 nodes in the cluster.

The Target Group details page in the AWS console shows that 2/2 targets are healthy.

Sooooooo …

When I open https://httpbin.somedomain.com, every second request gets a 504 Gateway Timeout. When I open https://httpbin.somedomain.com/healthz/ready, I get 200 every time. When I increase the number of nodes in the cluster to 3, 2 out of 3 requests get a 504.

It’s quite clear to me that it’s related to the ALB doing round robin over the nodes … but why? status-port always returns 200.
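The ratios fit a simple model: the ALB rotates evenly over all nodes, and only the node that hosts the gateway pod answers on the http2 path. A minimal sketch of that arithmetic follows. The function name is made up for illustration, and a single gateway pod is an assumption, since the replica count is not stated above.

```shell
# Share of requests expected to fail when the load balancer rotates evenly
# over $1 nodes and only $2 of them can actually serve the request.
# Hypothetical helper, not part of any tool.
expected_504_share() {
  nodes=$1
  serving=$2
  echo "$((nodes - serving))/$nodes"
}

expected_504_share 2 1   # prints 1/2
expected_504_share 3 1   # prints 2/3
```

Both results match what I see with 2 and 3 nodes.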


$ istioctl version
client version: 1.15.0
control plane version: 1.15.0
data plane version: 1.15.0 (3 proxies)
$ kubectl version --short
Client Version: v1.23.2
Server Version: v1.23.7-eks-4721010
$ helm version --short

Additional Information

$ istioctl bug-report

Target cluster context: v2-xxx

Running with the following config: 

istio-namespace: istio-system
full-secrets: false
timeout (mins): 30
include: {  }
exclude: { Namespaces: kube-node-lease,kube-public,kube-system,local-path-storage }
end-time: 2022-09-27 17:29:26.34498 +0200 CEST

Cluster endpoint: https://yyy.yl4.eu-west-1.eks.amazonaws.com
CLI version:
version.BuildInfo{Version:"1.15.0", GitRevision:"e3364ab424b70ca8ee1ca76cb0b3afb73476aaac", GolangVersion:"go1.19", BuildStatus:"Clean", GitTag:"1.15.0"}

The following Istio control plane revisions/versions were found in the cluster:
Revision default:
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.15.0", GitRevision:"e3364ab424b70ca8ee1ca76cb0b3afb73476aaac", GolangVersion:"go1.19", BuildStatus:"Clean", GitTag:"1.15.0"},

The following proxy revisions/versions were found in the cluster:
Revision default: Versions {1.15.0}

Fetching proxy logs for the following containers:





Fetching Istio control plane information from cluster.

Running istio analyze on all namespaces and report as below:
Analysis Report:
Info [IST0102] (Namespace argocd) The namespace is not enabled for Istio injection. Run 'kubectl label namespace argocd istio-injection=enabled' to enable it, or 'kubectl label namespace argocd istio-injection=disabled' to explicitly mark it as not needing injection.
Info [IST0102] (Namespace default) The namespace is not enabled for Istio injection. Run 'kubectl label namespace default istio-injection=enabled' to enable it, or 'kubectl label namespace default istio-injection=disabled' to explicitly mark it as not needing injection.
Info [IST0118] (Service argocd/argo-cd-argocd-applicationset-controller) Port name webhook (port: 7000, targetPort: webhook) doesn't follow the naming convention of Istio port.


Creating an archive at /Users/zzz/bug-report.tar.gz.
Cleaning up temporary files in /var/folders/l4/82mt4l7x4r5dzp1j4ppxqqzm0000gn/T/bug-report.

I solved it by allowing port 80 between the machines in the EKS node group. TBH, I do not understand why it helps.
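A likely explanation, assuming the gateway chart's default Service ports are unchanged: http2 is port 80 with targetPort 80, while status-port is port 15021 with targetPort 15021. With a NodePort Service and externalTrafficPolicy: Cluster (the Kubernetes default), a node that does not run the gateway pod forwards the request to the pod on the other node, on the pod's target port. 15021 is inside the 15000-15090 range already allowed between nodes, so health checks pass on every node. Port 80 is not, so forwarded application traffic is dropped until port 80 is allowed between nodes. The helpers below are one way this could be checked. Their names are made up, and the namespace, Service name and label selector passed to them are placeholders, not values taken from this cluster.

```shell
# Hypothetical helpers for inspecting the gateway Service.
# Arguments are placeholders; substitute the values from your own install.

# Print name, nodePort and targetPort of every port on the Service.
# $1 = namespace, $2 = Service name
gateway_ports() {
  kubectl -n "$1" get svc "$2" \
    -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.nodePort}{"\t"}{.targetPort}{"\n"}{end}'
}

# Print the Service's externalTrafficPolicy (Cluster or Local).
# $1 = namespace, $2 = Service name
gateway_traffic_policy() {
  kubectl -n "$1" get svc "$2" -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'
}

# List the gateway pods together with the node each one is scheduled on.
# $1 = namespace, $2 = label selector
gateway_pod_nodes() {
  kubectl -n "$1" get pods -l "$2" -o wide
}
```

If gateway_ports shows http2 with a targetPort outside 15000-15090, and gateway_pod_nodes shows the pod on a single node, the 504 pattern is consistent with node-to-node traffic on that target port being blocked by the security group.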