Istio-ingressgateway pod downsizing causes 502 responses from load balancer (ALB)

These are the versions of the tools we are currently using:
Istio v1.4.3 (set up through the official Helm chart)
Kubernetes v1.15.7 (set up through kops)

We have set up a Kubernetes cluster on AWS using kops.
We are using the aws-alb-ingress-controller Helm chart to provision an ALB load balancer as our ingress into the cluster.

We terminate our SSL connections on the ALB using ACM

The istio-ingressgateway service is of type NodePort and exposes the traffic port (80) and the status port (15020).
It has externalTrafficPolicy set to Cluster so that all (5) nodes report as healthy to the ALB.
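
For reference, a trimmed-down sketch of that Service (port names and NodePort numbers are illustrative rather than our exact manifest):

    apiVersion: v1
    kind: Service
    metadata:
      name: istio-ingressgateway
      namespace: istio-system
    spec:
      type: NodePort
      # Cluster (the default) makes every node proxy traffic, so all 5 nodes
      # pass the ALB health check
      externalTrafficPolicy: Cluster
      selector:
        app: istio-ingressgateway
      ports:
      - name: http2
        port: 80
        targetPort: 80
        nodePort: 31380      # illustrative
      - name: status-port
        port: 15020
        targetPort: 15020
        nodePort: 31520      # illustrative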

Our ALB is configured to forward traffic to the istio-ingressgateway traffic port and to perform health checks on the status port (HTTP /healthz/ready).
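
For illustration, the Ingress we hand to the aws-alb-ingress-controller looks roughly like this (names, the certificate ARN and the health-check NodePort are placeholders, not our exact values):

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: istio-ingress
      namespace: istio-system
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
        # placeholder ARN; the real certificate comes from ACM
        alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-1:123456789012:certificate/placeholder
        alb.ingress.kubernetes.io/healthcheck-path: /healthz/ready
        # with instance targets the ALB health check hits the NodePort that
        # exposes the status port (15020); 31520 matches the sketch above
        alb.ingress.kubernetes.io/healthcheck-port: "31520"
    spec:
      rules:
      - http:
          paths:
          - path: /*
            backend:
              serviceName: istio-ingressgateway
              servicePort: 80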

The setup seems to be working fine. We see all nodes as healthy in the ALB and traffic is distributed over all backends.

However, when we apply load to the system (roughly 50 r/s, randomized) by mocking long-running requests (e.g. curl https://some-service.our.domain/wait/750, which simply blocks for 750 ms) and then scale down the istio-ingressgateway deployment itself (e.g. from 3 to 2 replicas), in-flight connections are dropped and the load balancer (ALB) returns 502 responses.

I was under the impression that the istio-ingressgateway pod would handle the SIGTERM sent to it by Kubernetes correctly: the terminating pod would stop accepting new requests and be given the grace period to finish in-flight ones before being forcefully killed with SIGKILL. However, we see that the pod is killed immediately, causing the 502 responses returned from the ALB.
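
For completeness, we have not overridden the pod's grace period, so the relevant excerpt of the gateway Deployment should look roughly like this (30s is the Kubernetes default; istio-proxy is, as far as I can tell, the container name the chart generates):

    spec:
      template:
        spec:
          # not set explicitly in our manifests, so Kubernetes applies its
          # default of 30 seconds between SIGTERM and SIGKILL
          terminationGracePeriodSeconds: 30
          containers:
          - name: istio-proxy
            image: docker.io/istio/proxyv2:1.4.3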

The istio-ingressgateway pods are configured with a readinessProbe:

    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15020
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 1

Are we missing something in our setup that would provide the expected behaviour? Or am I misunderstanding how Istio should handle this downsizing?

Thanks in advance

The drain duration is set to 5s by default. It can be customized by setting

    env:
      TERMINATION_DRAIN_DURATION_SECONDS: 30

on the ingress gateway. After increasing the drain duration, the 502 errors vanished when scaling down the istio-ingressgateway pods.

There is a bug in the Istio Helm chart which prevents setting it from the values.yaml file. A pull request to fix the issue has been created: https://github.com/istio/istio/pull/20984
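
Until that fix is released, one way to set it is to add the environment variable directly to the gateway Deployment instead of going through values.yaml. A rough excerpt (the container name istio-proxy is what the chart generates for the gateway; adjust if yours differs):

    # applied with: kubectl -n istio-system edit deployment istio-ingressgateway
    spec:
      template:
        spec:
          containers:
          - name: istio-proxy
            env:
            - name: TERMINATION_DRAIN_DURATION_SECONDS
              value: "30"   # env values must be quoted strings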