Istio Operator 1.6.8, install issues, NLB + Target groups being recreated

The primary problem I am having right now is that every time the operator pod is restarted, it reconciles and ‘ensures’ the services (whatever that is), and in the process the nodePorts are changed. That means my NLB target groups are recreated and the NLB goes down for 3 minutes.

I have two ingress gateways configured: istio-ingressgateway (the default) and an extra istio-clientgateway for a WebSocket workload. After it reconciles, you can see this in the istio-system event log:

30m         Normal    SuccessfulCreate               replicaset/istio-clientgateway-65fb884b65     Created pod: istio-clientgateway-65fb884b65-vkw29
30m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled up replica set istio-clientgateway-65fb884b65 to 1
30m         Normal    Scheduled                      pod/istio-clientgateway-65fb884b65-vkw29      Successfully assigned istio-system/istio-clientgateway-65fb884b65-vkw29 to ip-10-0-21-209.ap-southeast-2.compute.internal
30m         Normal    Started                        pod/istio-clientgateway-65fb884b65-vkw29      Started container istio-proxy
30m         Normal    Pulled                         pod/istio-clientgateway-65fb884b65-vkw29      Container image "docker.io/istio/proxyv2:1.6.8" already present on machine
30m         Normal    Created                        pod/istio-clientgateway-65fb884b65-vkw29      Created container istio-proxy
30m         Normal    Killing                        pod/istio-clientgateway-7d4954c5b5-s5vh6      Stopping container istio-proxy
30m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled down replica set istio-clientgateway-7d4954c5b5 to 0
30m         Normal    SuccessfulDelete               replicaset/istio-clientgateway-7d4954c5b5     Deleted pod: istio-clientgateway-7d4954c5b5-s5vh6
30m         Normal    EnsuringLoadBalancer           service/istio-ingressgateway                  Ensuring load balancer
30m         Normal    EnsuredLoadBalancer            service/istio-ingressgateway                  Ensured load balancer
30m         Normal    SuccessfulRescale              horizontalpodautoscaler/istio-clientgateway   New size: 2; reason: Current number of replicas below Spec.MinReplicas
30m         Normal    Scheduled                      pod/istio-clientgateway-65fb884b65-tnxp2      Successfully assigned istio-system/istio-clientgateway-65fb884b65-tnxp2 to ip-10-0-3-79.ap-southeast-2.compute.internal
30m         Normal    SuccessfulCreate               replicaset/istio-clientgateway-65fb884b65     Created pod: istio-clientgateway-65fb884b65-tnxp2
30m         Normal    Pulled                         pod/istio-clientgateway-65fb884b65-tnxp2      Container image "docker.io/istio/proxyv2:1.6.8" already present on machine
30m         Normal    Created                        pod/istio-clientgateway-65fb884b65-tnxp2      Created container istio-proxy
30m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled up replica set istio-clientgateway-65fb884b65 to 2
30m         Normal    Started                        pod/istio-clientgateway-65fb884b65-tnxp2      Started container istio-proxy
30m         Warning   Unhealthy                      pod/istio-clientgateway-65fb884b65-tnxp2      Readiness probe failed: Get http://10.0.14.3:15021/healthz/ready: dial tcp 10.0.14.3:15021: connect: connection refused
29m         Normal    SuccessfulCreate               replicaset/istio-clientgateway-7d4954c5b5     Created pod: istio-clientgateway-7d4954c5b5-vrvwb
29m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled up replica set istio-clientgateway-7d4954c5b5 to 1
29m         Normal    EnsuringLoadBalancer           service/istio-clientgateway                   Ensuring load balancer
29m         Normal    Scheduled                      pod/istio-clientgateway-7d4954c5b5-vrvwb      Successfully assigned istio-system/istio-clientgateway-7d4954c5b5-vrvwb to ip-10-0-21-209.ap-southeast-2.compute.internal
29m         Normal    Created                        pod/istio-clientgateway-7d4954c5b5-vrvwb      Created container istio-proxy
29m         Normal    Pulled                         pod/istio-clientgateway-7d4954c5b5-vrvwb      Container image "docker.io/istio/proxyv2:1.6.8" already present on machine
29m         Normal    Started                        pod/istio-clientgateway-7d4954c5b5-vrvwb      Started container istio-proxy
29m         Normal    EnsuredLoadBalancer            service/istio-clientgateway                   Ensured load balancer
29m         Normal    SuccessfulDelete               replicaset/istio-clientgateway-65fb884b65     Deleted pod: istio-clientgateway-65fb884b65-vkw29
29m         Normal    SuccessfulCreate               replicaset/istio-clientgateway-7d4954c5b5     Created pod: istio-clientgateway-7d4954c5b5-x7f8j
29m         Normal    Scheduled                      pod/istio-clientgateway-7d4954c5b5-x7f8j      Successfully assigned istio-system/istio-clientgateway-7d4954c5b5-x7f8j to ip-10-0-3-79.ap-southeast-2.compute.internal
29m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled down replica set istio-clientgateway-65fb884b65 to 1
29m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled up replica set istio-clientgateway-7d4954c5b5 to 2
29m         Normal    Killing                        pod/istio-clientgateway-65fb884b65-vkw29      Stopping container istio-proxy
29m         Normal    Created                        pod/istio-clientgateway-7d4954c5b5-x7f8j      Created container istio-proxy
29m         Normal    Pulled                         pod/istio-clientgateway-7d4954c5b5-x7f8j      Container image "docker.io/istio/proxyv2:1.6.8" already present on machine
29m         Normal    Started                        pod/istio-clientgateway-7d4954c5b5-x7f8j      Started container istio-proxy
29m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled down replica set istio-clientgateway-65fb884b65 to 0
29m         Normal    SuccessfulDelete               replicaset/istio-clientgateway-65fb884b65     Deleted pod: istio-clientgateway-65fb884b65-tnxp2
29m         Normal    Killing                        pod/istio-clientgateway-65fb884b65-tnxp2      Stopping container istio-proxy
28m         Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/istio-clientgateway   invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
28m         Warning   FailedGetResourceMetric        horizontalpodautoscaler/istio-clientgateway   unable to get metrics for resource cpu: no metrics returned from resource metrics API
28m         Normal    SuccessfulRescale              horizontalpodautoscaler/istio-clientgateway   New size: 1; reason: All metrics below target
28m         Normal    ScalingReplicaSet              deployment/istio-clientgateway                Scaled down replica set istio-clientgateway-7d4954c5b5 to 1
28m         Normal    SuccessfulDelete               replicaset/istio-clientgateway-7d4954c5b5     Deleted pod: istio-clientgateway-7d4954c5b5-x7f8j
28m         Normal    Killing                        pod/istio-clientgateway-7d4954c5b5-x7f8j      Stopping container istio-proxy

You can see a lot happening there, but the main issue is the nodePorts changing, as the kube-proxy logs show:

kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.048177621Z I0825 04:38:11.047897       1 service.go:381] Updating existing service port "istio-system/istio-ingressgateway:status-port" at 172.20.52.238:15021/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.049034820Z I0825 04:38:11.047916       1 service.go:381] Updating existing service port "istio-system/istio-ingressgateway:http2" at 172.20.52.238:80/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.049040079Z I0825 04:38:11.047927       1 service.go:381] Updating existing service port "istio-system/istio-ingressgateway:https" at 172.20.52.238:443/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.049043219Z I0825 04:38:11.047935       1 service.go:381] Updating existing service port "istio-system/istio-ingressgateway:tls" at 172.20.52.238:15443/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.106171239Z I0825 04:38:11.104902       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-ingressgateway:http2" (:32465/tcp)
kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.106191324Z I0825 04:38:11.104986       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-ingressgateway:status-port" (:31205/tcp)
kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.106195214Z I0825 04:38:11.105163       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-ingressgateway:tls" (:32428/tcp)
kube-proxy-sq767 kube-proxy 2020-08-25T04:38:11.106198584Z I0825 04:38:11.105239       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-ingressgateway:https" (:32114/tcp)
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.730072897Z I0825 04:39:07.729301       1 service.go:381] Updating existing service port "istio-system/istio-clientgateway:status-port" at 172.20.134.13:15021/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.730099512Z I0825 04:39:07.729326       1 service.go:381] Updating existing service port "istio-system/istio-clientgateway:http2" at 172.20.134.13:80/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.730103775Z I0825 04:39:07.729336       1 service.go:381] Updating existing service port "istio-system/istio-clientgateway:https" at 172.20.134.13:443/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.730107387Z I0825 04:39:07.729345       1 service.go:381] Updating existing service port "istio-system/istio-clientgateway:tls" at 172.20.134.13:15443/TCP
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.754010667Z I0825 04:39:07.753406       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-clientgateway:status-port" (:31555/tcp)
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.754052420Z I0825 04:39:07.753445       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-clientgateway:https" (:31353/tcp)
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.754057730Z I0825 04:39:07.753701       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-clientgateway:tls" (:30243/tcp)
kube-proxy-sq767 kube-proxy 2020-08-25T04:39:07.754071920Z I0825 04:39:07.753751       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-clientgateway:http2" (:31078/tcp)
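
To compare the nodePorts before and after a reconcile, the "Opened local port" lines above can be reduced to a port-name to nodePort map. This is only a minimal sketch that assumes the kube-proxy log format shown above; the function name parse_node_ports is hypothetical and the sample is trimmed from the log lines in this post.

```python
import re

# Matches kube-proxy lines such as:
#   Opened local port "nodePort for istio-system/istio-clientgateway:http2" (:31078/tcp)
NODEPORT_RE = re.compile(
    r'Opened local port "nodePort for (?P<ns>[^/]+)/(?P<svc>[^:]+):(?P<port>[^"]+)" '
    r'\(:(?P<node_port>\d+)/(?P<proto>\w+)\)'
)


def parse_node_ports(log_text):
    """Return {(namespace, service): {port_name: node_port}} from kube-proxy log text."""
    result = {}
    for match in NODEPORT_RE.finditer(log_text):
        key = (match["ns"], match["svc"])
        result.setdefault(key, {})[match["port"]] = int(match["node_port"])
    return result


# Two lines trimmed from the kube-proxy output above.
sample = '''
I0825 04:39:07.753406       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-clientgateway:status-port" (:31555/tcp)
I0825 04:39:07.753751       1 proxier.go:1609] Opened local port "nodePort for istio-system/istio-clientgateway:http2" (:31078/tcp)
'''
ports = parse_node_ports(sample)
print(ports)
```

Running it against logs captured before and after an operator restart and diffing the two maps would show which nodePorts moved.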

Another thing: I have ingress-nginx handling some other things for me, and neither it nor its service does anything like this, even on a controller restart.

So in the end, what would cause a k8s svc to do this, and what in Istio is making that change?

OK, found a solution: you can hardcode the nodePort key in the service config.


                  # Istio Gateway feature
                  ingressGateways:
                  - name: istio-clientgateway
                    enabled: true
                    k8s:
                      affinity:
                        podAntiAffinity:
                          preferredDuringSchedulingIgnoredDuringExecution:
                            - weight: 100
                              podAffinityTerm:
                                labelSelector:
                                  matchExpressions:
                                    - key: app
                                      operator: In
                                      values:
                                        - istio-clientgateway
                                topologyKey: kubernetes.io/hostname
                      env:
                        - name: ISTIO_META_ROUTER_MODE
                          value: "sni-dnat"
                        - name: ISTIO_META_USER_SDS
                          value: "true"
                      service:
                        ports:
                          - port: 15021
                            nodePort: 31555
                            targetPort: 15021
                            name: status-port
                          - port: 80
                            nodePort: 31078
                            targetPort: 8080
                            name: http2
                          - port: 443
                            nodePort: 31353
                            targetPort: 8443
                            name: https
                          - port: 15443
                            nodePort: 30243
                            targetPort: 15443
                            name: tls

I believe I'm hitting a similar issue. It seems these nodePorts are randomly assigned when the operator creates the Service object? And any time that Service object is updated, we get new nodePorts?

Are you seeing any unexpected behavior as a result of hardcoding these (any collisions with other services, for example)?

Yes, that seems to be the case.

It is early days so far (I only did it yesterday). I just reused the ports that were already assigned to the service at the time I hardcoded them.
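
Reusing the already-assigned ports can be scripted rather than copied by hand. The following is a rough sketch, assuming the Service JSON comes from something like `kubectl get svc istio-clientgateway -n istio-system -o json`; the helper name pinned_ports is hypothetical, and the inline JSON is a trimmed illustration using the port values from this thread.

```python
import json


def pinned_ports(service):
    """Build port entries with nodePort fixed to the value currently assigned."""
    entries = []
    for p in service["spec"]["ports"]:
        entries.append({
            "name": p["name"],
            "port": p["port"],
            "targetPort": p["targetPort"],
            "nodePort": p["nodePort"],  # pin whatever Kubernetes already allocated
        })
    return entries


# Trimmed, illustrative Service JSON (not a full object).
service_json = '''{"spec": {"type": "LoadBalancer", "ports": [
  {"name": "status-port", "port": 15021, "targetPort": 15021, "nodePort": 31555, "protocol": "TCP"},
  {"name": "http2", "port": 80, "targetPort": 8080, "nodePort": 31078, "protocol": "TCP"}
]}}'''

entries = pinned_ports(json.loads(service_json))
for entry in entries:
    print(entry)
```

Each printed entry corresponds to one item under service.ports in the gateway config above.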

So far so good. I have some other issues to resolve now, namely this one (https://github.com/istio/istio/issues/25642), which is also related.

I just applied your fix as well and it seems to be working. However, I plan to roll it out to several clusters, so I'm a bit concerned about setting these ports statically. I may reach out to AWS support about that and will update here if they have any meaningful advice. Using an AWS NLB, I also had to remove port 15443 from the LoadBalancer service, as it was always unhealthy.

Yeah, sure. I highly doubt AWS will help you; this (the hardcoding of ports) is a Kubernetes concern. Considering that they even let you do it, I would say they have collision detection in the code for the Service object. https://kubernetes.io/docs/concepts/services-networking/service/#nodeport
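
If the same static ports are going to several clusters, a quick pre-check of the planned values may be worthwhile before relying on the API server to reject a clash. This is a sketch only: it assumes the default NodePort range of 30000-32767 (the range is configurable on the API server, so adjust it if your cluster differs), the function name check_node_ports is hypothetical, and the port values are the ones from the kube-proxy logs in this thread.

```python
from collections import defaultdict

# Assumption: the cluster uses the Kubernetes default NodePort range, 30000-32767.
DEFAULT_RANGE = range(30000, 32768)


def check_node_ports(services, allowed=DEFAULT_RANGE):
    """Return a list of problems: out-of-range nodePorts and duplicates across services."""
    problems = []
    owners = defaultdict(list)
    for svc, ports in services.items():
        for name, node_port in ports.items():
            if node_port not in allowed:
                problems.append(
                    f"{svc}:{name} uses {node_port}, outside {allowed.start}-{allowed.stop - 1}"
                )
            owners[node_port].append(f"{svc}:{name}")
    for node_port, users in sorted(owners.items()):
        if len(users) > 1:
            problems.append(f"nodePort {node_port} requested by {', '.join(users)}")
    return problems


planned = {
    "istio-ingressgateway": {"status-port": 31205, "http2": 32465, "https": 32114, "tls": 32428},
    "istio-clientgateway": {"status-port": 31555, "http2": 31078, "https": 31353, "tls": 30243},
}
print(check_node_ports(planned))
```

This only checks the ports you list; it does not know about nodePorts already allocated to other services in the cluster.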

I did see that dead port, but I found it to be hardcoded into the Istio config: https://istio.io/v1.4/docs/reference/config/installation-options/#gateways-options

So are the nodePorts, it seems. Because we have that block declared in the config, the whole object is getting overwritten, so the Helm config is being ignored.

In my case the issue was really just the lack of consistent nodePorts, which was causing the flapping I was seeing in the NLB.

Removing the 15443 port just removed an unhealthy target from the NLB, which lets it function properly when there are not live targets in all AZs. (This may be fixed in later Istio versions; I am running 1.6.5 currently.)

It does still create an ELB during the initial Istio deploy, but subsequent deploys do not and, most importantly, no longer cause any disruptions in my NLB.

in my travels today, interestingly:

Yeah, I saw something similar here: Istio Ingress ports 31400 and 15443

I'm just serving a single certificate on host: “*”, so I just removed the unneeded SNI port from my gateway.