I’ve been struggling with Istio migration from a “helm template + kubectl apply” installation to a more recent approach, with istioctl. Every attempt done here turned out in downtime until rolling out my deployments.
In more detail, I have a Kubernetes cluster (EKS 1.16) with Istio 1.5.10 ( helm template + kubectl apply ) and I’m trying to upgrade it via istioctl, keeping the same version ( just to keep entropy as low as possible ). The procedure was basically generating an IstioOperator yml manually ( because ‘istioctl manifest generate my_helm_values.yml’ didn’t work due multiple gateway’s configuration in my helm values ) and applying it via ‘istioctl upgrade -f my_IstioOperator.yml’.
After applying it , istiod is raised and ingressgateways are rolled out - then my deployments started to fail with 503. Istio-system namespace pods looks like this:
NAME READY STATUS RESTARTS AGE
certmanager-5654dcf46c-dsl7m 1/1 Running 0 19h
istio-citadel-6d8b987cb7-qh4gt 1/1 Running 0 19h
istio-galley-6f9ff8749c-nlw8v 1/1 Running 1 19h
istio-galley-6f9ff8749c-v6ppn 1/1 Running 2 19h
istio-ingressgateway-68db5ff958-gr2wr 1/1 Running 0 14m
istio-ingressgateway-68db5ff958-lg6cl 1/1 Running 0 14m51s
istio-internal-ingressgateway-d6674c97f-65ws4 1/1 Running 0 13m
istio-internal-ingressgateway-d6674c97f-dvpv2 1/1 Running 0 13m
istio-nodeagent-2hfnd 1/1 Running 0 19h
istio-nodeagent-2ljwd 1/1 Running 0 19h
istio-nodeagent-br5tq 1/1 Running 0 19h
istio-nodeagent-rl5dz 1/1 Running 0 18h
istio-nodeagent-xm6h6 1/1 Running 0 18h
istio-pilot-5fb56fc4b5-hptqb 2/2 Running 0 19h
istio-pilot-5fb56fc4b5-n95c7 2/2 Running 0 19h
istio-policy-b4d6c7c6c-d96pq 1/1 Running 0 19h
istio-policy-b4d6c7c6c-wm6fk 1/1 Running 0 19h
istio-sidecar-injector-784c684fcb-7wm6c 1/1 Running 0 19h
istio-sidecar-injector-784c684fcb-pw4jj 1/1 Running 0 19h
istio-telemetry-85947cdbd6-2nl7x 1/1 Running 0 19h
istiod-5c95855b6b-q8sj9 1/1 Running 0 13m42s
prometheus-6f59977d56-wwvx8 2/2 Running 0 20m
My deployments will keep returning 503 until rolling out them ( kubectl rollout restart deployment -n mydeployments ). After this everything is ok with 200 status code.
Is it downtime expected in this particular context? Is it avoidable?