Upgrading worker nodes with Istio (node drain, pod disruption budgets, and Pilot)

My team is currently working on our worker node upgrade automation and we’ve hit a snag with the Istio control plane. A frequently suggested pattern for upgrading Kubernetes versions on worker nodes (sketched as commands below the list) is to:

  1. Spin up a set of worker nodes on the new Kubernetes version, doubling the capacity of the current cluster.
  2. Cordon all old nodes so no new workloads are scheduled on them.
  3. Drain each node one-by-one, causing the scheduler to start new pods on the new nodes.
  4. Delete the drained nodes.
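
For concreteness, steps 2 through 4 might look roughly like this; the nodegroup=old label is made up, so substitute whatever identifies your old node pool:

    # Step 2: cordon every old node so the scheduler stops placing pods there
    for node in $(kubectl get nodes -l nodegroup=old -o jsonpath='{.items[*].metadata.name}'); do
      kubectl cordon "$node"
    done

    # Step 3: drain the old nodes one at a time; evicted pods land on the new nodes
    # (--delete-local-data is called --delete-emptydir-data on newer kubectl)
    for node in $(kubectl get nodes -l nodegroup=old -o jsonpath='{.items[*].metadata.name}'); do
      kubectl drain "$node" --ignore-daemonsets --delete-local-data --timeout=10m
    done

    # Step 4: remove the empty nodes (or terminate the instances at your cloud provider)
    # kubectl delete node <node-name>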

However, we’re finding that this process stalls on Istio control plane components like Pilot. Pods like the Pilot pod are never evicted and rescheduled onto the new nodes; when we ask the node to drain, those pods just stay put and the drain never finishes.

Our guess is that this is happening because those Deployments run a single pod while shipping with a PodDisruptionBudget that requires at least one available pod (minAvailable: 1), so evicting the only replica would violate the budget and the eviction requests issued by the drain are refused rather than handled “smartly”. If so, it would seem like a general Kubernetes problem… but I’m still interested in whether the Istio community is aware of this behavior and has any suggested solutions.
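
To illustrate the guess: on a Helm-based install, where the Pilot budget is typically named istio-pilot in istio-system (names may differ for you), it looks something like this:

    kubectl get pdb -n istio-system
    # NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    # istio-pilot   1               N/A               0                     30d
    #
    # With a single replica and minAvailable: 1, ALLOWED DISRUPTIONS is 0, so the
    # eviction API rejects the request (HTTP 429) and kubectl drain retries forever.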

Does anyone out there have a strategy for moving components like Pilot to new worker nodes?


We solved this by increasing the number of replicas.

Out of curiosity, are you setting min replicas on the HPA to 2 “permanently” when you deploy Istio? Or do you temporarily bump the replicas up for the worker node migration and then move them back down?
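
(If it were temporary, I could imagine something along these lines, assuming the HPA created by the Helm charts is named istio-pilot; that name is an assumption:)

    # Before the migration: raise the HPA floor so the PDB can tolerate an eviction
    kubectl -n istio-system patch hpa istio-pilot -p '{"spec": {"minReplicas": 2}}'

    # ... cordon and drain the old nodes ...

    # Afterwards: drop the floor back down if one replica is really what you want
    kubectl -n istio-system patch hpa istio-pilot -p '{"spec": {"minReplicas": 1}}'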

We just set it permanently

Running just 1 replica of the control plane components is a bad idea anyway, so I’d suggest changing the minimum to 2 or 3 permanently.
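
If you install with the Helm charts, the knob is the pilot.autoscaleMin value (the exact setting may differ by Istio version or with the IstioOperator API; the release name and chart path below are placeholders):

    # Permanently set Pilot's HPA minimum at install/upgrade time
    helm upgrade istio <path-to-istio-chart> \
      --namespace istio-system \
      --set pilot.autoscaleMin=2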