My team is working on our worker node upgrade automation and we’ve hit a snag with the Istio control plane. A frequently suggested pattern for upgrading the Kubernetes version on worker nodes is to:
- Spin up a set of worker nodes on the new Kubernetes version, doubling the capacity of the current cluster.
- Cordon all old nodes so no new workloads are scheduled on them.
- Drain each node one-by-one, causing the scheduler to start new pods on the new nodes.
- Delete the drained nodes.
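
To make the steps above concrete, here is a rough sketch of the cordon/drain/delete part in Python. It only builds the `kubectl` command lines and prints them; it does not run anything. The node names are hypothetical, the `upgrade_commands` helper is made up for illustration, and the `--ignore-daemonsets` flag is an assumption about how drain would typically be invoked. Spinning up the new nodes is provider-specific, so it is left out.

```python
from typing import List


def upgrade_commands(old_nodes: List[str]) -> List[List[str]]:
    """Build (but do not run) the kubectl commands for the old nodes."""
    commands: List[List[str]] = []

    # Cordon every old node first so no new workloads are scheduled on them.
    for node in old_nodes:
        commands.append(["kubectl", "cordon", node])

    # Then drain and delete the old nodes one at a time.
    for node in old_nodes:
        # Assumption: DaemonSet-managed pods are skipped during the drain.
        commands.append(["kubectl", "drain", node, "--ignore-daemonsets"])
        commands.append(["kubectl", "delete", "node", node])

    return commands


# Hypothetical node names, for illustration only.
for cmd in upgrade_commands(["old-node-1", "old-node-2"]):
    print(" ".join(cmd))
```
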
However, we’re finding that this process fails for Istio control plane components like Pilot. Those pods never get evicted from the old nodes and rescheduled onto the new ones; when we drain a node, they just remain.
Our guess is that this happens because those deployments have 1 active pod and a pod disruption budget of 1, and that the drain command isn’t working “smartly” with the pod disruption budget. If so, it would seem to be a general Kubernetes problem… but I’m still interested in whether the Istio community is aware of this behavior and has any suggested solutions.
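
To spell out that guess, here is a minimal sketch of the arithmetic, assuming the budget is expressed as `minAvailable: 1` (that is our assumption; we have not confirmed it against the actual manifest). The function names are made up for illustration. The idea is that an eviction is only permitted while the number of healthy pods stays at or above the minimum, so a single replica with a minimum of 1 leaves no room for any eviction.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods may be evicted without dropping below min_available."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods: int, min_available: int) -> bool:
    """True if evicting one pod would still satisfy the budget."""
    return allowed_disruptions(healthy_pods, min_available) > 0


# Our situation as we understand it: 1 pod, budget of 1.
print("1 replica, minAvailable 1 ->", can_evict(healthy_pods=1, min_available=1))

# For comparison: a second healthy replica leaves room for one eviction.
print("2 replicas, minAvailable 1 ->", can_evict(healthy_pods=2, min_available=1))
```

If this model is right, the drain is not ignoring the budget; the budget simply never allows the one existing pod to be evicted.
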
Does anyone out there have a strategy for moving components like Pilot to new worker nodes?