Stale EDS in ingressgateway when pilot gets killed


At our company we are running into an issue where ingressgateways fail to receive updated EDS data after one or more pilot instances get killed.

After a pilot instance stops, either because it was manually scaled in or because of an AWS spot instance termination, the remaining ingressgateways no longer receive endpoint updates.
A quick check with istioctl proxy-status reveals that EDS is stuck at “STALE (Never Acknowledged)”.
However, all other sidecars are shown as “SYNCED”.
This state persists after the pilot(s) are brought back, which confuses us, since we believed that all Envoy sidecars and gateways would reconnect to whichever pilot instances are available.
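For anyone who wants to watch for this condition automatically, here is a minimal sketch of how the check could be scripted. It parses the text printed by istioctl proxy-status and lists proxies whose EDS column starts with STALE. The column layout (a header row, with columns separated by a tab or by two or more spaces) is an assumption that may differ between Istio versions, and the sample rows and pod names below are made up for illustration.

```python
import re

# Assumption: columns are separated by a tab or by two or more spaces, so a
# value such as "STALE (Never Acknowledged)" stays together in one field.
_COLUMN_SEPARATOR = re.compile(r"\t+|\s{2,}")


def find_stale_proxies(proxy_status_output, column="EDS"):
    """Return the names of proxies whose `column` value starts with STALE."""
    lines = [ln for ln in proxy_status_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = _COLUMN_SEPARATOR.split(lines[0].strip())
    idx = header.index(column)  # raises ValueError if the column is missing
    stale = []
    for line in lines[1:]:
        fields = _COLUMN_SEPARATOR.split(line.strip())
        if len(fields) > idx and fields[idx].startswith("STALE"):
            stale.append(fields[0])  # first field is assumed to be the proxy name
    return stale


# Hypothetical sample output, not captured from a real cluster.
SAMPLE = (
    "NAME                                   CDS      LDS      EDS                          RDS      PILOT\n"
    "istio-ingressgateway-abc.istio-system  SYNCED   SYNCED   STALE (Never Acknowledged)   SYNCED   istio-pilot-xyz\n"
    "myapp-123.default                      SYNCED   SYNCED   SYNCED                       SYNCED   istio-pilot-xyz\n"
)

if __name__ == "__main__":
    print(find_stale_proxies(SAMPLE))
```

In practice the input would come from running istioctl proxy-status and capturing its standard output, for example with subprocess.run.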

A spot instance termination kills other pods as well, and since the ingressgateways are not updated to reflect this, our users get a “no healthy upstream” error.

Is this sort of behaviour expected from pilot and ingressgateway, or is it something abnormal?

We run Istio on EKS version v1.14.9-eks-502bfb with spot instances as nodes.

Any help debugging this problem would be very much appreciated.

We worked around this problem by scheduling our pilot and ingressgateway on an on-demand instance.
We were also able to somewhat reliably reproduce this problem by restarting application pods while under load.
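As a rough illustration of the on-demand scheduling workaround, the sketch below builds a patch that adds a required node affinity to a Deployment's pod template. The label key and value that identify on-demand nodes are hypothetical and depend on how the nodes in a given cluster are labelled. This is one possible way to express the constraint, not necessarily the exact configuration we used.

```python
import json


def on_demand_affinity_patch(label_key, label_value):
    """Build a Deployment patch that restricts pods to nodes carrying the given label."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "affinity": {
                        "nodeAffinity": {
                            # Hard requirement: pods are only scheduled on matching nodes.
                            "requiredDuringSchedulingIgnoredDuringExecution": {
                                "nodeSelectorTerms": [
                                    {
                                        "matchExpressions": [
                                            {
                                                "key": label_key,
                                                "operator": "In",
                                                "values": [label_value],
                                            }
                                        ]
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # "node-lifecycle" and "on-demand" are hypothetical label names.
    print(json.dumps(on_demand_affinity_patch("node-lifecycle", "on-demand"), indent=2))
```

The printed JSON could then be applied to the pilot and ingressgateway Deployments, for example with kubectl patch, once the on-demand nodes carry the chosen label.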

Still not sure what the root cause was.