We are using Istio 1.5.4 and have started using a headless service to get direct pod to pod communication working with Istio mTLS. This is all working fine, but we have recently noticed that after killing one of our pods we get 503 no healthy upstream errors for a very long time afterwards (many minutes). If we go back to a ‘normal’ service we get a few 503 errors and then the problem is fixed very quickly. We have traced the communications of the envoy container using kubectl sniff and can see that connections are maintained for a long period after receiving 503’s, and even that new connections are established to the previously killed pod IP.
We have got the circuit breaker configuration on a destination rule for the service in question, and that doesn’t seem to have helped either. We have also tried setting ‘PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES’ which seemed to improve the 503 errors situation, but strangely interfered with pod to pod direct IP configuration.
Does anyone have any suggestions on why we were receiving the 503 errors or how to avoid them?