We’re trying to use Istio 1.0.5 in our production cluster, which has ~750 services and ~8,000 pods, with only Pilot and Citadel enabled. (We would actually like to disable Citadel if possible, but our testing in a PoC cluster showed that Pilot needs it.) Despite the large number of pods, we injected the sidecar into only one service, which has 5 pods.
After three days of running, our service suddenly became inaccessible: all HTTP calls returned 5xx errors. We ran istioctl proxy-status, and it showed that the LDS and RDS information in the sidecars was stale.
We then looked at the Pilot logs, which showed many errors with this message: "Error adding/updating listener <address>: unable to read file: /etc/certs/root-cert.pem".
To restore service right away, we deleted the Citadel pod to force a restart. Several seconds after Citadel came back up, we confirmed that the pods were able to serve traffic again.
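For anyone trying to reproduce or check the same thing, the steps above correspond roughly to the commands below. This is a sketch, not a transcript of what we ran: it assumes Istio is installed in the istio-system namespace, that the Citadel pod carries the label istio=citadel, and that the sidecar container is named istio-proxy. The function names and the ISTIO_NAMESPACE and CITADEL_LABEL variables are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumed defaults; override them if your installation differs.
ISTIO_NAMESPACE="${ISTIO_NAMESPACE:-istio-system}"
CITADEL_LABEL="${CITADEL_LABEL:-istio=citadel}"

# Show the xDS sync state (CDS/LDS/EDS/RDS) that Pilot reports for each sidecar.
show_sync_status() {
  istioctl proxy-status
}

# List the certificate files mounted into a pod's sidecar.
# $1 = pod name, $2 = pod namespace (defaults to "default").
list_sidecar_certs() {
  local pod="$1"
  local ns="${2:-default}"
  kubectl -n "$ns" exec "$pod" -c istio-proxy -- ls -l /etc/certs
}

# Delete the Citadel pod so that its Deployment recreates it.
restart_citadel() {
  kubectl -n "$ISTIO_NAMESPACE" delete pod -l "$CITADEL_LABEL"
}
```

After sourcing the file, calling show_sync_status, then list_sidecar_certs with the name and namespace of one of the affected pods, then restart_citadel would walk through the same sequence. If root-cert.pem is missing from the listing, that would be consistent with the error message above.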
Because this is a production environment, we didn’t gather many logs while troubleshooting; our focus was on restoring service.
With this information, could anyone help with the following:
- How can we prevent this (or a similar error) from happening again?
- Could we completely disable Citadel?