Istiod full push every 5m

Hi,

I would like to share a behaviour noticed in my Istio 1.7.6 installation (by istio-operator) under EKS 1.19.
Something happened some days ago in the cluster, istio-operator went down due:

“Failed to update lock: etcdserver: request timed out”

Somehow , requests to Kubernetes api were broken, don’t know how much time, eventually the istio-operator pod rebooted and it tried to reconcile my IstioOperator CR ( I believe, from scratch ).

My issue starts precisely from here. After this reconcile cycle, Istiod is almost everytime performing a full push to envoy causing high load and sometimes I got some intermittent status between STALE and SYNCED in “istioctl proxy-status” output.

Here some logs (filtered to show full pushes) from one of istiod pods:

kubectl logs istiod-1-7-6-7578494858-bts2f -n istio-system -f | grep full=true
2021-08-04T19:18:06.335906Z info ads Push debounce stable[1] 1031: 100.088686ms since last change, 295.935684ms since last push, full=true
2021-08-04T19:18:07.889033Z info ads Push debounce stable[2] 4390: 100.07971ms since last change, 1.14924047s since last push, full=true
2021-08-04T19:18:20.862001Z info ads Push debounce stable[5] 1: 100.083817ms since last change, 100.083699ms since last push, full=true
2021-08-04T19:18:35.137909Z info ads Push debounce stable[6] 2: 100.075999ms since last change, 102.776765ms since last push, full=true
2021-08-04T19:18:52.590424Z info ads Push debounce stable[7] 2: 100.050494ms since last change, 102.124365ms since last push, full=true
2021-08-04T19:21:28.684808Z info ads Push debounce stable[15] 1: 100.097263ms since last change, 100.097063ms since last push, full=true
2021-08-04T19:21:34.314423Z info ads Push debounce stable[16] 1: 111.497025ms since last change, 111.496862ms since last push, full=true
2021-08-04T19:22:25.885228Z info ads Push debounce stable[19] 1: 100.029347ms since last change, 100.029166ms since last push, full=true
2021-08-04T19:22:26.769393Z info ads Push debounce stable[20] 1: 101.740648ms since last change, 101.740481ms since last push, full=true
2021-08-04T19:22:35.631611Z info ads Push debounce stable[21] 2: 102.105451ms since last change, 102.109257ms since last push, full=true
2021-08-04T19:27:26.412355Z info ads Push debounce stable[35] 1: 100.085907ms since last change, 100.08571ms since last push, full=true
2021-08-04T19:27:28.268435Z info ads Push debounce stable[36] 1: 106.471468ms since last change, 106.471376ms since last push, full=true
2021-08-04T19:28:35.156684Z info ads Push debounce stable[37] 2: 100.082805ms since last change, 103.613167ms since last push, full=true
2021-08-04T19:32:27.015869Z info ads Push debounce stable[44] 1: 100.088149ms since last change, 100.087973ms since last push, full=true
2021-08-04T19:32:27.954299Z info ads Push debounce stable[45] 1: 394.129987ms since last change, 394.129811ms since last push, full=true
2021-08-04T19:32:35.218435Z info ads Push debounce stable[46] 2: 102.271007ms since last change, 102.551415ms since last push, full=true
2021-08-04T19:37:27.571122Z info ads Push debounce stable[52] 1: 100.089041ms since last change, 100.088865ms since last push, full=true
2021-08-04T19:37:28.214668Z info ads Push debounce stable[53] 1: 100.867837ms since last change, 100.86774ms since last push, full=true
2021-08-04T19:38:35.273200Z info ads Push debounce stable[58] 2: 100.085911ms since last change, 103.137759ms since last push, full=true
2021-08-04T19:42:28.161449Z info ads Push debounce stable[64] 1: 100.101908ms since last change, 100.101767ms since last push, full=true
2021-08-04T19:42:28.738340Z info ads Push debounce stable[65] 1: 100.776292ms since last change, 100.776199ms since last push, full=true
2021-08-04T19:42:37.699557Z info ads Push debounce stable[66] 2: 106.605806ms since last change, 106.670744ms since last push, full=true
2021-08-04T19:47:28.697826Z info ads Push debounce stable[73] 1: 100.086331ms since last change, 100.086161ms since last push, full=true
2021-08-04T19:47:29.319834Z info ads Push debounce stable[74] 1: 100.446675ms since last change, 100.446531ms since last push, full=true
2021-08-04T19:48:35.148635Z info ads Push debounce stable[84] 2: 100.079648ms since last change, 107.233807ms since last push, full=true
2021-08-04T19:52:29.229943Z info ads Push debounce stable[86] 1: 100.092253ms since last change, 100.092077ms since last push, full=true
2021-08-04T19:52:30.380502Z info ads Push debounce stable[87] 1: 101.658736ms since last change, 101.65864ms since last push, full=true
2021-08-04T19:52:35.903714Z info ads Push debounce stable[88] 2: 491.57607ms since last change, 491.687062ms since last push, full=true
2021-08-04T19:57:30.360509Z info ads Push debounce stable[100] 1: 100.08246ms since last change, 100.082291ms since last push, full=true
2021-08-04T19:57:32.031244Z info ads Push debounce stable[101] 1: 103.108598ms since last change, 103.108492ms since last push, full=true
2021-08-04T19:58:35.297295Z info ads Push debounce stable[102] 2: 100.084336ms since last change, 109.901661ms since last push, full=true
2021-08-04T20:01:04.437796Z info ads Push debounce stable[109] 2: 100.047453ms since last change, 100.057606ms since last push, full=true
2021-08-04T20:01:15.989720Z info ads Push debounce stable[111] 2: 100.048968ms since last change, 100.052153ms since last push, full=true
2021-08-04T20:02:31.085060Z info ads Push debounce stable[112] 1: 100.084317ms since last change, 100.084159ms since last push, full=true
2021-08-04T20:02:31.672113Z info ads Push debounce stable[113] 1: 100.403458ms since last change, 100.403276ms since last push, full=true
2021-08-04T20:02:35.780111Z info ads Push debounce stable[114] 2: 116.470548ms since last change, 116.474565ms since last push, full=true
2021-08-04T20:07:31.616593Z info ads Push debounce stable[124] 1: 100.075469ms since last change, 100.075316ms since last push, full=true
2021-08-04T20:07:32.259031Z info ads Push debounce stable[125] 1: 100.422874ms since last change, 100.422705ms since last push, full=true
2021-08-04T20:08:35.159258Z info ads Push debounce stable[127] 2: 100.080791ms since last change, 100.084015ms since last push, full=true
2021-08-04T20:12:32.166435Z info ads Push debounce stable[134] 1: 100.089552ms since last change, 100.089332ms since last push, full=true
2021-08-04T20:12:32.989946Z info ads Push debounce stable[135] 1: 108.792698ms since last change, 108.792498ms since last push, full=true
2021-08-04T20:12:37.221299Z info ads Push debounce stable[136] 2: 2.171142194s since last change, 2.171145372s since last push, full=true

Additionally, got some timeouts related with Webhook Server:

kubectl logs istiod-1-7-6-7578494858-bts2f -n istio-system -f | grep "context deadline exceeded"
2021-08-04T19:18:14.120379Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:18:18.272533Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:27:32.185917Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:27:35.912628Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:28:40.945862Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:37:37.947551Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:37:41.238795Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T20:02:44.136993Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I was wondering if these timeouts were related with some cpu bottleneck and so, raised cpu request for Istiod + allocated dedicated eks nodes for Istiod but the behaviour remains.

Any clues?

going through troubleshoot I’m trying to find out the reason why Istiod is constantly performing full pushes. I confirmed there is no new services in my cluster. However, I chose a random pod and checked its envoy config through istioctl proxy-status <pod_name>. in order to find out some changes between envoy and istiod conf and I noticed just before full push, I got a diff:

— Istiod Clusters
+++ Envoy Clusters
@@ -603,33 +603,33
}
}
],
“combinedValidationContext”: {
“defaultValidationContext”: {
“matchSubjectAltNames”: [
{
+ “exact”: “spiffe://cluster.local/ns/data-stack/sa/stream-kafka”
+ },
+ {
+ “exact”: “spiffe://cluster.local/ns/crm/sa/default”
+ },
+ {
“exact”: “spiffe://cluster.local/ns/runtime/sa/default”
},
{
“exact”: “spiffe://cluster.local/ns/entitlement-server/sa/es-memcached”
},
{
“exact”: “spiffe://cluster.local/ns/observability/sa/kubecost-cost-analyzer”
},
{
“exact”: “spiffe://cluster.local/ns/observability/sa/default”
},
{
“exact”: “spiffe://cluster.local/ns/rsp-platform/sa/default”
- },
- {
- “exact”: “spiffe://cluster.local/ns/data-stack/sa/stream-kafka”
- },
- {
- “exact”: “spiffe://cluster.local/ns/crm/sa/default”

Still trying to understand this, it seems these changes are adding and remove the same stuff, affecting the order which they are listed

thanks for the awesome information.

What is the istio version did you tested on?

Istio version is 1.7.6

I guess it is the k8s watch failed, so when istio re-list-watch from k8s, it will receive a update event, which triggers full push.