Hi,
I would like to share a behaviour noticed in my Istio 1.7.6 installation (by istio-operator) under EKS 1.19.
Something happened some days ago in the cluster, istio-operator went down due:
“Failed to update lock: etcdserver: request timed out”
Somehow , requests to Kubernetes api were broken, don’t know how much time, eventually the istio-operator pod rebooted and it tried to reconcile my IstioOperator CR ( I believe, from scratch ).
My issue starts precisely from here. After this reconcile cycle, Istiod is almost everytime performing a full push to envoy causing high load and sometimes I got some intermittent status between STALE and SYNCED in “istioctl proxy-status” output.
Here some logs (filtered to show full pushes) from one of istiod pods:
kubectl logs istiod-1-7-6-7578494858-bts2f -n istio-system -f | grep full=true
2021-08-04T19:18:06.335906Z info ads Push debounce stable[1] 1031: 100.088686ms since last change, 295.935684ms since last push, full=true
2021-08-04T19:18:07.889033Z info ads Push debounce stable[2] 4390: 100.07971ms since last change, 1.14924047s since last push, full=true
2021-08-04T19:18:20.862001Z info ads Push debounce stable[5] 1: 100.083817ms since last change, 100.083699ms since last push, full=true
2021-08-04T19:18:35.137909Z info ads Push debounce stable[6] 2: 100.075999ms since last change, 102.776765ms since last push, full=true
2021-08-04T19:18:52.590424Z info ads Push debounce stable[7] 2: 100.050494ms since last change, 102.124365ms since last push, full=true
2021-08-04T19:21:28.684808Z info ads Push debounce stable[15] 1: 100.097263ms since last change, 100.097063ms since last push, full=true
2021-08-04T19:21:34.314423Z info ads Push debounce stable[16] 1: 111.497025ms since last change, 111.496862ms since last push, full=true
2021-08-04T19:22:25.885228Z info ads Push debounce stable[19] 1: 100.029347ms since last change, 100.029166ms since last push, full=true
2021-08-04T19:22:26.769393Z info ads Push debounce stable[20] 1: 101.740648ms since last change, 101.740481ms since last push, full=true
2021-08-04T19:22:35.631611Z info ads Push debounce stable[21] 2: 102.105451ms since last change, 102.109257ms since last push, full=true
2021-08-04T19:27:26.412355Z info ads Push debounce stable[35] 1: 100.085907ms since last change, 100.08571ms since last push, full=true
2021-08-04T19:27:28.268435Z info ads Push debounce stable[36] 1: 106.471468ms since last change, 106.471376ms since last push, full=true
2021-08-04T19:28:35.156684Z info ads Push debounce stable[37] 2: 100.082805ms since last change, 103.613167ms since last push, full=true
2021-08-04T19:32:27.015869Z info ads Push debounce stable[44] 1: 100.088149ms since last change, 100.087973ms since last push, full=true
2021-08-04T19:32:27.954299Z info ads Push debounce stable[45] 1: 394.129987ms since last change, 394.129811ms since last push, full=true
2021-08-04T19:32:35.218435Z info ads Push debounce stable[46] 2: 102.271007ms since last change, 102.551415ms since last push, full=true
2021-08-04T19:37:27.571122Z info ads Push debounce stable[52] 1: 100.089041ms since last change, 100.088865ms since last push, full=true
2021-08-04T19:37:28.214668Z info ads Push debounce stable[53] 1: 100.867837ms since last change, 100.86774ms since last push, full=true
2021-08-04T19:38:35.273200Z info ads Push debounce stable[58] 2: 100.085911ms since last change, 103.137759ms since last push, full=true
2021-08-04T19:42:28.161449Z info ads Push debounce stable[64] 1: 100.101908ms since last change, 100.101767ms since last push, full=true
2021-08-04T19:42:28.738340Z info ads Push debounce stable[65] 1: 100.776292ms since last change, 100.776199ms since last push, full=true
2021-08-04T19:42:37.699557Z info ads Push debounce stable[66] 2: 106.605806ms since last change, 106.670744ms since last push, full=true
2021-08-04T19:47:28.697826Z info ads Push debounce stable[73] 1: 100.086331ms since last change, 100.086161ms since last push, full=true
2021-08-04T19:47:29.319834Z info ads Push debounce stable[74] 1: 100.446675ms since last change, 100.446531ms since last push, full=true
2021-08-04T19:48:35.148635Z info ads Push debounce stable[84] 2: 100.079648ms since last change, 107.233807ms since last push, full=true
2021-08-04T19:52:29.229943Z info ads Push debounce stable[86] 1: 100.092253ms since last change, 100.092077ms since last push, full=true
2021-08-04T19:52:30.380502Z info ads Push debounce stable[87] 1: 101.658736ms since last change, 101.65864ms since last push, full=true
2021-08-04T19:52:35.903714Z info ads Push debounce stable[88] 2: 491.57607ms since last change, 491.687062ms since last push, full=true
2021-08-04T19:57:30.360509Z info ads Push debounce stable[100] 1: 100.08246ms since last change, 100.082291ms since last push, full=true
2021-08-04T19:57:32.031244Z info ads Push debounce stable[101] 1: 103.108598ms since last change, 103.108492ms since last push, full=true
2021-08-04T19:58:35.297295Z info ads Push debounce stable[102] 2: 100.084336ms since last change, 109.901661ms since last push, full=true
2021-08-04T20:01:04.437796Z info ads Push debounce stable[109] 2: 100.047453ms since last change, 100.057606ms since last push, full=true
2021-08-04T20:01:15.989720Z info ads Push debounce stable[111] 2: 100.048968ms since last change, 100.052153ms since last push, full=true
2021-08-04T20:02:31.085060Z info ads Push debounce stable[112] 1: 100.084317ms since last change, 100.084159ms since last push, full=true
2021-08-04T20:02:31.672113Z info ads Push debounce stable[113] 1: 100.403458ms since last change, 100.403276ms since last push, full=true
2021-08-04T20:02:35.780111Z info ads Push debounce stable[114] 2: 116.470548ms since last change, 116.474565ms since last push, full=true
2021-08-04T20:07:31.616593Z info ads Push debounce stable[124] 1: 100.075469ms since last change, 100.075316ms since last push, full=true
2021-08-04T20:07:32.259031Z info ads Push debounce stable[125] 1: 100.422874ms since last change, 100.422705ms since last push, full=true
2021-08-04T20:08:35.159258Z info ads Push debounce stable[127] 2: 100.080791ms since last change, 100.084015ms since last push, full=true
2021-08-04T20:12:32.166435Z info ads Push debounce stable[134] 1: 100.089552ms since last change, 100.089332ms since last push, full=true
2021-08-04T20:12:32.989946Z info ads Push debounce stable[135] 1: 108.792698ms since last change, 108.792498ms since last push, full=true
2021-08-04T20:12:37.221299Z info ads Push debounce stable[136] 2: 2.171142194s since last change, 2.171145372s since last push, full=true
Additionally, got some timeouts related with Webhook Server:
kubectl logs istiod-1-7-6-7578494858-bts2f -n istio-system -f | grep "context deadline exceeded"
2021-08-04T19:18:14.120379Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:18:18.272533Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:27:32.185917Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:27:35.912628Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:28:40.945862Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:37:37.947551Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T19:37:41.238795Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021-08-04T20:02:44.136993Z warn Secure Webhook Server is not ready: Get “https://:15017/httpsReady”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I was wondering if these timeouts were related with some cpu bottleneck and so, raised cpu request for Istiod + allocated dedicated eks nodes for Istiod but the behaviour remains.
Any clues?