Istio Sidecar consuming high CPU

We are running Istio v1.3.0 and most of our applications are running fine. However, in one of our deployments the CPU of the sidecar is pegged at 1.0 core for the life of the pod. This happens for the majority of the pods in that deployment, but not all of them. Each pod contains nginx, istio-proxy, and an application container, all of which show normal CPU usage except istio-proxy. I turned on debug logging on the sidecar, but the logs are not really interesting.
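For reference, one way to bump the sidecar to debug logging is via Envoy's admin API (a sketch; the pod name is a placeholder, and not necessarily the exact method used here):

export POD=pod-name
# set all Envoy loggers to debug through the admin endpoint
kubectl exec "$POD" -c istio-proxy -- curl -s -X POST "http://localhost:15000/logging?level=debug"
# then follow the sidecar logs
kubectl logs -f "$POD" -c istio-proxy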

 istio-proxy [2019-09-25 09:45:06.322][27][debug][client] [external/envoy/source/common/http/codec_client.cc:95] [C1145026] response complete
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:214] Called Mixer::Filter : onDestroy state: 2
 istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:199] [C1145026] response complete
 istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:237] [C1145026] moving to ready
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:228] Called Mixer::Filter : log
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [./src/envoy/http/mixer/report_data.h:144] No dynamic_metadata found for filter envoy.filters.http.rbac
 istio-proxy [2019-09-25 09:45:06.323][27][debug][filter] [src/istio/control/client_context_base.cc:138] Report attributes: attributes {
 istio-proxy   key: "connection.mtls"
 istio-proxy   value {
 istio-proxy     bool_value: false
 istio-proxy   }
 istio-proxy }
 istio-proxy attributes {
 istio-proxy   key: "context.protocol"
 istio-proxy   value {
 istio-proxy     string_value: "http"
 istio-proxy   }
 istio-proxy }
 istio-proxy attributes {
 istio-proxy   key: "context.prox
 istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:520] [C1145208] remote close
 istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:190] [C1145208] closing socket: 0
 istio-proxy [2019-09-25 09:45:06.324][27][debug][main] [external/envoy/source/server/connection_handler_impl.cc:80] [C1145208] adding to cleanup list
 istio-proxy [2019-09-25 09:45:06.324][27][debug][filter] [src/envoy/tcp/mixer/filter.cc:100] [C1138233] Called tcp filter onRead bytes: 17
 istio-proxy [2019-09-25 09:45:06.324][28][debug][router] [external/envoy/source/common/router/router.cc:1002] [C637330][S2924433164096281138] upstream headers complete: end_stream=false
 istio-proxy [2019-09-25 09:45:06.324][28][debug][filter] [src/envoy/http/mixer/filter.cc:141] Called Mixer::Filter : encodeHeaders 2
 istio-proxy [2019-09-25 09:45:06.324][28][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1435] [C637330][S2924433164096281138] encoding headers via codec (end_stream=false):

top output shows:
[top screenshot]

citadel version: 1.3.0
galley version: 1.3.0
ingressgateway version: 1.3.0
pilot version: 1.3.0
policy version: 1.3.0
sidecar-injector version: 1.3.0
telemetry version: 1.3.0

Any ideas on how I can troubleshoot further?

To eliminate possibilities, is it possible for you to turn off telemetry for this deployment, assuming you can reliably reproduce the issue there?
Have you enabled anything else, like access log servers, metrics servers, node agents, etc.?

Also, see if you can enable CPU profiling on that Envoy by calling localhost:15000/cpuprofiler on the sidecar's admin port:
https://www.envoyproxy.io/docs/envoy/v1.5.0/operations/admin#get--cpuprofiler
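For example, from kubectl exec against the istio-proxy container, something like this should toggle the profiler (a sketch; POD is a placeholder, and the profiler needs a writable output path on the container filesystem):

export POD=pod-name
# start the CPU profiler
kubectl exec "$POD" -c istio-proxy -- curl -s -X POST "http://localhost:15000/cpuprofiler?enable=y"
# let it collect samples for a while
sleep 30
# stop the CPU profiler
kubectl exec "$POD" -c istio-proxy -- curl -s -X POST "http://localhost:15000/cpuprofiler?enable=n"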

I tried to enable CPU profiling:

 http post "http://localhost:15000/cpuprofiler?enable=y"
HTTP/1.1 500 Internal Server Error
cache-control: no-cache, max-age=0
content-type: text/plain; charset=UTF-8
date: Wed, 25 Sep 2019 13:07:57 GMT
server: envoy
transfer-encoding: chunked
x-content-type-options: nosniff

failure to start the profiler

http post "http://localhost:15000/cpuprofiler?enable=n"
HTTP/1.1 200 OK
cache-control: no-cache, max-age=0
content-type: text/plain; charset=UTF-8
date: Wed, 25 Sep 2019 13:07:59 GMT
server: envoy
transfer-encoding: chunked
x-content-type-options: nosniff

OK

Could you share the config dump of the istio-proxy? It can help to diagnose the problem.
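For example (a sketch; POD is a placeholder for one of the affected pods):

export POD=pod-name
# dump the sidecar's full Envoy configuration via the admin endpoint
kubectl exec "$POD" -c istio-proxy -- curl -s http://localhost:15000/config_dump > config_dump.json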

I am seeing the same issue since upgrading to istio 1.3.0.

top - 15:41:50 up  3:09,  0 users,  load average: 3.68, 4.81, 6.03
Tasks:   4 total,   1 running,   3 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.8 us,  5.4 sy,  0.0 ni, 76.3 id,  0.0 wa,  0.0 hi,  2.5 si,  0.0 st
KiB Mem : 31901788 total, 17950592 free,  4528476 used,  9422720 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 26490280 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   22 istio-p+  20   0  193388  77520  19112 S  99.0  0.2  16:57.67 envoy
    1 istio-p+  20   0  139064  24584  19876 S   0.3  0.1   0:00.73 pilot-agent
   39 istio-p+  20   0   18480   3544   2852 S   0.0  0.0   0:00.00 bash
   49 istio-p+  20   0   36684   3160   2632 R   0.0  0.0   0:00.00 top

It’s happening only on some pods. As I write this, I have a deployment of 11 pods and, according to Prometheus metrics, istio-proxy CPU usage across them adds up to 5 cores, while my application containers use only 0.3 in total. Moreover, this service only handles ~300 ops.

istio-proxy has a CPU request of 100m and a limit of 1. Its concurrency is set to 2.

The only thing enabled is telemetry.
Telemetry CPU usage is around 70–80%, with more room to grow if needed (autoscaling is on). Also, load shedding is set to 200ms.

On one pod I noticed something: istio-proxy spiked twice to 1 core for 2–3 minutes (30 minutes apart), and in its logs I found:

[2019-09-26 15:25:05.034][32][warning][filter] [src/istio/mixerclient/report_batch.cc:109] Mixer Report failed with: UNAVAILABLE:Server is currently overloaded. Please try again.
[2019-09-26 15:48:49.642][19][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-09-26 16:00:55.820][33][warning][filter] [src/istio/mixerclient/report_batch.cc:109] Mixer Report failed with: UNAVAILABLE:Server is currently overloaded. Please try again.

The timestamps of these logs match the CPU increases mentioned above (I know the gRPC one is normal).

This behaviour is really annoying because it messes up the deployment’s HPA.

Any hints on how to debug this? Is there a fix other than disabling telemetry?

Thanks!

I was able to work around it by adding an annotation to use the 1.2.6 sidecar.
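In case it helps others, a sketch of that kind of change (the annotation name and image tag here assume the stock sidecar injector and proxyv2 image; the deployment name is a placeholder):

# pin the injected sidecar to the 1.2.6 proxy image via a pod-template annotation;
# the resulting rollout re-injects the pods with the older sidecar
kubectl patch deployment my-deployment --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/proxyImage":"docker.io/istio/proxyv2:1.2.6"}}}}}'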

This may indicate the cause is Mixer, but it may also just be a symptom – the proxy is getting heavily CPU throttled for some reason, so the call to Mixer times out.
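One quick way to check for throttling is to look at the proxy's cgroup CPU stats from inside the container (a sketch; POD is a placeholder and the exact cgroup path depends on the node's cgroup layout):

# nr_throttled / throttled_time growing over time means the container is being CFS-throttled
kubectl exec "$POD" -c istio-proxy -- cat /sys/fs/cgroup/cpu/cpu.stat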

I shared it privately

Fetching http://localhost:15000/stats and grepping out just the http_mixer_filter counters would be interesting; in particular, comparing the output of a good sidecar with that of a bad one.
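Something along these lines (POD is a placeholder):

export POD=pod-name
# dump Envoy stats and keep only the mixer filter counters
kubectl exec "$POD" -c istio-proxy -- curl -s http://localhost:15000/stats | grep http_mixer_filter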

Some more info:

CPU usage for istio-proxy over the last 15 min:

POD-6e02bcfe-b89fcf448-bxdjj: 0.012252757355555555
POD-6e02bcfe-b89fcf448-cvz8l: 0.9978715736481483
POD-6e02bcfe-b89fcf448-gq7tv: 0.010575735514814572
POD-6e02bcfe-b89fcf448-mckq2: 0.9801585787740739
POD-6e02bcfe-b89fcf448-sxlwr: 0.0170057333555557
POD-6e02bcfe-b89fcf448-zrvvc: 0.013413462899999921
POD-6e02bcfe-b89fcf448-x2j7c: 0.013887531737036288
POD-6e02bcfe-b89fcf448-xrr84: 0.9912969950643712
POD-6e02bcfe-b89fcf448-4trh5: 0.005620048394874999
POD-6e02bcfe-b89fcf448-jrfrz: 0.01072196898888891
POD-6e02bcfe-b89fcf448-lrrqh: 1.0147889508925299
POD-6e02bcfe-b89fcf448-t5t2x: 0.9776515858777742
POD-6e02bcfe-b89fcf448-tchfn: 0.9916849104733009
POD-6e02bcfe-b89fcf448-wq2xq: 0.012120385866666689
POD-6e02bcfe-b89fcf448-b9lxp: 1.0061564088592605
POD-6e02bcfe-b89fcf448-bpn6n: 0.012320147118518532
POD-6e02bcfe-b89fcf448-gqv9g: 0.01152235111250004
POD-6e02bcfe-b89fcf448-rzvn2: 0.005425631443624999
POD-6e02bcfe-b89fcf448-z979s: 0.012515214583611182

POD-6e02bcfe-b89fcf448-zrvvc: 4h uptime


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 istio-p+  20   0  139064  25616  20832 S   0.0  0.1   0:08.73 pilot-agent
   21 istio-p+  20   0  195436  79920  19144 S   0.0  0.3   6:52.33 envoy
   40 istio-p+  20   0   18480   3468   2768 S   0.0  0.0   0:00.00 bash
   50 istio-p+  20   0   36708   3232   2728 R   0.0  0.0   0:00.00 top


http_mixer_filter.total_check_cache_hit_accepts: 0
http_mixer_filter.total_check_cache_hit_denies: 0
http_mixer_filter.total_check_cache_hits: 0
http_mixer_filter.total_check_cache_misses: 0
http_mixer_filter.total_check_calls: 0
http_mixer_filter.total_quota_cache_hit_accepts: 0
http_mixer_filter.total_quota_cache_hit_denies: 0
http_mixer_filter.total_quota_cache_hits: 0
http_mixer_filter.total_quota_cache_misses: 0
http_mixer_filter.total_quota_calls: 0
http_mixer_filter.total_remote_call_cancellations: 0
http_mixer_filter.total_remote_call_other_errors: 0
http_mixer_filter.total_remote_call_retries: 0
http_mixer_filter.total_remote_call_send_errors: 0
http_mixer_filter.total_remote_call_successes: 0
http_mixer_filter.total_remote_call_timeouts: 0
http_mixer_filter.total_remote_calls: 0
http_mixer_filter.total_remote_check_accepts: 0
http_mixer_filter.total_remote_check_calls: 0
http_mixer_filter.total_remote_check_denies: 0
http_mixer_filter.total_remote_quota_accepts: 0
http_mixer_filter.total_remote_quota_calls: 0
http_mixer_filter.total_remote_quota_denies: 0
http_mixer_filter.total_remote_quota_prefetch_calls: 0
http_mixer_filter.total_remote_report_calls: 61374
http_mixer_filter.total_remote_report_other_errors: 628
http_mixer_filter.total_remote_report_send_errors: 0
http_mixer_filter.total_remote_report_successes: 60746
http_mixer_filter.total_remote_report_timeouts: 0
http_mixer_filter.total_report_calls: 579681
POD-6e02bcfe-b89fcf448-xrr84: 2h uptime


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   22 istio-p+  20   0  194416  78592  19128 S  93.3  0.2 116:57.33 envoy
    1 istio-p+  20   0  139832  24124  20324 S   0.0  0.1   0:03.79 pilot-agent
   42 istio-p+  20   0   18480   3532   2844 S   0.0  0.0   0:00.00 bash
   51 istio-p+  20   0   36708   3152   2648 R   0.0  0.0   0:00.00 top


http_mixer_filter.total_check_cache_hit_accepts: 0
http_mixer_filter.total_check_cache_hit_denies: 0
http_mixer_filter.total_check_cache_hits: 0
http_mixer_filter.total_check_cache_misses: 0
http_mixer_filter.total_check_calls: 0
http_mixer_filter.total_quota_cache_hit_accepts: 0
http_mixer_filter.total_quota_cache_hit_denies: 0
http_mixer_filter.total_quota_cache_hits: 0
http_mixer_filter.total_quota_cache_misses: 0
http_mixer_filter.total_quota_calls: 0
http_mixer_filter.total_remote_call_cancellations: 0
http_mixer_filter.total_remote_call_other_errors: 0
http_mixer_filter.total_remote_call_retries: 0
http_mixer_filter.total_remote_call_send_errors: 0
http_mixer_filter.total_remote_call_successes: 0
http_mixer_filter.total_remote_call_timeouts: 0
http_mixer_filter.total_remote_calls: 0
http_mixer_filter.total_remote_check_accepts: 0
http_mixer_filter.total_remote_check_calls: 0
http_mixer_filter.total_remote_check_denies: 0
http_mixer_filter.total_remote_quota_accepts: 0
http_mixer_filter.total_remote_quota_calls: 0
http_mixer_filter.total_remote_quota_denies: 0
http_mixer_filter.total_remote_quota_prefetch_calls: 0
http_mixer_filter.total_remote_report_calls: 25729
http_mixer_filter.total_remote_report_other_errors: 150
http_mixer_filter.total_remote_report_send_errors: 0
http_mixer_filter.total_remote_report_successes: 25579
http_mixer_filter.total_remote_report_timeouts: 0
http_mixer_filter.total_report_calls: 158434
POD-6e02bcfe-b89fcf448-b9lxp: 1h uptime


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   24 istio-p+  20   0  194416  78644  19176 S  99.3  0.2  78:43.93 envoy
    1 istio-p+  20   0  138808  24108  20132 S   0.3  0.1   0:03.19 pilot-agent
   42 istio-p+  20   0   18480   3520   2836 S   0.0  0.0   0:00.00 bash
   52 istio-p+  20   0   36708   3168   2664 R   0.0  0.0   0:00.00 top


http_mixer_filter.total_check_cache_hit_accepts: 0
http_mixer_filter.total_check_cache_hit_denies: 0
http_mixer_filter.total_check_cache_hits: 0
http_mixer_filter.total_check_cache_misses: 0
http_mixer_filter.total_check_calls: 0
http_mixer_filter.total_quota_cache_hit_accepts: 0
http_mixer_filter.total_quota_cache_hit_denies: 0
http_mixer_filter.total_quota_cache_hits: 0
http_mixer_filter.total_quota_cache_misses: 0
http_mixer_filter.total_quota_calls: 0
http_mixer_filter.total_remote_call_cancellations: 0
http_mixer_filter.total_remote_call_other_errors: 0
http_mixer_filter.total_remote_call_retries: 0
http_mixer_filter.total_remote_call_send_errors: 0
http_mixer_filter.total_remote_call_successes: 0
http_mixer_filter.total_remote_call_timeouts: 0
http_mixer_filter.total_remote_calls: 0
http_mixer_filter.total_remote_check_accepts: 0
http_mixer_filter.total_remote_check_calls: 0
http_mixer_filter.total_remote_check_denies: 0
http_mixer_filter.total_remote_quota_accepts: 0
http_mixer_filter.total_remote_quota_calls: 0
http_mixer_filter.total_remote_quota_denies: 0
http_mixer_filter.total_remote_quota_prefetch_calls: 0
http_mixer_filter.total_remote_report_calls: 20868
http_mixer_filter.total_remote_report_other_errors: 141
http_mixer_filter.total_remote_report_send_errors: 0
http_mixer_filter.total_remote_report_successes: 20727
http_mixer_filter.total_remote_report_timeouts: 0
http_mixer_filter.total_report_calls: 124507

To recap the data above: a list of pods with their istio-proxy CPU usage, followed by uptime, top output, and Mixer counters from /stats for three pods (one OK, two not OK).

So based on those stats, I see no evidence that this is related to Mixer. Some of the 1.3 sidecars are just cranking through the CPU. I think it would be really interesting to see what the Envoy threads are up to via perf top, successive pstack dumps, etc. @Yan_Xue can you take a look?
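For a first look at per-thread CPU, something like this may work from inside the container (a sketch; it assumes the image ships procps top, and perf/pstack generally have to be run on the node with the right privileges):

# batch-mode top with the thread view; Envoy worker threads typically show up as wrk:worker_N
kubectl exec "$POD" -c istio-proxy -- top -H -b -n 1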

@crhuber I took a look at the config dump and saw the http_inspector, but I don’t believe the sniffing causes the high CPU, since the inspection doesn’t cost much CPU. If you can turn off the sniffing, we can verify that.
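For what it's worth, one way to take sniffing out of the picture for a single service is to declare the protocol explicitly via Istio's port-naming convention (a sketch; the service name and port index are hypothetical):

# rename the service port to carry an explicit protocol prefix, e.g. "http-web",
# so the protocol no longer has to be sniffed for that port
kubectl patch service my-service --type=json \
  -p '[{"op":"replace","path":"/spec/ports/0/name","value":"http-web"}]'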

@scenusa could you please capture a CPU profile?

Could you give me some directions on how to capture the CPU profile?

I’ve already tried and got this:

➜ curl -X POST "127.0.0.1:15000/cpuprofiler?enable=y"
failure to start the profiler⏎

➜ curl -X POST "127.0.0.1:15000/cpuprofiler?enable=n"
OK

Am I doing something wrong?

You need to create the file /var/log/envoy/envoy.prof as root (e.g. via sudo su).

Or you can run:

sudo su
curl -X POST "127.0.0.1:15000/cpuprofiler?enable=y"
curl -X POST "127.0.0.1:15000/cpuprofiler?enable=n"

It still doesn’t work…

<kubectl exec into the pod>

istio-proxy@POD-6e02bcfe-b89fcf448-xrr84:/$ sudo su
root@POD-6e02bcfe-b89fcf448-xrr84:/# curl -X POST "127.0.0.1:15000/cpuprofiler?enable=y"
failure to start the profiler
root@POD-6e02bcfe-b89fcf448-xrr84:/# cd /var/log/
root@POD-6e02bcfe-b89fcf448-xrr84:/var/log# mkdir envoy
mkdir: cannot create directory 'envoy': Read-only file system

Hmm… Even though I can start the heap profiler, I cannot find the output file. Let me figure it out in my local dev environment.

enableCoreDump is set to false by default, so the file system is read-only. You can either

  1. do manual injection and change readOnlyRootFilesystem to false, or
  2. change enableCoreDump to true and do auto injection.

Both options require you to redeploy the pods; a sketch of option 1 follows below.
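A sketch of option 1 (assuming istioctl matches your control-plane version):

# generate the injected manifest, then flip readOnlyRootFilesystem on the istio-proxy container
istioctl kube-inject -f deployment.yaml > deployment-injected.yaml
# edit deployment-injected.yaml: set securityContext.readOnlyRootFilesystem: false for istio-proxy
kubectl apply -f deployment-injected.yaml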

Because I don’t want to mess up my production environment by changing values, I did the following to test whether I can get CPU profiling to work:

  • created another deployment
  • used istioctl kube-inject, saved the output, and changed readOnlyRootFilesystem to false
  • applied it to the cluster
  • ran kubectl exec into the pod
  • ran sudo su and created /var/log/envoy/envoy.prof
root@POD-7ccc7f55c8-q8jtj:/# curl -X POST "127.0.0.1:15000/cpuprofiler?enable=y"
failure to start the profiler

It still doesn’t work.

I did some more digging, and this issue is happening only in a few deployments (and not for all of their pods). It happens on ~44 of the ~405 pods that have istio-proxy. The pods with high istio-proxy CPU usage are connected to different Pilot pods and run on different nodes.
Moreover, I have a similar cluster (same workloads, traffic levels, and Istio version) where the same services/workloads as in the first cluster show high CPU usage too.

Can you create the folder successfully? After creating it, could you please try sudo chmod 777 /var/log/envoy, as described in this section? I guess the Envoy process doesn’t have permission to write to the folder.

# set the pod to profile
export POD=pod-name
# create a writable output dir and start the heap profiler
kubectl exec "$POD" -c istio-proxy -- sh -c 'sudo mkdir -p /var/log/envoy && sudo chmod 777 /var/log/envoy && curl -X POST -s "http://localhost:15000/heapprofiler?enable=y"'
# let it collect samples, then stop the profiler
sleep 15
kubectl exec "$POD" -c istio-proxy -- sh -c 'curl -X POST -s "http://localhost:15000/heapprofiler?enable=n"'
# copy out the profile plus the shared libraries and envoy binary needed to symbolize it
rm -rf /tmp/envoy
kubectl cp "$POD":/var/log/envoy/ /tmp/envoy -c istio-proxy
kubectl cp "$POD":/lib/x86_64-linux-gnu /tmp/envoy/lib -c istio-proxy
kubectl cp "$POD":/usr/local/bin/envoy /tmp/envoy/lib/envoy -c istio-proxy

Let me know if it doesn’t work.