Istio telemetry overhead and MixerV2 info


Currently, for a service mesh with 100k+ mesh-wide requests (Istio 1.1.x with policy disabled and mTLS off), Mixer uses up to 80 cores (with only 2 adapters running: kubernetesenv and prometheus). This adds sizeable overhead to any cluster and increases operational costs.

I have tried stripping down metrics, rules and adapters (promtcpconnectionclosed, promtcpconnectionopen, tcpkubeattrgenrulerule, promtcp, promhttp), but the load only started to decrease when I deleted the kubeattrgenrulerule rule (the kubernetesenv adapter had the most dispatches: 100k+).

I have also tried different resource requests for the Mixer pods, with the same results.

Are there any plans to decrease this usage in the upcoming Istio releases (1.2, 1.3)? I read that MixerV2 is coming, but I couldn’t find any docs or info about it. As far as I understand, it will be implemented at the Envoy level. Will this change the CPU usage of istio-telemetry?

Thank you!

@scenusa the effort to re-architect the policy and telemetry subsystems, loosely called MixerV2, will completely eliminate the istio-telemetry service. In that sense, it will eliminate the CPU usage of istio-telemetry.

Instead of a distinct service, we will deliver tailored extensions to the proxies to achieve the same results. This should significantly reduce the CPU cost of telemetry (early prototyping and development efforts have shown promising results).

One of the first things we are doing along these lines is pushing all the environmental metadata into the proxies themselves (eliminating the need for the kubernetesenv adapter altogether). Please see my post with a design proposal for that functionality: Feedback requested: Adding k8s environmental metadata to envoy.

There will be other steps following that, including some work to allow for metadata exchange between proxies and the release of the actual extensions themselves.

You can attend next week’s P&T WG meeting if you would like to discuss the plans more directly.

cc: @mandarjog @kuat

Thank you for the clarifications. I have joined istio-team-drive and read about MixerV2. I am looking forward to testing it when it is available.

@douglas-reid - Is there a plan to fix the performance issues in Mixer v1, especially with respect to telemetry? Or is the recommendation to wait for Mixer v2?

@pnambiar is there a specific characteristic of telemetry in Mixer (v1) that you are concerned about, or would like addressed? Is it the overall CPU usage? Do you have a particular set of adapters (perhaps non-default) that you are using (that may be contributing to perceived issues)?

At this stage, there is not much work being devoted to optimizing mixer (as far as I am aware), especially for telemetry collection and processing. There are some abstract ideas for improvements to mixer to reduce overall resource consumption floating around (removing the proxy in front, changing the deployment strategy, etc.) but I don’t think there is anyone with sufficient motivation to tackle them, given the energy (and priority) being spent on the vNext telemetry collection implementation (and the high cost in maintaining feature parity, testing, etc.).

The reason I ask about your specific adapter usage is that our experience with mixer is that a few of the discovered performance issues were attributable to third-party libraries used in adapters (as well as to design of the internal adapter APIs which led to poor implementation of adapters themselves). It is possible that you may be using an adapter with known scalability issues (which we have done an admittedly poor job of vetting and documenting).

All that being said, if there is something that is an immediate need for you, we may be able to recommend mitigation approaches (depending on the ask).

If, however, the idea is more along the lines of “I wish this were better”, we 100% agree (this is the reason behind the Mixer v2 designs) but unfortunately can only recommend waiting at this point.

We have successful prototypes of the basics done; work is now on testing and solidifying those bits. Our aim is to have something usable (alpha) out in the August timeframe (roughly).

Hope that helps.

@douglas-reid - Thanks. I was thinking of the P99 latency caused by telemetry being published in the data path. Is there a plan to fix that in Mixer v1? This is based on a discussion at the recent KubeCon around the Mixer topic.

I was not at KubeCon, so unfortunately I cannot speak to any discussions that took place there.

There is not any ongoing work to address the nature of the Mixer v1 API or the client interface to that API.

There was, however, some recent work to add knobs to control batch sizes for reporting, which should allow you to adjust the tradeoffs between CPU and latency overhead of Report calls to Mixer. I would not expect miracles when that is exposed, though – the current default batch size was selected through some experimentation.
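To make the batching tradeoff a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model only, not the Mixer client's actual logic: it assumes a proxy sends one Report call each time a fixed number of requests has been buffered, ignores any timer-based flush, and only estimates how many Report calls are made and how long a batch takes to fill. The function and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatchTradeoff:
    batch_size: int
    report_calls_per_sec: float  # Report calls one proxy would send per second
    fill_time_sec: float         # time needed to buffer one full batch


def batch_tradeoff(requests_per_sec: float, batch_size: int) -> BatchTradeoff:
    """Toy model: one Report call is sent once `batch_size` requests are buffered."""
    if requests_per_sec <= 0 or batch_size <= 0:
        raise ValueError("requests_per_sec and batch_size must be positive")
    return BatchTradeoff(
        batch_size=batch_size,
        report_calls_per_sec=requests_per_sec / batch_size,
        fill_time_sec=batch_size / requests_per_sec,
    )


if __name__ == "__main__":
    # Hypothetical proxy handling 500 requests per second.
    for size in (10, 100, 1000):
        t = batch_tradeoff(requests_per_sec=500.0, batch_size=size)
        print(
            f"batch={t.batch_size:5d}  "
            f"calls/s={t.report_calls_per_sec:8.2f}  "
            f"fill time={t.fill_time_sec:6.3f}s"
        )
```

Larger batches mean fewer calls (and less per-call overhead), at the price of more data held back per flush. Where the sweet spot sits for a real workload is something only measurement will tell you.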

If the p99 latency overhead is a blocker, you may consider going Mixer-less entirely – and using newly exposed options to precisely control proxy telemetry. This will lead to some loss in functionality, but may provide enough to satisfy your requirements in the meantime. I’ve been meaning to provide some examples of ways to manipulate prom scrape config to drive a portion of the existing dashboards just from the proxy metrics as a proof of concept for such an operation. I just need to find some spare time :).
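As a rough illustration of what working "just from the proxy metrics" could look like, here is a small Python sketch that sums a counter across several proxies, given their metrics in Prometheus text format. This is not the promised scrape-config example. The sample lines below are invented for illustration, and the metric and label names (`envoy_cluster_upstream_rq_total`, `envoy_cluster_name`) are assumptions you should check against what your own proxies actually expose.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Matches lines such as: metric_name{label="value",...} 123 [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(scrapes: Iterable[str], metric: str, label: str) -> Dict[str, float]:
    """Sum `metric` over all scrapes, grouped by the value of `label`."""
    totals: Dict[str, float] = defaultdict(float)
    for text in scrapes:
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and HELP/TYPE comments
            m = LINE_RE.match(line)
            if m is None or m.group("name") != metric:
                continue
            labels = dict(LABEL_RE.findall(m.group("labels")))
            if label in labels:
                totals[labels[label]] += float(m.group("value"))
    return dict(totals)


# Made-up scrape output from two hypothetical proxies.
PROXY_A = """\
# TYPE envoy_cluster_upstream_rq_total counter
envoy_cluster_upstream_rq_total{envoy_cluster_name="outbound|9080||reviews"} 120
envoy_cluster_upstream_rq_total{envoy_cluster_name="outbound|9080||ratings"} 30
"""
PROXY_B = """\
envoy_cluster_upstream_rq_total{envoy_cluster_name="outbound|9080||reviews"} 80
"""

if __name__ == "__main__":
    totals = sum_by_label(
        [PROXY_A, PROXY_B], "envoy_cluster_upstream_rq_total", "envoy_cluster_name"
    )
    for cluster, count in sorted(totals.items()):
        print(f"{cluster}: {count:.0f}")
```

In practice Prometheus would do this aggregation for you with a query over the scraped series; the sketch only shows that per-destination request counts can be recovered without Mixer in the path.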

Adding @mandarjog, as he can speak in more detail about performance tuning. I’ll point to the existing Istio performance page on latency (only at p90, however) in the meantime.

In the context of performance, we, as part of the Layer5 community, are working on an OSS project called Meshery, which should help answer performance-related questions.
Please feel free to give it a try and share your comments with us in Slack.