Istio telemetry overhead and MixerV2 info

Hello,

Currently, for a service mesh with 100k+ mesh-wide requests (Istio 1.1.x with policy disabled and mTLS off), Mixer uses up to 80 cores with only 2 adapters running (kubernetesenv and prometheus). This adds sizable overhead to any cluster, increasing operational costs.

I have tried stripping down metrics, rules, and adapters (promtcpconnectionclosed, promtcpconnectionopen, tcpkubeattrgenrulerule, promtcp, promhttp), and the load only started to decrease when I deleted the kubeattrgenrulerule rule (the kubernetesenv adapter had the most dispatches - 100k+).
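For reference, this is roughly how I inspected and removed those rules (assuming the default istio-system install; your rule names may differ):

# List the Mixer rules currently installed in the control-plane namespace.
kubectl get rules.config.istio.io -n istio-system

# Delete the rule that dispatches to the kubernetesenv adapter (the change
# that actually reduced the load for me).
kubectl delete rule kubeattrgenrulerule -n istio-system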

Moreover, I have tried different resource requests for the Mixer pods, with the same results.

Are there any plans to reduce this usage in the upcoming Istio releases (1.2, 1.3)? I read that MixerV2 is coming, but I couldn't find any docs or info about it; as far as I understand, it will operate at the Envoy level. Will this change the CPU usage of istio-telemetry?

Thank you!

@scenusa the effort to re-architect the policy and telemetry subsystems, loosely called MixerV2, will completely eliminate the istio-telemetry service. In that sense, it will eliminate the CPU usage of istio-telemetry.

Instead of a distinct service, we will deliver tailored extensions to the proxies to achieve the same results. This should significantly reduce the CPU cost of doing so (early prototyping and development efforts have shown promising results).

One of the first things we are doing along these lines is to push all the environmental metadata into the proxies themselves (eliminating the need for the kubernetesenv adapter altogether). Please see my post with a design proposal for that functionality: Feedback requested: Adding k8s environmental metadata to envoy.

There will be other steps following that, including some work to allow for metadata exchange between proxies and the release of the actual extensions themselves.

You can attend next week’s P&T WG meeting if you would like to discuss the plans more directly.

cc: @mandarjog @kuat

Thank you for the clarifications. I have joined istio-team-drive and read about MixerV2. I am looking forward to testing it when it becomes available.

@douglas-reid - Is there a plan to fix the performance issues in Mixer v1, especially wrt telemetry? Or is the recommendation to wait for Mixer v2?

@pnambiar is there a specific characteristic wrt telemetry in Mixer (v1) that you are concerned about or would like addressed? Is it the overall CPU usage? Do you have a particular set of adapters (perhaps non-default) that you are using that may be contributing to the perceived issues?

At this stage, there is not much work being devoted to optimizing mixer (as far as I am aware), especially for telemetry collection and processing. There are some abstract ideas for improvements to mixer to reduce overall resource consumption floating around (removing the proxy in front, changing the deployment strategy, etc.) but I don’t think there is anyone with sufficient motivation to tackle them, given the energy (and priority) being spent on the vNext telemetry collection implementation (and the high cost in maintaining feature parity, testing, etc.).

The reason I ask about your specific adapter usage is that, in our experience with Mixer, a few of the discovered performance issues were attributable to third-party libraries used in adapters (as well as to the design of the internal adapter APIs, which led to poor implementations of adapters themselves). It is possible that you are using an adapter with known scalability issues (which we have done an admittedly poor job of vetting and documenting).

All that being said, if there is something that is an immediate need for you, we may be able to recommend mitigation approaches (depending on the ask).

If, however, the idea is more along the lines of “I wish this were better”, we 100% agree (this is the reason behind the Mixer v2 designs) but unfortunately can only recommend waiting at this point.

We have successful prototypes of the basics done; work is now on testing and solidifying those bits. Our aim is to have something usable (alpha) out in the August timeframe (roughly).

Hope that helps.

@douglas-reid - Thanks. I was thinking of the P99 latency caused by telemetry being published in the data path. Is there a plan to fix that in Mixer v1? This is based on a discussion around the Mixer topic at the recent KubeCon.

I was not at kubecon, so I cannot speak to any discussions that took place there, unfortunately.

There is not any ongoing work to address the nature of the Mixer v1 API or the client interface to that API.

There was some recent work to add knobs to control batch sizes for reporting, however, which should allow you to adjust the tradeoffs between CPU and latency overhead of Report calls to Mixer. However, I would not expect miracles when that is exposed – the current default batch size was selected through some experimentation.

If the p99 latency overhead is a blocker, you may consider going Mixer-less entirely – and using newly exposed options to precisely control proxy telemetry. This will lead to some loss in functionality, but may provide enough to satisfy your requirements in the meantime. I’ve been meaning to provide some examples of ways to manipulate prom scrape config to drive a portion of the existing dashboards just from the proxy metrics as a proof of concept for such an operation. I just need to find some spare time :).
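As a rough sketch of that direction: one of those options is the sidecar.istio.io/statsInclusionPrefixes annotation, set on the workload's pod template (the prefix list below is purely illustrative, not a recommended set):

metadata:
  annotations:
    # Ask the sidecar to expose additional Envoy stat prefixes beyond the
    # minimal default set.
    sidecar.istio.io/statsInclusionPrefixes: "cluster.outbound,listener,http"

The resulting proxy metrics can then be scraped from each sidecar's Prometheus endpoint (port 15090, path /stats/prometheus in recent releases) and fed into the existing dashboards via Prometheus scrape/relabel config.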

Adding @mandarjog, as he can speak in more detail about performance tuning. I'll point to the existing Istio performance page on latency (only at p90, however) in the meantime.

Folks,
In the context of performance, we, as part of the Layer5 community, are working on an OSS project called Meshery, which should be helpful for answering performance-related questions.
Please feel free to give it a try and share your comments with us in Slack.

@douglas-reid I have installed Istio 1.3.0-rc.0 (with Helm) on a test cluster. I would like to view the metrics produced by MixerV2. How can I do that?

Later edit: I've added these 2 EnvoyFilters and sidecar.istio.io/statsInclusionPrefixes: istio on my pods, but in the istio-proxy logs I am getting the following warnings and no stats:

[2019-08-28 15:46:40.894][30][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1727] wasm log: cannot get metadata key: MESH_ID
[2019-08-28 15:46:40.894][30][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1727] wasm log: cannot get metadata key: PLATFORM_METADATA
[2019-08-28 15:46:40.895][30][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1727] wasm log: cannot get metadata key: CANONICAL_TELEMETRY_SERVICE
[2019-08-28 15:46:40.895][30][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1727] wasm log: cannot get metadata key: MESH_ID

and

[2019-08-28 15:52:37.807][46][warning][wasm] [external/envoy/source/extensions/common/wasm/wasm.cc:1727] wasm log: [extensions/stats/plugin.cc:154]::getPeerById() cannot get metadata for: envoy.wasm.metadata_exchange.downstream_id

Thank you!

@scenusa please hang tight for the documentation; it should be coming shortly. I believe there will be a kubectl apply -f command to apply the appropriate EnvoyFilter config, and that should be about it. There is some final tweaking of the metrics format / dashboard queries (we want to figure out the best way to ensure dashboard continuity), but it is almost ready to start being played with.

Thank you, I will wait for the docs.

@douglas-reid https://istio.io/docs/ops/telemetry/in-proxy-service-telemetry/ provides the appropriate EnvoyFilter config. But I cannot find the spec.filters.filterConfig definition in https://istio.io/docs/reference/config/networking/v1alpha3/envoy-filter/#EnvoyFilter; it only has the workloadSelector and configPatches fields.

@scenusa: We are also at a similar mesh-wide RPS level to yours at present (and planning to scale further).

Do you have any advice on MixerV1 configuration to reduce resource usage while still keeping basic telemetry? The CPU usage becomes severely problematic once we scale up, and dwarfs application usage.

Regarding the sidecar.istio.io/statsInclusionPrefixes annotation: we experimented with it but noticed that it only added to the list of prefixes and did not replace them (verified using istioctl proxy-config bootstrap).
I suspect this is related to a PR that tried to ensure there is always a minimal set of stats collected (https://github.com/istio/istio/issues/14204#issuecomment-493446357).
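For reference, the check looked roughly like this (pod name and namespace are placeholders; the exact field names in the bootstrap output may vary by version):

# Dump the sidecar's bootstrap and look at which stat prefixes actually made it
# into Envoy's stats matcher.
istioctl proxy-config bootstrap <pod-name>.<namespace> | grep -i -A 3 prefix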

@niall In my experience, there is no way to reduce MixerV1 CPU usage when there are thousands of requests mesh-wide.
I've tried disabling things, and the only change that significantly decreased usage was disabling the kubernetesenv adapter - the one that adds Kubernetes metadata to telemetry - which left me without important labels on telemetry (such as workload names, labels, service names, etc.).

As for the statsInclusionPrefixes annotation: it doesn't affect MixerV1, because those are proxy-generated stats; to my knowledge they are not sent to MixerV1, but they are available to query from the stats port of the istio-proxy.
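If it helps, this is roughly how I query those proxy-generated stats directly (port 15090 is the proxy's Prometheus stats port in recent releases; pod name and namespace are placeholders):

# Fetch Envoy's Prometheus-format stats straight from a sidecar.
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  curl -s http://localhost:15090/stats/prometheus | grep envoy_cluster_upstream_rq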

Thanks for the response. There is a massive difference in performance with telemetry turned on vs. off once you scale to a large number of requests.

Since telemetry is one of the most important features of Istio, this is really disappointing. As it currently stands, we can turn on telemetry at initial deployment to understand how our deployment behaves, but once we scale up, we need to disable it.

It is not only the CPU usage that affects us; there is also a large impact on system latency.

I experimented with the new reportBatchMaxEntries setting in v1.3.0, but larger batches (at the default of 100) only decreased the performance of the overall system (latency and achievable RPS). Reducing the batch size to, say, 50 gave a slight improvement (with many more, smaller requests going to Mixer).
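For reference, this is roughly how we set the knob (Helm value names as I understand them in the 1.3 chart; please verify against your chart's values.yaml):

# Lower the Mixer report batch size (and, optionally, the max batching delay).
helm upgrade istio install/kubernetes/helm/istio --namespace istio-system \
  --set mixer.telemetry.reportBatchMaxEntries=50 \
  --set mixer.telemetry.reportBatchMaxTime=1s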

@douglas-reid For a base install (with no 3rd-party adapters), do you have any recommendations for achieving better performance with telemetry on? I understand that the MixerV1 design results in massive CPU usage in the telemetry Mixer. Of more concern is the huge effect on traffic latency at the proxy level.

Note that there is sufficient CPU on the test system to accommodate the Mixer pods, which have been scaled up. The RPS to telemetry is minor compared to application traffic through the proxy (50 vs. 1k in one sample test).

@niall This is a complex question. I will do my best to address some of it here, but my suggestion would be to reach out to the Performance and Scalability WG directly for better insights on tuning.

My base recommendations for Mixer at scale typically involve:

  • Turning off any handlers that you can live without (for example, keeping only prometheus and kubernetesenv configured). I think you are probably already doing this, based on your comments.
  • Setting relatively low load-shedding thresholds (I think we default to shedding load when average response latency is > 100ms). There are some other tuning options available but not exposed through Helm; the docs on mixs cover them briefly, and there is a rough sketch after this list.
  • Scaling istio-telemetry pods horizontally (i.e., adding a lot more pods). We've scaled Mixer out in long-running, large-scale test clusters with success. It sounds like you've scaled up the Mixer deployments; have you tried creating a lot of less beefy instances instead? (Also sketched below.)
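A rough sketch of the last two items (Helm value names from the 1.3 chart and mixs flag names from memory; please confirm both against your installed versions before relying on them):

# Horizontal scaling and sizing of istio-telemetry via the Helm chart.
helm upgrade istio install/kubernetes/helm/istio --namespace istio-system \
  --set mixer.telemetry.autoscaleMin=5 \
  --set mixer.telemetry.autoscaleMax=15 \
  --set mixer.telemetry.resources.requests.cpu=1000m

# Load-shedding is tuned with flags on the mixs binary (e.g. by editing the
# istio-telemetry Deployment's container args); check `mixs server --help`:
#   --loadsheddingMode=enforce
#   --averageLatencyThreshold=100ms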

If that doesn't meet your needs, however, my simplest suggestion is to disable Mixer altogether. You can enable more Envoy stats and get most of the telemetry that Mixer would provide out of the box. Everything will be cluster-based, etc., but it will provide the ability to monitor the mesh as you scale up.
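If you go that route, disabling Mixer at install time looks roughly like this (Helm value names from the 1.1 to 1.3 charts; double-check against your values.yaml):

# Turn off both Mixer services; proxy-level Envoy stats remain available.
helm upgrade istio install/kubernetes/helm/istio --namespace istio-system \
  --set mixer.telemetry.enabled=false \
  --set mixer.policy.enabled=false \
  --set global.disablePolicyChecks=true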

@mandarjog generated a brief overview of using just Envoy-native telemetry a while back. It's probably worth a look while we wait for the rest of the in-proxy telemetry work to mature.

My other suggestion would be to attend the next P&T WG meeting, where we can have a fuller discussion.

You are right to point out that the example configuration files use now-deprecated parts of the config. While those portions are deprecated, they are still supported for now and are published as part of the official docs. See: https://github.com/istio/api/blob/2387a8cbe150ae27b2372a22f3d2228625cd664d/networking/v1alpha3/envoy_filter.proto#L321. We'll update the examples to match the new style as the functionality matures.
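For orientation, a minimal skeleton of the new-style API looks roughly like this (the workload label and the patched filter value are placeholders; the actual filter configuration is the one given in the istio.io doc above):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: example-filter
spec:
  workloadSelector:
    labels:
      app: my-app            # hypothetical selector
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.http_connection_manager
            subFilter:
              name: envoy.router
    patch:
      operation: INSERT_BEFORE
      value: {}              # actual filter config goes here (see the linked doc)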
