Mixer - telemetry throttling behavior

When mixer is being overwhelmed with telemetry (say, I have thousands of requests passing through the mesh every second), how does it handle it? Can that behavior be changed? Right now I’m seeing a significantly throttled telemetry stream in my adapter, and I would like to understand how that throttling happens and whether I can affect it or know how much is throttled.

That behavior can be changed. There are load-shedding parameters that Mixer takes as command-line arguments, and I believe some of those options are currently exposed via Helm options. At the moment, throttling can be triggered based on response latency or on a request-rate limit. There are also metrics that show when throttling is happening and how many requests have been throttled.

It may be the case that the default throttling limits added in 1.1 are too aggressive (100ms latency, iirc), especially if you are seeing a bunch of telemetry dropped.
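For reference, the loadshedding knobs in the 1.1 Helm charts look roughly like this (a sketch from memory; check your chart’s values.yaml for the exact names and defaults):

mixer:
  telemetry:
    loadshedding:
      mode: enforce            # disabled | logonly | enforce
      latencyThreshold: 100ms  # average Report() latency above which Mixer starts shedding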


@douglas-reid thank you. I couldn’t find any documentation on this (except regarding the two values that can be edited through Helm charts like you mentioned), so after looking through source code and trying it out, I’m seeing the following:

  • even with --loadsheddingMode set to disabled I’m still not seeing all telemetry at higher volumes. Is there throttling on Envoy’s side? Can that be controlled?

  • --averageLatencyThreshold looks good. It’s hard to verify that it’s numerically exact, but it looks roughly correct.

  • I can’t get --maxRequestsPerSecond to work. No matter how large a value I set it to, as soon as a single request comes in, Mixer immediately prints

warn    loadshedding    Throttled (RateLimit): 'Too many requests have been received by this server in the current time window (limit: 1000000000.000000). Please retry request later.'

and my adapter receives nothing. Here’s my mixer deployment (only the relevant part is shown):

containers:
      - name: mixer
        image: "docker.io/istio/mixer:1.1.1"
        ....
        args:
          ........
          - --maxRequestsPerSecond
          - "1000000000"
          - --loadsheddingMode
          - enforce

I don’t really care about this issue right now, but thought I’d let you know. I will create an issue if my config looks legit.

Thank you very much for helping out; if you could clarify the Envoy-side throttling, that would be awesome.

@Alex I think the bit missing here is to also set --burstSize to something other than 0 (when using --maxRequestsPerSecond). This is sort of buried in the code comments for the option (unfortunately):

If this is set to 0, no traffic will be allowed.
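In other words, something along these lines (illustrative values only; with what is presumably a token-bucket style limiter, the burst size is what admits individual requests, while the rate governs how quickly the bucket refills):

        args:
          ........
          - --loadsheddingMode
          - enforce
          - --maxRequestsPerSecond
          - "1000"
          - --burstSize
          - "100"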

Does that make sense?


Indeed, I missed that. Thanks!

Can you please point me to where to look for throttling on the proxy side? I’m definitely not seeing the entire telemetry stream even with loadshedding set to disabled. I’ve scaled out both Mixer and my adapter to an unreasonable degree to rule out a CPU bottleneck, and it doesn’t appear to be one. Mixer gives no indication that it’s dropping anything (and it shouldn’t), so the proxy must not be sending it.

@Alex can you describe what you are seeing and why you believe that telemetry is “missing”?

Sometimes, “missing” telemetry is a symptom of some traffic being treated as TCP instead of HTTP, etc.

@douglas-reid I have a very basic setup where HTTP requests come in through istio-ingressgateway and are routed to a service backed by a replica set, amazing-product-caller. Each pod in that replica set then fans each incoming request out 10x, making 10 requests to another replica set, amazing-product-callee.

There is also a custom out-of-process Mixer adapter that is subscribed to tracespan instances and logs/plots every trace span it receives from Mixer. It is configured to receive every trace span, with no sampling or other limitations.
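For reference, the dispatch side is just a plain rule with no match clause and no sampling, roughly like this (names such as h1 and myspan stand in for my actual handler/instance names):

apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: send-all-tracespans
  namespace: istio-system
spec:
  # no match clause: an instance is dispatched for every request
  actions:
  - handler: h1.handler.istio-system
    instances:
    - myspan.instance.istio-system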

This is what it looks like after a single request has been issued (mixer/pilot pods are not shown on purpose, even though Check() and Report() tracespans are also being received):
[screenshot: the trace spans received by the adapter for a single request]

The above picture is correct and expected.

At higher request rates, however, only a fraction of the trace spans is being reported, even with load shedding disabled in Mixer. I’ve scaled everything out to ensure nothing is being dropped due to insufficient resources, but I still see only part of the telemetry: my load-testing suite reports a certain number of successful requests, yet only a portion of those shows up in the adapter.

In this load test, for instance, the cluster has been under a constant load of about 50 requests per second for 15 minutes. The actual number of successful requests issued against the gateway is roughly 45,000.
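(That matches the arithmetic: 50 req/s × 60 s/min × 15 min = 45,000.)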

But the adapter has only received about 22,000:

[screenshot: the adapter’s count of received trace spans, roughly 22,000]

My guess is that the spans must be sampled or throttled on the proxy side; otherwise it doesn’t make sense that the Mixer instances (and the custom adapter) are nowhere near their resource quotas, yet trace spans are still somehow dropped.

Any idea what to look for here?

@Alex,

Are you using any in-process adapters that you could diff against the out-of-process adapter? Maybe just a Prometheus sanity check of the number of requests or similar? It is possible that some of the instances are not making it from Mixer to your OOP adapter for some reason.
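For example (assuming the standard Prometheus addon and the default istio_requests_total metric), a query along these lines gives a rough request count to compare against what your adapter received:

sum(increase(istio_requests_total{reporter="destination"}[15m]))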

Other things to check would be the self-monitoring metrics for Mixer (iirc, now exposed on :15014). Look at any mixer metric ending in _errors. There might be dispatch errors.
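For example, something like this (assuming the stock istio-telemetry deployment and the standard /metrics endpoint; adjust names to your install):

kubectl -n istio-system port-forward deploy/istio-telemetry 15014:15014
# in another terminal:
curl -s localhost:15014/metrics | grep _errors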

The next thing to check would be some of the Envoy metrics for calls to Mixer. Look at envoy_http_mixer_filter_total_report_calls and related metrics to see if there are any issues and whether the values are in line with your expectations (note: Reports should be batched).

Hope that helps.


@douglas-reid
Thank you for the guidance. I’ve checked the self-monitoring metrics for one of the Mixer pods. I’m not seeing any dispatch errors, but there are total dispatch-count metrics for my adapter as well as for a bunch of other adapters (in-process ones, I imagine), and here’s what they look like:

mixer_runtime_dispatches_total{adapter="<my-adapter>.adapter.istio-system",error="false",handler="h1.handler.istio-system",meshFunction="tracespan.template.istio-system"} 83045
mixer_runtime_dispatches_total{adapter="kubernetesenv",error="false",handler="kubernetesenv.istio-system",meshFunction="kubernetes"} 842342
mixer_runtime_dispatches_total{adapter="prometheus",error="false",handler="prometheus.istio-system",meshFunction="metric"} 83045
mixer_runtime_dispatches_total{adapter="stdio",error="false",handler="stdio.istio-system",meshFunction="logentry"} 83045

The dispatch counts match across all adapters (except kubernetesenv, which has roughly 10x more for some reason), so there’s no reason to believe anything is being dropped specifically for the OOP adapter. However, I estimate this value to be about 1/8th of what I would expect (even less, in fact, since I would also expect the dispatch count to include Report() and Check() calls, etc.).

I think I need to start looking at Envoy. You mentioned above which metric to check (envoy_http_mixer_filter_total_report_calls); can you please point me to how to get that metric? I’m not seeing any references anywhere in the Mixer or Envoy docs. Can you give me some keywords, please? I’m new to Istio, but I’m even newer to Envoy 🙂

@douglas-reid
I ended up hitting Envoy’s admin endpoint directly from the pod, and here’s what I get after the load has been applied (all values were 0 originally, the pod was fresh):

http_mixer_filter.total_check_cache_hit_accepts: 3684
http_mixer_filter.total_check_cache_hit_denies: 0
http_mixer_filter.total_check_cache_hits: 3684
http_mixer_filter.total_check_cache_misses: 172
http_mixer_filter.total_check_calls: 3856
http_mixer_filter.total_quota_cache_hit_accepts: 0
http_mixer_filter.total_quota_cache_hit_denies: 0
http_mixer_filter.total_quota_cache_hits: 0
http_mixer_filter.total_quota_cache_misses: 0
http_mixer_filter.total_quota_calls: 0
http_mixer_filter.total_remote_call_cancellations: 0
http_mixer_filter.total_remote_call_other_errors: 0
http_mixer_filter.total_remote_call_retries: 0
http_mixer_filter.total_remote_call_send_errors: 0
http_mixer_filter.total_remote_call_successes: 172
http_mixer_filter.total_remote_call_timeouts: 0
http_mixer_filter.total_remote_calls: 172
http_mixer_filter.total_remote_check_accepts: 172
http_mixer_filter.total_remote_check_calls: 172
http_mixer_filter.total_remote_check_denies: 0
http_mixer_filter.total_remote_quota_accepts: 0
http_mixer_filter.total_remote_quota_calls: 0
http_mixer_filter.total_remote_quota_denies: 0
http_mixer_filter.total_remote_quota_prefetch_calls: 0
http_mixer_filter.total_remote_report_calls: 3740
http_mixer_filter.total_remote_report_other_errors: 0
http_mixer_filter.total_remote_report_send_errors: 0
http_mixer_filter.total_remote_report_successes: 3740
http_mixer_filter.total_remote_report_timeouts: 0
http_mixer_filter.total_report_calls: 42416
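
For anyone else looking for these stats: I pulled them from the sidecar’s admin endpoint, roughly like this (assuming the default admin port 15000 and the istio-proxy sidecar container; substitute your own pod name):

kubectl exec <pod> -c istio-proxy -- curl -s localhost:15000/stats | grep http_mixer_filter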

So no errors are reported, total_report_calls = 42416, total_remote_report_calls = 3740.

Since there are multiple pods serving requests, plus unrelated Report() and Check() calls, it’s hard to estimate the expected numbers accurately, but the difference between these two values appears to reflect (at least roughly) the difference between what I’m seeing and what I expect to be seeing. What are the semantics of these counters? What is “remote” vs. “non-remote”? What do I tweak to control this difference?

@Alex My apologies for the delayed response. I missed the update.

The proxy batches calls to Mixer for Report(). As a result, there is a counter for the total number of Report() calls (per request, not batched) and a counter for the total number of remote calls (the batched calls actually sent to Mixer). The ratio should (ideally) be 100:1 (total:remote), so the numbers you are listing seem to be OK.
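(For concreteness: 42,416 / 3,740 ≈ 11.3, i.e. an average of roughly 11 Report() entries per remote call; since batches are also flushed on a timer, a ratio below 100:1 is expected at moderate load.)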

If you think there were ~42k requests, you should be seeing dispatches to your OOP adapter of the same magnitude. Did you find the mixer_runtime_dispatches_total metric for your adapter?