Istio Ingress Gateway Queuing Requests for 10+ Seconds?

fbcbarbosa · December 12, 2019, 9:27pm

Hey everyone,

So we’ve recently enabled the tracing options for Istio in our clusters, and I’ve noticed that the ingress-gateway seems to be holding/queuing up requests for several seconds at a time.

For example, this request seems to have been held for 10 seconds at the ingress gateway before being passed on to the “mini-main” service in the “prod” namespace (this could have been a network issue, but that seems unlikely, as I’ve only observed it at hops between the ingress gateway and other services).

We were running only 4 gateways (and their CPU usage was very low). After this I tried scaling them up to 12 but saw no meaningful reduction in the p90 and p99 latency, and the behavior persisted. I’ve considered that this could instead be a problem with the tracing data, but that is generated by Istio as well.

Anyway, is there maybe some fine-tuning I could do, like some undocumented request queue size I could edit? Maybe someone has had a similar issue? Alternatively, is there a way I could get metrics from the ingress gateway (like some sort of request queue size, etc.)?

Thanks!

Edit: I did find this post, which might be helpful, and I realized that I could scrape Prometheus metrics directly from the Envoy proxies in the ingress gateway. I guess monitoring the rq_total, rq_active and downstream_rq_active metrics might help in this case; I will try that out tomorrow (we have Istio metrics in Prometheus, but I didn’t realize until now how useful raw Envoy metrics would be).
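A minimal sketch of what checking those raw Envoy metrics could look like, assuming the stats have already been fetched as Prometheus text (for example from the Envoy admin endpoint /stats/prometheus, which in Istio is usually on port 15000). The function name, the sample data, and the exact metric names below are illustrative assumptions, not taken from the cluster in this thread.

```python
from collections import defaultdict

def sum_metrics(text, wanted):
    """Sum the values of the wanted metrics across all label sets.

    `text` is Prometheus text-format output without timestamps;
    `wanted` is a set of metric names (assumed names, see SAMPLE).
    """
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line
        series, _, value = line.rpartition(" ")
        # The metric name is everything before the label block, if any
        name = series.split("{", 1)[0]
        if name in wanted:
            totals[name] += float(value)
    return dict(totals)

# Made-up sample; real output has many more series and labels.
SAMPLE = """\
# TYPE envoy_cluster_upstream_rq_active gauge
envoy_cluster_upstream_rq_active{cluster_name="outbound|80||mini-main.prod.svc.cluster.local"} 3
envoy_cluster_upstream_rq_active{cluster_name="outbound|80||other.prod.svc.cluster.local"} 1
# TYPE envoy_cluster_upstream_rq_pending_active gauge
envoy_cluster_upstream_rq_pending_active{cluster_name="outbound|80||mini-main.prod.svc.cluster.local"} 7
"""

WANTED = {
    "envoy_cluster_upstream_rq_active",
    "envoy_cluster_upstream_rq_pending_active",
}

if __name__ == "__main__":
    for name, value in sorted(sum_metrics(SAMPLE, WANTED).items()):
        print(name, value)
```

A pending-request gauge that stays above zero would be the closest thing to a "queue size" here, though whether it explains the 10-second gaps is still an open question.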

Also, scaling up further actually might have helped, but it might be a bit early to tell.

fbcbarbosa · December 13, 2019, 9:05pm

Yeah, I’ve managed to load the metrics into Prometheus, but they don’t seem all that helpful. As a matter of fact, the request (rq_) and connection (cx_) metrics don’t seem to reflect the connections in the cluster. I’m not sure what they are meant to reflect, really.

For example, envoy_cluster_upstream_rq_total (which should be the “total requests” at the proxy) is increasing painfully slowly at the clusters, despite them receiving thousands of requests per minute.
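One way to compare a counter like this against the expected traffic is to turn two scraped samples into a per-minute rate. This is only a rough sketch with a hypothetical helper name and made-up numbers; it treats a decreasing counter as a reset (for example after a pod restart), which is a simplifying assumption.

```python
def rate_per_minute(v0, t0, v1, t1):
    """Average per-minute increase of a counter between two samples.

    v0 and v1 are counter values; t0 and t1 are timestamps in seconds.
    """
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    # If the counter went down it was reset, so count from zero
    increase = v1 - v0 if v1 >= v0 else v1
    return increase * 60.0 / (t1 - t0)

if __name__ == "__main__":
    # Counter went from 1000 to 1600 over 30 seconds
    print(rate_per_minute(1000.0, 0.0, 1600.0, 30.0))  # 1200.0
```

If the summed rate across all gateway pods is far below the known request volume, the scrape is probably missing pods or series rather than the counter being wrong.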

So I guess these metrics don’t really help.

The original problem seems to be gone (although this is hard to confirm without a reliable way to monitor pending connections), but I’m running 25 ingress gateways and allocating 6GB of RAM for them all, which seems like overkill.

pandey-adarsh147 · July 18, 2020, 4:00pm

Did you have any luck reducing the request queuing? We are facing a similar issue.
