Business value of distributed tracing

Hi,

I would like to get feedback from the community about how you use distributed tracing in Istio to find issues in your web applications. Let me elaborate.

Currently, when you install Istio you get distributed tracing by default, for free: you can open the Jaeger UI and see “stuff” going on. However, I’m more than skeptical about whether that provides actual value, mainly because of the huge amount of noise introduced by traces and spans coming from and going to Istio’s own components, e.g. mixer, policy, ingress, etc.

Let me share a trivial example of what I mean. I have a simple application consisting of a UI that calls a backend to retrieve information (data indexed in Elasticsearch) and displays it to the user in a grid. After using the application for a while, I go to Jaeger, and when I click into a trace I get 10 spans, of which only 2 provide real value in terms of the business transaction I’ve made (and even 1 of those 2 is actually the internal call of the sidecar):

image

That makes it quite hard to see clearly what’s going on.
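
To make that ratio concrete, here is a rough sketch of how one could separate application spans from Istio component spans in a trace exported from Jaeger as JSON. It assumes the export layout where each span refers to an entry in a `processes` map through its `processID`. The function name `business_spans` and the service names in `NOISY_SERVICES` are hypothetical and would need adjusting to whatever shows up in your own traces.

```python
from typing import Any, Dict, List, Set

# Hypothetical service names: replace with the ones that appear in your traces.
NOISY_SERVICES: Set[str] = {"istio-mixer", "istio-policy", "istio-telemetry"}


def business_spans(
    trace: Dict[str, Any], noisy: Set[str] = NOISY_SERVICES
) -> List[Dict[str, Any]]:
    """Return the spans of one trace whose emitting service is not in `noisy`.

    Assumes a Jaeger-style trace dict with a `spans` list and a `processes`
    map keyed by process ID, each process carrying a `serviceName`.
    """
    processes = trace.get("processes", {})
    kept = []
    for span in trace.get("spans", []):
        process = processes.get(span.get("processID"), {})
        if process.get("serviceName", "") not in noisy:
            kept.append(span)
    return kept


if __name__ == "__main__":
    # Tiny hand-written trace, only for illustration.
    sample = {
        "spans": [
            {"spanID": "a", "processID": "p1", "operationName": "GET /search"},
            {"spanID": "b", "processID": "p2", "operationName": "Check"},
        ],
        "processes": {
            "p1": {"serviceName": "backend"},
            "p2": {"serviceName": "istio-policy"},
        },
    }
    for span in business_spans(sample):
        print(span["operationName"])
```

This only hides spans after the fact; it does not reduce what the mesh generates or stores.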

Is anyone else running into this? I would love to hear how you are actually benefiting from all this magic provided by Istio.

Thanks!

A few thoughts to share here as part of this discussion:

  1. We’ve been working hard to reduce the trace span data generated by the Istio components themselves in the forthcoming 1.1 release. One big change is to remove the tracing config from the proxy in front of Mixer (meaning that in your example, the istio-policy span will disappear entirely). This also means that it is now possible to disable tracing within the Mixer component entirely by editing command-line flags in the deployment spec. Another change is to have Mixer better respect prior sampling decisions (meaning no spans will be generated for Report calls).

    Combined with improved caching in the proxies, the number of times you see calls out to the istio-policy service (with the additional mixer spans) should be minimized. For a recent PR dealing with mixer tracing, it took 1000+ requests to generate a single distributed trace with istio-mixer spans.

  2. There was a recent FR and subsequent PR asking for more information to be put in the spans generated by Mixer. For deployments that involve use of Mixer for policy checking, having those decisions (and the adapters that were involved) recorded in the distributed trace was considered quite useful. Mixer, when called, actively participates in the call graph (making enforcement decisions, adding latency, etc.). Having that reflected is desired in certain deployments.

    If you are not interested in policy enforcement in Mixer, one alternative is to simply turn that feature off (mesh-wide or, via an in-flight PR, through deployment spec annotations). That would eliminate the spans altogether (and save on per-request overhead).

Hope this information is helpful. Always happy to discuss more. Look forward to thoughts from others.

Thanks for listening,
Doug.

Doug,

Thanks so much for shedding light on this. It seems that 1.1 is the release to wait for, then.

Not sure if I understood it correctly, but my use case is that I’m interested in using policies with Mixer, but I’d like to have a mechanism by which I could decide on whether I want Mixer’s spans to be part of a trace or not. Would that be possible in 1.1?

Thanks.

Doug,

Any feedback on my last question?

Thanks!

One mechanism for controlling that in 1.1 will be through command-line flags for the istio-policy deployment. By removing the trace URL args from the deployment, you can prevent Mixer from participating in tracing altogether.
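
As a rough illustration of what "removing the trace URL args" amounts to, the sketch below strips tracing-related arguments from a Deployment manifest that has already been loaded into a plain Python dict. The flag prefix `--trace_` and the example flag values are assumptions for illustration only, as is the helper name `strip_trace_args`; check the args actually present in your own istio-policy deployment before relying on them.

```python
import copy
from typing import Any, Dict

# Assumed prefix for Mixer's tracing flags; verify against your deployment.
TRACE_FLAG_PREFIX = "--trace_"


def strip_trace_args(
    deployment: Dict[str, Any], prefix: str = TRACE_FLAG_PREFIX
) -> Dict[str, Any]:
    """Return a copy of a Deployment dict with tracing args removed.

    Every container's `args` list is filtered; the input is left untouched.
    """
    result = copy.deepcopy(deployment)
    pod_spec = result.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        if "args" in container:
            container["args"] = [
                arg for arg in container["args"] if not str(arg).startswith(prefix)
            ]
    return result


if __name__ == "__main__":
    # Hand-written fragment; the flag values are illustrative, not authoritative.
    manifest = {
        "spec": {"template": {"spec": {"containers": [
            {"name": "mixer", "args": [
                "--address=unix:///sock/mixer.socket",
                "--trace_zipkin_url=http://zipkin:9411/api/v1/spans",
            ]},
        ]}}},
    }
    print(strip_trace_args(manifest)["spec"]["template"]["spec"]["containers"][0]["args"])
```

The edited manifest would still have to be applied to the cluster, and the change affects every request, not a selected subset.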

There isn’t a simple way to opt Mixer in or out, per request, of a trace that was already selected for distributed tracing (if the request is sampled, then Mixer will participate).
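
The reason is that the sampling decision is made once, up front, and every later participant simply honors it. A minimal sketch of that idea, assuming B3 propagation headers (`x-b3-sampled`, `x-b3-flags`) on the incoming request; this is not Mixer's actual code, and `is_sampled` is a hypothetical helper name:

```python
from typing import Mapping


def is_sampled(headers: Mapping[str, str]) -> bool:
    """Report whether an upstream hop already chose to trace this request.

    Follows B3 conventions: `x-b3-flags: 1` is the debug flag and implies
    sampling; otherwise `x-b3-sampled` carries "1" (or legacy "true").
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    if normalized.get("x-b3-flags") == "1":
        return True
    return normalized.get("x-b3-sampled") in ("1", "true")


if __name__ == "__main__":
    incoming = {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-Sampled": "1"}
    if is_sampled(incoming):
        print("participant records a span for this request")
    else:
        print("participant stays silent for this request")
```

A component written this way has no per-request switch of its own: it either follows the inherited decision or has tracing turned off entirely.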

You could switch to using Mixer exclusively for tracing (instead of tracing directly from Envoy) and use special match clauses to control tracing more selectively, but that would be a pretty advanced use case, and I’m not sure it is what you are asking for here.