Help with Production Jaeger and Istio

I’ve got an issue setting up Istio in conjunction with a production Jaeger installation, and I was hoping to get other folks’ thoughts. The information here seems a little out of date. I have a Jaeger installation managed by jaeger-operator (version 1.14.0), installed into the observability namespace. I also have a small service called telemetry_canary whose whole job is to generate spans and metrics to help prove out configuration changes. Before introducing Istio, telemetry_canary is able to emit spans to the jaeger-collector service using the following endpoint: http://jaeger-collector.observability:14268/api/traces. telemetry_canary is Java code, and I’m not using jaeger-operator’s ability to inject an agent, as opencensus-java prefers delivering directly to the collector.
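For context, the collector endpoint can be sanity-checked from inside the cluster. This is a sketch, not what telemetry_canary actually runs: it assumes the pod carries an `app=telemetry-canary` label and has curl available, neither of which is shown above.

```shell
# Hypothetical connectivity check: POST to the collector's HTTP endpoint from a
# pod in the samples namespace. jaeger-collector expects a Thrift payload, so an
# empty POST should come back as a 4xx from the collector itself -- which still
# proves HTTP connectivity. A 503 here would instead indicate a routing problem.
POD=$(kubectl -n samples get pod -l app=telemetry-canary \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n samples exec "$POD" -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST http://jaeger-collector.observability:14268/api/traces
```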

Anyhow, here’s the state of things in the working configuration. First, the observability namespace:

~ > kubectl -n observability get pods
NAME                                READY   STATUS    RESTARTS   AGE
elasticsearch-master-0              1/1     Running   0          5m4s
elasticsearch-master-1              1/1     Running   0          5m4s
elasticsearch-master-2              1/1     Running   0          5m4s
jaeger-collector-78dd68f6d7-2s6tx   1/1     Running   3          4m49s
jaeger-operator-6d6cc86d89-8qddt    1/1     Running   0          5m
jaeger-query-59c96887ff-2b25w       2/2     Running   3          4m49s
~ > kubectl -n observability get service
NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                  AGE
elasticsearch-master            ClusterIP   10.59.242.114   <none>        9200/TCP,9300/TCP                        5m8s
elasticsearch-master-headless   ClusterIP   None            <none>        9200/TCP,9300/TCP                        5m8s
jaeger-collector                ClusterIP   10.59.251.218   <none>        9411/TCP,14250/TCP,14267/TCP,14268/TCP   4m53s
jaeger-collector-headless       ClusterIP   None            <none>        9411/TCP,14250/TCP,14267/TCP,14268/TCP   4m53s
jaeger-operator                 ClusterIP   10.59.252.131   <none>        8383/TCP                                 4m54s
jaeger-query                    ClusterIP   10.59.250.8     <none>        16686/TCP                                4m53s
~ > kubectl describe namespace observability
Name:         observability
Labels:       istio-injection=disabled
Annotations:  <none>
Status:       Active

Resource Quotas
 Name:                       gke-resource-quotas
 Resource                    Used  Hard
 --------                    ---   ---
 count/ingresses.extensions  1     5k
 count/jobs.batch            0     10k
 pods                        6     5k
 services                    6     1500

No resource limits.

And now the samples namespace:

~ > kubectl -n samples get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
telemetry-canary   1/1     1            1           9m35s
~ > kubectl -n samples get pods
NAME                                READY   STATUS    RESTARTS   AGE
telemetry-canary-79868d748f-r6tvk   1/1     Running   0          11m
~ > kubectl describe namespace samples
Name:         samples
Labels:       istio-injection=disabled
Annotations:  <none>
Status:       Active

Resource Quotas
 Name:                       gke-resource-quotas
 Resource                    Used  Hard
 --------                    ---   ---
 count/ingresses.extensions  0     5k
 count/jobs.batch            0     10k
 pods                        1     5k
 services                    0     1500

No resource limits.

As you can see, I have explicitly disabled Istio injection, and there are no running sidecars in either namespace. The traces for telemetry_canary arrive in Jaeger as expected. Things change when I enable Istio injection in the samples namespace. The scenario then becomes as follows:

~ > kubectl get namespace samples
NAME      STATUS   AGE
samples   Active   15m
~ > kubectl describe namespace samples
Name:         samples
Labels:       istio-injection=enabled
Annotations:  <none>
Status:       Active

Resource Quotas
 Name:                       gke-resource-quotas
 Resource                    Used  Hard
 --------                    ---   ---
 count/ingresses.extensions  0     5k
 count/jobs.batch            0     10k
 pods                        2     5k
 services                    0     1500

No resource limits.
~ > kubectl -n samples get pods
NAME                                READY   STATUS    RESTARTS   AGE
telemetry-canary-79868d748f-z5hfg   2/2     Running   0          56s

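For reference, the change above amounts to relabeling the namespace and recreating the pods. A sketch (the `rollout restart` form assumes kubectl 1.15 or newer):

```shell
# Flip the namespace label from disabled to enabled...
kubectl label namespace samples istio-injection=enabled --overwrite
# ...then recreate the pods so the sidecar injector gets a chance to run.
kubectl -n samples rollout restart deployment/telemetry-canary
```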
I’ve forced a redeploy of the telemetry_canary pods, and the Istio sidecar is injected as expected. From the pod description:

  istio-proxy:
    Container ID:  docker://aeb5c13cb54f0c8e9bc77a6efa5d6b37e1a4c35fb9e161d3035a26aa50bd7328
    Image:         docker.io/istio/proxyv2:1.3.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:f3f68f9984dc2deb748426788ace84b777589a40025085956eb880c9c3c1c056
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --configPath
      /etc/istio/proxy
      --binaryPath
      /usr/local/bin/envoy
      --serviceCluster
      telemetry-canary.$(POD_NAMESPACE)
      --drainDuration
      45s
      --parentShutdownDuration
      1m0s
      --discoveryAddress
      istio-pilot.istio-system:15010
      --zipkinAddress
      jaeger-collector.observability:9411
      --dnsRefreshRate
      300s
      --connectTimeout
      10s
      --proxyAdminPort
      15000
      --concurrency
      2
      --controlPlaneAuthPolicy
      NONE
      --statusPort
      15020
      --applicationPorts
      8081

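One thing worth noting from the args above: `--proxyAdminPort 15000` means Envoy’s admin interface is reachable inside the sidecar, which can show how the proxy sees the collector upstream. A sketch, assuming the same `app=telemetry-canary` label and that the proxyv2 image ships curl:

```shell
# Hypothetical: ask the sidecar's Envoy admin interface about its clusters and
# look for the jaeger-collector upstream on port 14268, including health status.
POD=$(kubectl -n samples get pod -l app=telemetry-canary \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n samples exec "$POD" -c istio-proxy -- \
  curl -s localhost:15000/clusters | grep 14268
```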
Once the sidecar is in place, no traces make it into Jaeger any longer. From telemetry_canary’s logs:

Oct 03, 2019 12:21:07 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 10 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:12 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 11 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:17 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 14 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:22 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 6 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:27 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 8 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure

My understanding of what’s happening here is that telemetry_canary is now trying to route to jaeger-collector.observability through the mesh, but the observability namespace is outside the mesh, which earns me a 503. I also tried pointing telemetry_canary directly at jaeger-collector’s cluster IP, which did not work either, for whatever that is worth. Enabling istio-injection on the observability namespace fails as well, because jaeger-operator seems to explicitly disable injection on the pods it manages: Jaeger’s Elasticsearch cluster joins the mesh, but the collector – left outside it – fails to connect to Elasticsearch and therefore never starts, for want of a connection to its ES cluster.

I’m a little stumped as to what to try next. If anyone else has combined a production Jaeger with Istio, I’d be very happy to hear about your setup, and I’m happy to add any more information here as well.

I think you will need to disable mTLS (assuming you have it enabled); see https://istio.io/docs/tasks/security/authn-policy/#request-from-istio-services-to-non-istio-services

If this works, could you raise an issue in https://github.com/istio/istio.io?

Hi @objectiser, it took me a minute to figure out how to disable mTLS, but indeed, that was the issue. Now that I’ve redeployed Istio without mTLS enabled, I see traces arriving as expected. I will raise an issue.
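For anyone finding this later, the mesh-wide disable was along these lines. This is only a sketch against the Istio 1.3 Helm chart; the chart path and install method here are my own assumptions and may not match your setup:

```shell
# Hypothetical: re-render the Istio 1.3 Helm chart with mesh-wide mTLS turned
# off and re-apply it. Assumes Istio was installed from the release bundle.
helm template install/kubernetes/helm/istio \
  --name istio --namespace istio-system \
  --set global.mtls.enabled=false | kubectl apply -f -
```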

Ah, actually, I may have gone a little further than you intended: I’ve disabled mTLS entirely, globally, but the link you provided seems to suggest disabling mTLS only for traffic to the affected namespace. I’ll give that a shot when I make a pass at re-enabling mTLS.
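The scoped version from the linked doc would presumably be a DestinationRule that disables client TLS just for the collector host, something like the following sketch (the resource name and the namespace it lives in are placeholders of mine):

```shell
# Hypothetical DestinationRule: turn off client-side TLS only for mesh traffic
# headed to the (non-mesh) jaeger-collector service, leaving mTLS on elsewhere.
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: jaeger-collector-plaintext
  namespace: samples
spec:
  host: jaeger-collector.observability.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE
EOF
```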