I’ve got an issue setting up istio in conjunction with a production jaeger installation and I was hoping to get other folks’ thoughts. The information here seems a little out of date. I have a jaeger installation managed by jaeger-operator that is installed into the observability namespace. Specifically, I have jaeger-operator 1.14.0 installed. Now, I have a small service called telemetry_canary whose whole job is to generate spans and metrics to help prove out configuration changes. Before introducing istio, telemetry_canary is able to emit spans into the jaeger-collector service using the following endpoint: http://jaeger-collector.observability:14268/api/traces. I’m not using jaeger-operator’s ability to inject an agent, as opencensus-java has a preference for delivering directly to the collector; telemetry_canary is Java code.
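For reference, pointing opencensus-java straight at the collector looks roughly like the sketch below. This is a minimal sketch, not telemetry_canary’s actual code: the class name and the registered service name are my assumptions, and the exporter shown is the stock opencensus-java Jaeger trace exporter.
import io.opencensus.exporter.trace.jaeger.JaegerTraceExporter;

public final class TracingSetup {
  public static void init() {
    // Register a Jaeger exporter that posts Thrift-encoded spans directly
    // to the collector's HTTP endpoint, bypassing the jaeger-agent entirely.
    JaegerTraceExporter.createAndRegister(
        "http://jaeger-collector.observability:14268/api/traces",
        "telemetry-canary");
  }
}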
Anyhow, here’s the state of things in the working configuration. First, the observability namespace:
~ > kubectl -n observability get pods
NAME READY STATUS RESTARTS AGE
elasticsearch-master-0 1/1 Running 0 5m4s
elasticsearch-master-1 1/1 Running 0 5m4s
elasticsearch-master-2 1/1 Running 0 5m4s
jaeger-collector-78dd68f6d7-2s6tx 1/1 Running 3 4m49s
jaeger-operator-6d6cc86d89-8qddt 1/1 Running 0 5m
jaeger-query-59c96887ff-2b25w 2/2 Running 3 4m49s
~ > kubectl -n observability get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
elasticsearch-master ClusterIP 10.59.242.114 <none> 9200/TCP,9300/TCP 5m8s
elasticsearch-master-headless ClusterIP None <none> 9200/TCP,9300/TCP 5m8s
jaeger-collector ClusterIP 10.59.251.218 <none> 9411/TCP,14250/TCP,14267/TCP,14268/TCP 4m53s
jaeger-collector-headless ClusterIP None <none> 9411/TCP,14250/TCP,14267/TCP,14268/TCP 4m53s
jaeger-operator ClusterIP 10.59.252.131 <none> 8383/TCP 4m54s
jaeger-query ClusterIP 10.59.250.8 <none> 16686/TCP 4m53s
~ > kubectl describe namespace observability
Name: observability
Labels: istio-injection=disabled
Annotations: <none>
Status: Active
Resource Quotas
Name: gke-resource-quotas
Resource Used Hard
-------- --- ---
count/ingresses.extensions 1 5k
count/jobs.batch 0 10k
pods 6 5k
services 6 1500
No resource limits.
And now the samples namespace:
~ > kubectl -n samples get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
telemetry-canary 1/1 1 1 9m35s
~ > kubectl -n samples get pods
NAME READY STATUS RESTARTS AGE
telemetry-canary-79868d748f-r6tvk 1/1 Running 0 11m
~ > kubectl describe namespace samples
Name: samples
Labels: istio-injection=disabled
Annotations: <none>
Status: Active
Resource Quotas
Name: gke-resource-quotas
Resource Used Hard
-------- --- ---
count/ingresses.extensions 0 5k
count/jobs.batch 0 10k
pods 1 5k
services 0 1500
No resource limits.
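(For completeness, the istio-injection labels shown in the describe output above are plain namespace labels; the exact invocations below are my assumption of how they were applied, and any equivalent labelling works.)
~ > kubectl label namespace observability istio-injection=disabled --overwrite
~ > kubectl label namespace samples istio-injection=disabled --overwrite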
As you can see I have explicitly disabled istio injection and there are no running sidecars in either namespace. The traces from telemetry_canary arrive in jaeger as expected. Things change when I enable istio injection in the samples namespace. The scenario then becomes as follows:
~ > kubectl get namespace samples
NAME STATUS AGE
samples Active 15m
~ > kubectl describe namespace samples
Name: samples
Labels: istio-injection=enabled
Annotations: <none>
Status: Active
Resource Quotas
Name: gke-resource-quotas
Resource Used Hard
-------- --- ---
count/ingresses.extensions 0 5k
count/jobs.batch 0 10k
pods 2 5k
services 0 1500
No resource limits.
~ > kubectl -n samples get pods
NAME READY STATUS RESTARTS AGE
telemetry-canary-79868d748f-z5hfg 2/2 Running 0 56s
I’ve forced a redeploy of the telemetry_canary pods and the istio sidecar is injected as expected. From the pod description:
istio-proxy:
Container ID: docker://aeb5c13cb54f0c8e9bc77a6efa5d6b37e1a4c35fb9e161d3035a26aa50bd7328
Image: docker.io/istio/proxyv2:1.3.0
Image ID: docker-pullable://istio/proxyv2@sha256:f3f68f9984dc2deb748426788ace84b777589a40025085956eb880c9c3c1c056
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
telemetry-canary.$(POD_NAMESPACE)
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system:15010
--zipkinAddress
jaeger-collector.observability:9411
--dnsRefreshRate
300s
--connectTimeout
10s
--proxyAdminPort
15000
--concurrency
2
--controlPlaneAuthPolicy
NONE
--statusPort
15020
--applicationPorts
8081
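To see what the freshly injected sidecar thinks of the collector, something like the following should dump the relevant outbound Envoy cluster; the command assumes istioctl 1.3.x semantics and I haven’t pasted its output here:
~ > istioctl proxy-config cluster telemetry-canary-79868d748f-z5hfg.samples | grep jaeger-collector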
Once this is enabled there are no longer any traces making it into jaeger. From telemetry_canary’s logs:
Oct 03, 2019 12:21:07 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 10 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:12 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 11 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:17 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 14 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:22 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 6 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
Oct 03, 2019 12:21:27 AM io.opencensus.exporter.trace.util.TimeLimitedHandler handleException
WARNING: Failed to export traces: java.util.concurrent.ExecutionException: io.jaegertracing.internal.exceptions.SenderException: Could not send 8 spans, response 503: upstream connect error or disconnect/reset before headers. reset reason: connection failure
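For what it’s worth, a plain curl from inside the app container should hit the same 503 from the sidecar if the diagnosis below is right (a sketch, assuming curl is present in the image and that the app container is named telemetry-canary):
~ > kubectl -n samples exec -it telemetry-canary-79868d748f-z5hfg -c telemetry-canary -- curl -v http://jaeger-collector.observability:14268/api/traces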
My understanding of what’s happening here is that telemetry_canary is trying to route to jaeger-collector.observability through the mesh, but the observability namespace is outside of the mesh, which earns me a 503. I actually did try pointing telemetry_canary directly at jaeger-collector’s cluster IP, which also did not work, for whatever that is worth. Enabling istio injection in the observability namespace fails, as jaeger-operator seems to explicitly set istio injection to false on the pods it manages: jaeger’s elasticsearch cluster joins the mesh, the collector fails to connect to it (since the collector itself remains outside the mesh) and therefore never starts for want of a connection to its ES cluster.
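For concreteness, the two knobs the istio docs point at for talking to a destination that has no sidecar are a DestinationRule that disables mTLS for that host, or the traffic.sidecar.istio.io/excludeOutboundPorts annotation on the client pod so traffic to the collector port bypasses the sidecar entirely. A sketch of the former is below; it is untested on my side and only relevant if mTLS is actually being attempted:
~ > kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: jaeger-collector-no-mtls
  namespace: samples
spec:
  host: jaeger-collector.observability.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE
EOF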
I’m a little stumped as to what to try next, or whether either of the knobs above is even the right direction. If anyone else has combined a production jaeger with istio I’d be very happy to know about your setup, and I’m very happy to add any more information here as well.