Citadel rejects requests from mesh expansion node

#1

I’m currently trying to get istio-1.1.4 up and running with mesh expansion and mTLS enabled. My main issue for now is making a host-based service reachable from within a k8s cluster.
Without mTLS everything looked fine. With mTLS the service was no longer reachable, and the logs of the istio-auth-node-agent showed the following:

istio-node-agent-start.sh[7992]: 2019-05-02T14:23:25.983781Z        info        pickfirstBalancer: HandleSubConnStateChange: 0xc42003a0d0, CONNECTING
istio-node-agent-start.sh[7992]: 2019-05-02T14:23:26.328077Z        info        grpc: addrConn.createTransport failed to connect to {istio-citadel:8060 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: read tcp 10.90.22.53:55980->10.86.6.81:8060: read: connection reset by peer". Reconnecting...

Citadel (the service being contacted here) logs the following:

2019-05-02T14:23:26.327519Z info grpc: Server.Serve failed to complete security handshake from "100.64.0.100:40098": remote error: tls: unknown certificate

The certificates in /etc/certs were fetched from the istio.default secret of the namespace the expansion node is meant to join, following the mesh expansion documentation.
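For anyone reproducing this: the documented procedure boils down to base64-decoding three keys of the istio.default secret and writing them to /etc/certs. A minimal Python sketch of that step, assuming the secret was dumped with `kubectl -n <namespace> get secret istio.default -o json` and parsed into a dict; the function names are hypothetical, and the code only decodes and writes the files, it does not validate the certificates.

```python
import base64
from pathlib import Path

# The three files the mesh expansion docs copy from the istio.default
# secret into /etc/certs on the VM.
CERT_KEYS = ("root-cert.pem", "cert-chain.pem", "key.pem")


def extract_istio_certs(secret: dict) -> dict:
    """Return {filename: PEM text} for the three cert files.

    `secret` is the parsed JSON of the istio.default secret; Kubernetes
    stores secret values base64-encoded under the `data` key.
    """
    data = secret.get("data", {})
    missing = [key for key in CERT_KEYS if key not in data]
    if missing:
        raise KeyError(f"istio.default secret is missing: {missing}")
    return {key: base64.b64decode(data[key]).decode("ascii") for key in CERT_KEYS}


def write_certs(certs: dict, directory: str = "/etc/certs") -> None:
    """Write the decoded PEM files; the private key is made owner-readable only."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    for name, pem in certs.items():
        path = target / name
        path.write_text(pem)
        if name == "key.pem":
            path.chmod(0o600)
```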

Is anybody else seeing this, or does anyone have an idea how to debug it further or what I am doing wrong?

#2

I made some progress:
It seems to be caused by enabling controlPlaneSecurity, which is disabled by default in the helm chart but assumed by the mesh expansion tutorial. In that case the helm chart explicitly sets a DestinationRule with tls.mode: DISABLE for Pilot, but not for Citadel. If I add such a rule for Citadel, connections to Citadel work again.
Now istio-auth-node-agent has a working connection to the control plane, but mesh expansion as a whole is still broken: I cannot get a pod to send any traffic to my expansion node (judging by iptables on the target node).
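A sketch of the extra Citadel rule described above, generated from Python so it can be piped into `kubectl apply -f -` (kubectl accepts JSON manifests). Assumptions: Citadel is exposed as the istio-citadel service in istio-system, and the rule name istio-citadel is made up. The intent is that this host-specific rule applies to Citadel instead of the mesh-wide `*.local` ISTIO_MUTUAL default, leaving every other host untouched.

```python
import json


def citadel_disable_tls_rule(namespace: str = "istio-system") -> dict:
    """DestinationRule turning off sidecar mTLS towards Citadel.

    Assumptions: Citadel is exposed as the istio-citadel service in
    `namespace`; the rule name "istio-citadel" is arbitrary.
    """
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "DestinationRule",
        "metadata": {"name": "istio-citadel", "namespace": namespace},
        "spec": {
            # Fully qualified, so it cannot be misread as a short name.
            "host": f"istio-citadel.{namespace}.svc.cluster.local",
            # Meant to take precedence over the *.local ISTIO_MUTUAL default for this host only.
            "trafficPolicy": {"tls": {"mode": "DISABLE"}},
        },
    }


if __name__ == "__main__":
    # Usage: python citadel_rule.py | kubectl apply -f -
    print(json.dumps(citadel_disable_tls_rule(), indent=2))
```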

#3

@incfly, do you see the same behavior when having a pod talk to the expansion node in the mesh expansion tutorial?

#4

More progress:
By carefully choosing where to use short names and where to use FQDNs in the ServiceEntry and DestinationRule definitions (and eventually dropping the DestinationRule altogether, since mTLS is enabled anyway by the default rule for *.local names), my pods are now reachable. In some cases the Envoy config could not be generated at all, because the names in the ServiceEntry, the DestinationRule and the service/endpoint registered for the expansion node via istioctl collided.
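The naming pitfall above can be made concrete. This sketch assumes Istio's documented rule that a short host name in config is interpreted relative to the namespace of the config object (not of the service), and, as far as I know from how Istio 1.x qualifies config hosts, that any host containing a dot is taken as already qualified. The lint function and its tuple format are hypothetical; Istio itself does not ship this check.

```python
from collections import defaultdict

# Kinds that *define* a host; a DestinationRule only configures one.
SERVICE_KINDS = {"Service", "ServiceEntry"}


def resolve_host(host: str, namespace: str, domain: str = "cluster.local") -> str:
    """Qualify a config host the way Istio 1.x does (assumed behaviour).

    A bare name is expanded with the namespace of the config object, not
    of the target service; anything containing a dot, and the literal "*",
    is taken as already fully qualified.
    """
    if host == "*" or "." in host:
        return host
    return f"{host}.{namespace}.svc.{domain}"


def wildcard_matches(pattern: str, host: str) -> bool:
    """True if a rule host such as '*.local' covers a concrete FQDN."""
    if pattern.startswith("*"):
        return host.endswith(pattern[1:])
    return pattern == host


def duplicate_service_hosts(entries):
    """Find FQDNs defined by more than one Service/ServiceEntry.

    `entries` is an iterable of (kind, name, host, namespace) tuples.
    Returns {fqdn: [(kind, name), ...]} for each host defined twice or more.
    """
    owners = defaultdict(list)
    for kind, name, host, namespace in entries:
        if kind in SERVICE_KINDS:
            owners[resolve_host(host, namespace)].append((kind, name))
    return {fqdn: objs for fqdn, objs in owners.items() if len(objs) > 1}
```

For instance, a hypothetical Service `vm-svc` registered via istioctl in namespace `vm` and a ServiceEntry for `vm-svc.vm.svc.cluster.local` both resolve to the same FQDN and get flagged, while a rule host written as `vm-svc.vm` stays `vm-svc.vm` and matches nothing.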

I also observed that I need to restart Envoy on the expansion node from time to time. This is all the more confusing because whether communication works depends on the source node, yet restarting the target makes the behavior consistent again.

#5

I’ve managed to get mTLS working between k8s and VM services.

My initial guess is that the mTLS failure is due to missing secure naming information: Pilot does not know which service account the VM services run as. The supported mechanism is to set annotations on the k8s services; I’m not sure whether we have documentation on that.
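To illustrate the secure naming mechanism: Pilot derives the expected SPIFFE identities for a service from an annotation on the k8s Service, and client sidecars then expect the endpoint's certificate to carry one of those identities. The annotation key below is, as far as I know, the one Istio 1.x Pilot reads for VM service accounts (verify it against your version); the helper names are hypothetical, and the trust domain is assumed to be the default cluster.local.

```python
import copy

# Annotation Istio 1.x Pilot reads to learn which k8s service accounts may
# run a service on VMs (assumption: verify the key for your Istio version).
KSA_ANNOTATION = "alpha.istio.io/kubernetes-serviceaccounts"


def spiffe_id(namespace: str, service_account: str, trust_domain: str = "cluster.local") -> str:
    """Istio workload identity in SPIFFE URI form."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"


def annotate_vm_service(service: dict, service_accounts: list) -> dict:
    """Return a copy of a k8s Service manifest annotated with VM service accounts."""
    annotated = copy.deepcopy(service)
    annotations = annotated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[KSA_ANNOTATION] = ",".join(service_accounts)  # comma-separated list
    return annotated


def expected_identities(service: dict, trust_domain: str = "cluster.local") -> list:
    """SPIFFE IDs a client sidecar would then expect for this service's endpoints."""
    metadata = service.get("metadata", {})
    namespace = metadata.get("namespace", "default")
    raw = metadata.get("annotations", {}).get(KSA_ANNOTATION, "")
    return [spiffe_id(namespace, sa.strip(), trust_domain) for sa in raw.split(",") if sa.strip()]
```

Without the annotation, `expected_identities` returns an empty list, which mirrors the situation described above: Pilot has no identity to check the VM's certificate against.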

Control plane mutual TLS should be used, since VMs reach Pilot and Citadel through a gateway using SNI routing plus mTLS.

Also, can you elaborate a bit on why restarting Envoy is needed on the VM? What issue occurs if you don’t restart it? This sounds like a serious bug that we should track.

#6

It seems I have solved my restarting problems on the VMs; that was an error on my side. A local Puppet run was messing with iptables, which broke the communication path.
I started another test run and will see how long it stays healthy.
I think my original issue with Citadel is still valid: I still need to disable TLS in a DestinationRule to talk to Citadel.