Multicluster CockroachDB connection timeout

I’m trying to build an insecure CockroachDB cluster across two Kubernetes tenants using Istio 1.4.5 with replicated control planes to connect them. Each tenant has three CockroachDB nodes.

So far I only get timeout errors and need some help debugging this further.
Edit: It looks like this is a TLS issue. I have the MeshPolicy set to PERMISSIVE, the TLS mode for *.local set to DISABLE, and the one for *.global set to ISTIO_MUTUAL. With these settings it works within one tenant but not across both. If I enable mTLS for everything, I get the same behavior within a single tenant as I previously got between the two.
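For reference, these are roughly the resources behind that setting, trimmed to the relevant fields (the resource names are just the ones I picked):

apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata:
  name: default
spec:
  peers:
  - mtls:
      mode: PERMISSIVE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: disable-mtls-local
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: DISABLE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: istio-mutual-global
spec:
  host: "*.global"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL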

There are some other non-headless services already deployed that communicate fine over the multicluster ingress gateway, so I guess the problem lies somewhere in my CockroachDB configuration.

Based on this issue and this discussion, I changed the port name from grpc to tcp in the headless service, the StatefulSets, and the network policies:

apiVersion: v1
kind: Service
metadata:
  name: tenant1-cockroachdb
spec:
  clusterIP: None
  ports:
  - name: tcp
    port: 26257
    protocol: TCP
    targetPort: tcp
  - name: http
    port: 8080
    protocol: TCP
    targetPort: http
  publishNotReadyAddresses: true
  type: ClusterIP
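The StatefulSet pod template uses the same port names; here is a trimmed excerpt (the rest of the spec is unchanged from the Helm chart, and the container name may differ in your chart):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tenant1-cockroachdb
spec:
  # ... rest of the chart-generated spec ...
  template:
    spec:
      containers:
      - name: db    # container name may differ in your chart
        ports:
        - containerPort: 26257
          name: tcp    # renamed from grpc so Istio treats the port as plain TCP
        - containerPort: 8080
          name: http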

I also added three ServiceEntries, one for each CockroachDB node, like this:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: tenant1-cockroachdb-0-headless
spec:
  hosts:
  - tenant1-cockroachdb-0.tenant1-cockroachdb.default.svc.cluster.local
  location: MESH_INTERNAL
  ports:
  - name: tcp
    number: 26257
    protocol: TCP
  - name: http
    number: 8080
    protocol: HTTP
  resolution: DNS

With this, CockroachDB started normally in an Istio sidecar-injected namespace and the nodes found each other:

❯ k get po
NAME                               READY   STATUS             RESTARTS   AGE
tenant1-cockroachdb-0              2/2     Running            0          50s
tenant1-cockroachdb-1              2/2     Running            0          50s
tenant1-cockroachdb-2              2/2     Running            0          50s
tenant1-cockroachdb-init-d2d5z     0/1     Completed          0          50s

Next I created matching ServiceEntries on tenant2, one for each node on tenant1:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: tenant1-cockroachdb-0
spec:
  addresses:
  - 240.0.0.16
  endpoints:
  - address: x.x.x.x
    ports:
      http: 15443
      tcp: 15443
  hosts:
  - tenant1-cockroachdb-0.tenant1-cockroachdb.default.global
  location: MESH_INTERNAL
  ports:
  - name: tcp
    number: 26257
    protocol: TCP
  - name: http
    number: 8080
    protocol: HTTP
  resolution: DNS

and the same on tenant1 for tenant2-cockroachdb-0.tenant2-cockroachdb.default.global.
I then set up three new CockroachDB nodes on tenant2 and configured them to join the tenant1 cluster via the .global hostnames.
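Roughly, the join configuration on tenant2 looks like this (a trimmed sketch, not my exact manifest; the real start command is generated by the Helm chart, so the flag layout and container name here are approximations):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tenant2-cockroachdb
spec:
  # ... rest of the chart-generated spec ...
  template:
    spec:
      containers:
      - name: db    # container name may differ in your chart
        command:
        - /bin/bash
        - -ecx
        # join list: the local tenant2 peers plus the tenant1 nodes via their .global names
        - >-
          exec /cockroach/cockroach start --insecure
          --advertise-addr=$(hostname -f)
          --join=tenant2-cockroachdb-0.tenant2-cockroachdb.default.svc.cluster.local:26257,tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257,tenant1-cockroachdb-1.tenant1-cockroachdb.default.global:26257,tenant1-cockroachdb-2.tenant1-cockroachdb.default.global:26257

Starting these nodes resulted in timeout errors: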

I200309 11:39:27.914146 91 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n?] circuitbreaker: gossip [::]:26257->tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I200309 11:39:27.914240 91 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n?] circuitbreaker: gossip [::]:26257->tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 event: BreakerTripped
[...]
W200309 11:39:41.908057 94 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 240.0.0.16:26257: i/o timeout". Reconnecting...
W200309 11:39:42.908731 94 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...

To debug this I tried to reach the web UI of one CockroachDB node on tenant1 from tenant2:

bash-5.0$ curl tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:8080
<!DOCTYPE html>
<html>
[...]
</html>

This works for all nodes.

I also checked my NetworkPolicies, which look like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/instance: tenant2-cockroachdb
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cockroachdb
    helm.sh/chart: cockroachdb-3.0.6
  name: tenant2-cockroachdb
spec:
  ingress:
  - ports:
    - port: tcp
      protocol: TCP
    - port: http
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: cockroachdb
      app.kubernetes.io/instance: tenant2-cockroachdb
      app.kubernetes.io/name: cockroachdb
  policyTypes:
  - Ingress

which should allow ingress on the named ports tcp and http, and the pods are labeled correctly.

Next I changed the log level of the tenant1-cockroachdb-0 and tenant2-cockroachdb-0 istio-proxy containers to debug, but the logs are really big and I don’t know exactly what I’m looking for; so far I have found no warnings or errors.
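For reference, the proxy log level can be raised with a pod-template annotation, roughly like this sketch (assuming the sidecar.istio.io/logLevel annotation; the pod has to be restarted to pick it up):

  template:
    metadata:
      annotations:
        # assumed annotation name; sets the log level of the injected Envoy sidecar
        sidecar.istio.io/logLevel: debug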

Any idea where I should look next?

Solution: I was running CockroachDB as UID 1337, which doesn’t work with Istio but did work locally because of the PERMISSIVE mesh policy. From the Istio pod requirements:

Application UIDs: Ensure your pods do not run applications as a user with the user ID (UID) value of 1337
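Running the CockroachDB container under any other non-root UID fixed it. A minimal sketch of the change in the StatefulSet pod template (the exact UID doesn’t matter, as long as it is not 1337, which is used by the istio-proxy sidecar and excluded from its iptables redirect):

  template:
    spec:
      securityContext:
        # any UID other than 1337 works; 1337 is reserved for the istio-proxy sidecar
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000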