I’m trying to build an insecure CockroachDB cluster across two Kubernetes tenants, using Istio 1.4.5 with replicated control planes to connect them. Each tenant has 3 CockroachDB nodes.
So far I only get timeout errors and need some help debugging this further.
Edit: It looks like this is a TLS issue. I have MeshPolicy set to PERMISSIVE, the TLS mode for *.local set to DISABLE, and the TLS mode for *.global set to ISTIO_MUTUAL. With these settings it works within one tenant but not across both. If I enable mTLS for everything, the behavior is the same as before: it works within one tenant but not between the two.
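For reference, those TLS settings look roughly like this (the DestinationRule names are illustrative; mine come from the replicated control plane install):

apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata:
  name: default
spec:
  peers:
  - mtls:
      mode: PERMISSIVE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: default
  namespace: istio-system
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: DISABLE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: default-global
  namespace: istio-system
spec:
  host: "*.global"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL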
There are some other non-headless services already deployed that communicate fine over the multicluster ingress gateway, so I guess the problem lies somewhere in my CockroachDB configuration.
Based on this issue and this discussion, I changed the port name from grpc to tcp for the headless service, the StatefulSets and the network policies:
apiVersion: v1
kind: Service
metadata:
  name: tenant1-cockroachdb
spec:
  clusterIP: None
  ports:
  - name: tcp
    port: 26257
    protocol: TCP
    targetPort: tcp
  - name: http
    port: 8080
    protocol: TCP
    targetPort: http
  publishNotReadyAddresses: true
  type: ClusterIP
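Once the pods are up, you can confirm the sidecar treats 26257 as plain TCP after the rename with something like (assuming the default namespace):

# should report TYPE TCP, not HTTP, for port 26257
istioctl proxy-config listeners tenant1-cockroachdb-0.default --port 26257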
and added 3 ServiceEntries, one for each CockroachDB node, like:
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: tenant1-cockroachdb-0-headless
spec:
  hosts:
  - tenant1-cockroachdb-0.tenant1-cockroachdb.default.svc.cluster.local
  location: MESH_INTERNAL
  ports:
  - name: tcp
    number: 26257
    protocol: TCP
  - name: http
    number: 8080
    protocol: HTTP
  resolution: DNS
With this, CockroachDB started normally in an Istio sidecar-injected namespace and the nodes found each other:
❯ k get po
NAME                             READY   STATUS      RESTARTS   AGE
tenant1-cockroachdb-0            2/2     Running     0          50s
tenant1-cockroachdb-1            2/2     Running     0          50s
tenant1-cockroachdb-2            2/2     Running     0          50s
tenant1-cockroachdb-init-d2d5z   0/1     Completed   0          50s
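Cluster membership can be double-checked from inside a pod (the db container name comes from the Helm chart; adjust if yours differs):

# should list all three nodes as live
kubectl exec tenant1-cockroachdb-0 -c db -- cockroach node status --insecure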
Next I created matching ServiceEntries on tenant2, one for each node on tenant1:
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: tenant1-cockroachdb-0
spec:
  addresses:
  - 240.0.0.16
  endpoints:
  - address: x.x.x.x
    ports:
      http: 15443
      tcp: 15443
  hosts:
  - tenant1-cockroachdb-0.tenant1-cockroachdb.default.global
  location: MESH_INTERNAL
  ports:
  - name: tcp
    number: 26257
    protocol: TCP
  - name: http
    number: 8080
    protocol: HTTP
  resolution: DNS
and the same on tenant1 for tenant2-cockroachdb-0.tenant2-cockroachdb.default.global.
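To confirm the .global name resolves through istiocoredns to the configured address, a quick lookup from any pod on tenant2 should return 240.0.0.16 (assuming nslookup is available in the image):

kubectl exec tenant2-cockroachdb-0 -c db -- nslookup tenant1-cockroachdb-0.tenant1-cockroachdb.default.global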
I set up 3 new CockroachDB nodes on tenant2 and configured them to join the tenant1 cluster, which resulted in timeout errors:
I200309 11:39:27.914146 91 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n?] circuitbreaker: gossip [::]:26257->tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I200309 11:39:27.914240 91 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n?] circuitbreaker: gossip [::]:26257->tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 event: BreakerTripped
[...]
W200309 11:39:41.908057 94 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 240.0.0.16:26257: i/o timeout". Reconnecting...
W200309 11:39:42.908731 94 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
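For context, the tenant2 nodes are started with join addresses pointing at the .global names; simplified from the actual StatefulSet command, it looks roughly like:

cockroach start --insecure \
  --advertise-addr=$(hostname -f) \
  --join=tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:26257,tenant1-cockroachdb-1.tenant1-cockroachdb.default.global:26257,tenant1-cockroachdb-2.tenant1-cockroachdb.default.global:26257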
To debug this, I tried to reach the web UI of one CockroachDB node on tenant1 from tenant2:
bash-5.0$ curl tenant1-cockroachdb-0.tenant1-cockroachdb.default.global:8080
<!DOCTYPE html>
<html>
[...]
</html>
This works for all nodes.
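The same check for the SQL/gRPC port can be done with a raw TCP connect (assuming nc is available in the debug container), which in my case should reproduce the timeout seen in the CockroachDB logs:

nc -vz -w 5 tenant1-cockroachdb-0.tenant1-cockroachdb.default.global 26257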
I also checked my NetworkPolicies, which look like:
apiVersion: extensions/v1beta1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/instance: tenant2-cockroachdb
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cockroachdb
    helm.sh/chart: cockroachdb-3.0.6
  name: tenant2-cockroachdb
spec:
  ingress:
  - ports:
    - port: tcp
      protocol: TCP
    - port: http
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: cockroachdb
      app.kubernetes.io/instance: tenant2-cockroachdb
      app.kubernetes.io/name: cockroachdb
  policyTypes:
  - Ingress
which should allow ingress on the named ports tcp and http, and the pods are labeled correctly.
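The selector can be sanity-checked by listing pods with exactly the labels the policy matches on:

kubectl get po -l app.kubernetes.io/component=cockroachdb,app.kubernetes.io/instance=tenant2-cockroachdb,app.kubernetes.io/name=cockroachdb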
Next I changed the log level of the istio-proxy containers of tenant1-cockroachdb-0 and tenant2-cockroachdb-0 to debug. The logs are huge and I don’t know exactly what to look for, but so far I have found no warnings or errors.
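One way to raise the proxy log level is via the Envoy admin endpoint (assuming curl is present in the proxy image); grepping the output for the Envoy cluster name of the .global host narrows it down a bit:

kubectl exec tenant2-cockroachdb-0 -c istio-proxy -- curl -s -X POST 'localhost:15000/logging?level=debug'
kubectl logs tenant2-cockroachdb-0 -c istio-proxy | grep 'outbound|26257'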
Any idea where I should look next?