gRPC connection between two different meshes is reset

(Cross-posted from my question on Stack Overflow.)
I have two EKS clusters (Kubernetes v1.18), each running its own Istio mesh (v1.9.0).

I have a Thanos deployment in cluster A and a Prometheus deployment in cluster B, with the Thanos sidecar running alongside Prometheus. The goal is for Thanos in cluster A to query the sidecars in remote clusters through an internal load balancer (Classic ELB), so that queries are proxied to each cluster. Block persistence using S3 or similar is out of scope for this issue.

The Gateway, VirtualService and Service resources are in place in cluster B. When I run Thanos locally on a machine connected to that network, it connects to the sidecars in cluster B over gRPC successfully.

In cluster A I have created a ServiceEntry for cluster B's FQDN. DNS resolution works and routing looks correct, but the Thanos deployment in cluster A still can't connect to cluster B.

The Istio sidecar of the source workload (Thanos, in cluster A) shows that the connection is being reset:

[2021-02-26T14:41:03.509Z] "POST /thanos.Store/Info HTTP/2" 0 - http2.remote_reset - "-" 5 0 4998 - "-" "grpc-go/1.29.1" "50912787-d528-994f-b8ad-78dd42081fea" "thanos.dev.integrations.internal.fqdn:10901" "-" - - 172.20.65.175:10901 172.30.9.174:37594 - default

I don't see the incoming request in either of cluster B's ingress gateways (there is a public one and a private one; I checked both to be sure).

I have tried:

  • Forcing the HTTP/1.1 to HTTP/2 upgrade with a DestinationRule
  • Disabling TLS with a DestinationRule
  • Excluding the private load balancer's CIDR range so the traffic bypasses the sidecar proxy
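For reference, a sketch of what those attempts can look like in cluster A, assuming the ServiceEntry host shown below. The DestinationRule name and the CIDR value are hypothetical placeholders, not the actual values used:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: thanos-integrations-dev        # hypothetical name
  namespace: thanos
spec:
  host: thanos.dev.integrations.internal.fqdn
  trafficPolicy:
    tls:
      mode: DISABLE                    # do not originate TLS towards the ELB
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE       # upgrade HTTP/1.1 connections to HTTP/2
---
# Fragment of the Thanos Deployment in cluster A (pod template only).
spec:
  template:
    metadata:
      annotations:
        # Outbound traffic to this range skips the Envoy sidecar entirely.
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.0.0.0/24"   # placeholder CIDR
```

The annotation only takes effect on newly created pods, since the sidecar's iptables rules are set up at pod start.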

Resources (Cluster A)

ServiceEntry:

---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: thanos-integrations-dev
  namespace: thanos
spec:
  hosts:
  - thanos.dev.integrations.internal.fqdn
  location: MESH_EXTERNAL
  ports:
  - name: grpc-thanos-int-dev
    number: 10901
    protocol: GRPC
  resolution: DNS

Resources (Cluster B)

Gateway:

---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  annotations:
    meta.helm.sh/release-name: istio-routing-layer
    meta.helm.sh/release-namespace: istio-system
  creationTimestamp: "2021-02-25T11:37:49Z"
  generation: 3
  labels:
    app.kubernetes.io/instance: istio-routing-layer
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: istio-routing-layer
    app.kubernetes.io/version: 0.0.1
    helm.sh/chart: istio-routing-layer-0.0.1
  name: thanos
  namespace: istio-system
spec:
  selector:
    istio: internal-ingressgateway
  servers:
  - hosts:
    - thanos.dev.integrations.internal.fqdn
    port:
      name: grpc-thanos
      number: 10901
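The Gateway's `port` block above has no `protocol` field. Istio's Gateway server port accepts one (for example `GRPC`, `HTTP2` or `TCP`), so a sketch of the same server with it set explicitly, assuming plaintext gRPC on 10901 is what is intended (Helm metadata trimmed), would be:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: thanos
  namespace: istio-system
spec:
  selector:
    istio: internal-ingressgateway
  servers:
  - hosts:
    - thanos.dev.integrations.internal.fqdn
    port:
      name: grpc-thanos
      number: 10901
      protocol: GRPC   # assumption: plaintext gRPC; TLS would also need a tls block
```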

VirtualService:

---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  annotations:
    meta.helm.sh/release-name: istio-routing-layer
    meta.helm.sh/release-namespace: istio-system
  creationTimestamp: "2021-02-25T11:37:49Z"
  generation: 3
  labels:
    app.kubernetes.io/instance: istio-routing-layer
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: istio-routing-layer
    app.kubernetes.io/version: 0.0.1
    helm.sh/chart: istio-routing-layer-0.0.1
spec:
  gateways:
  - thanos
  hosts:
  - thanos.dev.integrations.internal.fqdn
  http:
  - route:
    - destination:
        host: thanos-sidecar.prometheus.svc.cluster.local
        port:
          number: 10901

Service:

---
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus-thanos-istio
    meta.helm.sh/release-namespace: prometheus
  creationTimestamp: "2021-02-25T14:31:02Z"
  labels:
    app.kubernetes.io/instance: prometheus-thanos-istio
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-thanos-istio
    app.kubernetes.io/version: 0.0.1
    helm.sh/chart: prometheus-thanos-istio-0.0.1
spec:
  clusterIP: None
  ports:
  - name: grpc-thanos
    port: 10901
    protocol: TCP
    targetPort: grpc
  selector:
    app: prometheus
    component: server
  sessionAffinity: None
  type: ClusterIP

Hi, I ran into a similar issue and still haven't found a solution (I'm getting 200 NR errors). It looks like your Gateway listens on port 10901; have you exposed that port on your ELB as well? The Istio ingress gateway doesn't expose that port by default.
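A sketch of how port 10901 could be added to the internal ingress gateway's Service through IstioOperator, assuming the gateway was installed that way and carries the `istio: internal-ingressgateway` label used by the Gateway selector. The component name is hypothetical. As far as I know, `k8s.service.ports` replaces the default port list instead of extending it, so every port still in use (such as the 15021 status port) has to be listed again:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
    - name: istio-internal-ingressgateway   # hypothetical component name
      enabled: true
      label:
        istio: internal-ingressgateway       # matches the Gateway selector
      k8s:
        serviceAnnotations:
          # Ask the AWS cloud provider for an internal (private) load balancer.
          service.beta.kubernetes.io/aws-load-balancer-internal: "true"
        service:
          ports:
          - name: status-port
            port: 15021
            targetPort: 15021
          - name: grpc-thanos
            port: 10901
            targetPort: 10901
```

With a LoadBalancer Service, the AWS cloud provider should then add a matching 10901 listener to the Classic ELB.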