mTLS + gRPC: unstable istio-proxy state: filter_chain_not_found

I have two gRPC pods in my namespace (my-ns). One pod acts as a client, the other as a gRPC server (on port 6001, with gRPC keepalive set). The client queries the server for its health status continuously. The server has a headless Service, as recommended here in the forum, so it's really a simple setup.
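
For context, the headless Service is essentially the standard pattern. A minimal sketch (the port name and label are assumptions on my part; real names are lowercased, since Service names must be DNS labels):

  apiVersion: v1
  kind: Service
  metadata:
    name: mygrpcserver-headless
    namespace: my-ns
  spec:
    clusterIP: None            # headless: DNS resolves directly to the pod IPs
    selector:
      app: mygrpcserver
    ports:
      - name: grpc             # the "grpc" port name tells Istio to treat this as gRPC/HTTP2
        port: 6001
        targetPort: 6001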
They work together fine for weeks, until I add Istio v1.14.3 plus a STRICT PeerAuthentication (i.e. mTLS) to the mix:

  apiVersion: security.istio.io/v1beta1
  kind: PeerAuthentication
  metadata:
    name: mtls
    namespace: istio-system
  spec:
    mtls:
      mode: STRICT

All works fine for several hours, but then, out of nowhere, the server starts showing this:

Oct 6 04:09:57 mygRPCserver-6b59bf996-xlcvj istio-proxy "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 172.30.12.244:6001 172.30.209.147:33112 - -
Oct 6 04:09:57 mygRPCserver-6b59bf996-xlcvj istio-proxy "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 172.30.12.244:6001 172.30.209.147:33114 - -
Oct 6 04:09:57 mygRPCserver-6b59bf996-w2rqc istio-proxy "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 172.30.6.63:6001 172.30.209.147:33296 - - 
...many more like that...

During that time window, the client shows:

GRPC::Unavailable: 14:upstream connect error or disconnect/reset before headers. reset reason: connection termination

It can fail like that for 3-4 hours and then go back to a healthy state. Again: nothing changed in the cluster. Even worse: after hours of errors it will suddenly “fix” itself and work normally.
I wonder why?
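
(If it helps anyone look along: the filter chains the sidecar actually holds can be dumped with istioctl. With the usual sidecar setup, inbound traffic is redirected to the virtualInbound listener on 15006, and the chain matching port 6001 should show up there; the pod name below is taken from my logs:)

  istioctl proxy-config listeners mygRPCserver-6b59bf996-xlcvj -n my-ns --port 15006 -o json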

At first I thought: without mTLS (a port-level exclude on 6001) all works well, so maybe it is a TCP keepalive problem. I added a DestinationRule with a shorter keepalive interval:

  spec:
    host: '*.mygRPCserver-headless.my-ns.svc.cluster.local'
    trafficPolicy:
      connectionPool:
        tcp:
          connectTimeout: 30ms
          maxConnections: 10
          tcpKeepalive:
            interval: 75s
            time: 3600s
      tls:
        mode: ISTIO_MUTUAL
    workloadSelector:
      matchLabels:
        app: mygRPCserver

It had zero effect.
Looking for more ideas…
Remember: if I exclude port 6001 from the Istio mesh, all works well.
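
(To be precise, by “exclude” I mean a port-level override like this; the resource name and labels are illustrative:)

  apiVersion: security.istio.io/v1beta1
  kind: PeerAuthentication
  metadata:
    name: mygrpcserver-port-exclude
    namespace: my-ns
  spec:
    selector:
      matchLabels:
        app: mygrpcserver
    mtls:
      mode: STRICT
    portLevelMtls:
      6001:
        mode: DISABLE          # plaintext on 6001 only; everything else stays STRICT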

I think I found the problem: I also need a DestinationRule without the wildcard. The TCP keepalive section is not needed:

  spec:
    host: 'mygRPCserver-headless.my-ns.svc.cluster.local'
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
Not sure why I need two DestinationRules. My best guess: the wildcard host only covers the per-pod DNS names under the headless Service, not the bare service FQDN itself, so connections addressed to the service hostname need their own rule to get ISTIO_MUTUAL.