I have two gRPC pods in my namespace (my-ns). One pod acts as a client; the other acts as a gRPC server (on port 6001, with gRPC keepalive set). The client queries the server for health status continuously. The server has a headless service, as recommended here in the forum. So it's really a simple setup.
They worked together fine for weeks until I added this to the mix: Istio v1.14.3 + STRICT PeerAuthentication (i.e. mTLS).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: mtls
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
Everything works fine for several hours, but then, out of nowhere, the server's istio-proxy starts logging this:
Oct 6 04:09:57 mygRPCserver-6b59bf996-xlcvj istio-proxy "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 172.30.12.244:6001 172.30.209.147:33112 - -
Oct 6 04:09:57 mygRPCserver-6b59bf996-xlcvj istio-proxy "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 172.30.12.244:6001 172.30.209.147:33114 - -
Oct 6 04:09:57 mygRPCserver-6b59bf996-w2rqc istio-proxy "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 172.30.6.63:6001 172.30.209.147:33296 - -
...many more like that...
During that time window the client shows:
GRPC::Unavailable: 14:upstream connect error or disconnect/reset before headers. reset reason: connection termination
It can fail like that for 3-4 hours and then suddenly "fix" itself and go back to a healthy state. Again: nothing changed in the cluster.
I wonder why?
At first I thought: without mTLS (port-level exclude on 6001) everything works well, so maybe it is a TLS keepalive problem. I added a DestinationRule (DR) with a shorter keepalive interval:
spec:
  host: '*.mygRPCserver-headless.my-ns.svc.cluster.local'
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 30ms
        maxConnections: 10
        tcpKeepalive:
          interval: 75s
          time: 3600s
    tls:
      mode: ISTIO_MUTUAL
  workloadSelector:
    matchLabels:
      name: app
      value: mygRPCserver
It had zero effect.
Looking for more ideas …
Remember: if I exclude port 6001 from the Istio mesh, everything works well.
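For reference, a port-level mTLS exclusion like the one mentioned above might look roughly like the following. This is only a sketch under assumptions, not the exact manifest in use: the resource name (mygrpcserver-port-6001) and the pod label (app: mygRPCserver) are hypothetical placeholders. Istio only accepts portLevelMtls on a PeerAuthentication that has a workload selector, so the policy is scoped to the server pods in my-ns rather than mesh-wide.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: mygrpcserver-port-6001   # hypothetical name
  namespace: my-ns
spec:
  selector:
    matchLabels:
      app: mygRPCserver          # assumed pod label; adjust to the real one
  mtls:
    mode: STRICT                 # all other ports keep strict mTLS
  portLevelMtls:
    6001:
      mode: DISABLE              # plaintext only on the gRPC port
```

With a policy of this shape the sidecar still intercepts port 6001 but does not require mTLS there, which is different from removing the port from sidecar interception altogether.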