Istio mTLS strange behavior (bug?)

Currently, we are working on a new GKE setup for our application: GKE (1.16) with Workload Identity and Istio OSS (1.6.8). We are trying to apply a STRICT mTLS policy. It works fine for most microservices, except a few that call another microservice during startup. The caller receives a TCP RST, because it tries to connect without mTLS (plain-text HTTP):

debug envoy connection [external/envoy/source/extensions/transport_sockets/tls/] [C77289] TLS error: 268435612:SSL routines:OPENSSL_internal:HTTP_REQUEST

I thought we were affected by that bug, but… it’s not true, because I’m able to make the same call via curl successfully (for test purposes, I added a curl command before `java -jar app.jar` in the entrypoint). Moreover, if I disable the STRICT mTLS mode during startup and re-apply the STRICT policy after successful application initialization, the app is able to make the same call to the same destination microservice, which is totally weird.

Both microservices are Java 11 Spring Boot apps; ReactorNetty/0.9.5.RELEASE is used for making the call.

We have an old cluster (GKE 1.14 without Workload Identity) with Istio 1.1.17, and there it works fine.

P.S. I’ve been struggling with this issue without any significant progress for the whole last week, so any ideas/advice are welcome.

I’ve looked into the x_forwarded_client_cert headers (only the successful connection is logged in the Envoy logs), and the picture is the following:
"x_forwarded_client_cert": "By=spiffe://cluster.local/ns/dst-qa/sa/iden-dst-qa-us-west1;Hash=2c63516e6f040e774e5d6b4ca42016f587dc831a3b59bcd031ae5661c00fb2b2;Subject="";URI=spiffe://cluster.local/ns/src-qa/sa/iden-src-qa-us-west1",

The connection from another source pod to the same destination:

"x_forwarded_client_cert": "By=spiffe://cluster.local/ns/dst-qa/sa/iden-dst-qa-us-west1;Hash=307a06d8fbbff3a23cce3fafa71890fdbb56e5557acdf64811b24916fe320956;Subject="";URI=spiffe://cluster.local/ns/src-dev/sa/iden-src-dev-us-west1",

The SPIFFE URI is the same, but the hash is different, so I hope this should not be an issue. Each connection uses its own certificate(?)

The most interesting thing is the tcpdump output. During the failed connection, we see the following:

So the packet is sent twice. The problem is with packet 2078: it is sent as plain text, but the target expects mTLS. At the same time, packet 2074 looks like a valid one, after which communication between the pods should be established.

The situation with the successful connection is also not entirely clear to me:

From my understanding, the mTLS communication is performed within packets 288-301, but… why can I see a plaintext response (packet 302)?

P.S. HTTP 400 is a valid response that I expect as a result of my request.

Hi, may I know if you have applied both the STRICT mTLS policy and a DestinationRule? Could you share a sample of the mTLS config you have applied?

Hi @jtrbs,
According to the documentation, the default mTLS policy is managed with the following PeerAuthentication manifest:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT

and a DestinationRule is no longer needed starting from 1.4.

Yes, Istio 1.5+ uses auto mTLS by default. If both the client and the server have sidecars, they will establish an mTLS connection. If one of them does not have a sidecar, it falls back to plain text.
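If you ever want to pin the behavior explicitly instead of relying on auto mTLS, a DestinationRule can force client-side mTLS for a given host. A minimal sketch (the host name here is hypothetical; adjust to your service):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: dst-mtls
  namespace: dst-qa
spec:
  # Hypothetical destination service host
  host: dst-svc.dst-qa.svc.cluster.local
  trafficPolicy:
    tls:
      # Use Istio-provisioned workload certificates for mTLS
      mode: ISTIO_MUTUAL
```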

But… in our case, sidecars are in place for both microservices. This issue looks very similar, but it was fixed almost a year ago.

The issue is resolved now.
After examining the debug logs from the caller pod, the following message was found:

[2020-09-11T17:50:48.061Z] "- - -" 0 - "-" "-" 402 0 3885 - "-" "-" "-" "-" "" PassthroughCluster - 

The key is PassthroughCluster. It means that the connection was not handled by any of the configured Envoy routes and was handled by the default PassthroughCluster virtual cluster instead.

Disabling PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND solves the issue, but… a potential downside of this solution is that some of the telemetry metrics on the client side can be lost.
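For reference, this flag can be set on the control plane, e.g. via an IstioOperator overlay. A sketch, assuming an operator-based install (adapt to however you deploy Istio):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        # Disable outbound protocol sniffing in istiod
        - name: PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND
          value: "false"
```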

A more correct solution turned out to be increasing protocolDetectionTimeout to 5s (the default is 100ms). Now the apps operate correctly in any phase.
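The timeout lives in the mesh config; a minimal sketch via IstioOperator, using the values from this thread (verify against your installation method):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    # Give Envoy more time to sniff the protocol before
    # falling back (default is 100ms)
    protocolDetectionTimeout: 5s
```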

Unfortunately, the root cause is still not clear.

P.S. The solution was provided by the GCP Support Team.

Related github issue: