Connections return a TLS error after running for a few days

Hi, I am having a problem with Istio in my current production setup and need your help troubleshooting it.

Background:

I am running Istio 1.1.7 in all our environments on Kubernetes 1.12.7 (Amazon EKS), with mTLS enabled on the application namespace and SDS enabled for both the ingress gateway and the sidecars.
There is no circuit breaker and no custom root CA for Citadel.

Problem:

At first, all services in the cluster work fine: connections from the ingress controller reach the services and return correctly.

But after a while (days or weeks; I haven't been able to find a pattern), all connections from the ingress to the services return 503 with response flags UF,URX.
There are logs in the istio-proxy container of the ingress pod, but no logs in the istio-proxy container of the upstream service.
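
For reference, this is how I read the two response flags. The sketch below is only an illustration: the `RESPONSE_FLAGS` table and the `explain_flags` helper are names I made up, the descriptions are paraphrased from the Envoy access log documentation, and the table covers only the two flags I actually see.

```python
# Meanings paraphrased from the Envoy access log documentation.
# Only the two flags seen in my logs are listed; this is not a full table.
RESPONSE_FLAGS = {
    "UF": "upstream connection failure",
    "URX": "upstream retry limit (HTTP) or max connect attempts (TCP) reached",
}


def explain_flags(raw):
    """Split a comma-separated response_flags value and describe each flag."""
    explained = []
    for flag in raw.split(","):
        flag = flag.strip()
        if not flag or flag == "-":
            continue  # "-" means no flag was set
        meaning = RESPONSE_FLAGS.get(flag, "not in this table")
        explained.append(f"{flag}: {meaning}")
    return explained
```

Calling `explain_flags("UF,URX")` suggests the gateway could not establish the upstream connection at all and then ran out of retries, which would fit with nothing being logged on the upstream side.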

An example log entry (sorry for the format, I pulled it out of Elasticsearch):

"stream_name": "istio-ingressgateway-76749b4bb4-z6n78",
"istio_policy_status": "-",
"bytes_sent": "91",
"upstream_cluster": "outbound|8080||frontend.services.svc.cluster.local",
"downstream_remote_address": "172.23.24.174:30690",
"path": "/user",
"authority": "prod.example.com",
"protocol": "HTTP/1.1",
"upstream_service_time": "-",
"upstream_local_address": "-",
"duration": "69",
"downstream_local_address": "172.23.24.189:443",
"response_code": "503",
"user_agent": "Mozilla/5.0 (Linux; Android 8.0.0) ...",
"response_flags": "UF,URX",
"start_time": "2019-06-03T13:26:06.617Z",
"method": "GET",
"request_id": "320037db-601b-9c52-861f-bwoeifwoiegi",
"upstream_host": "172.23.24.143:80",
"x_forwarded_for": "218.186.146.112,172.23.24.174",
"requested_server_name": "prod.example.com",
"bytes_received": "0",

I tried enabling debug logging in the proxy sidecar with:

curl -XPOST 'localhost:15000/logging?connection=debug'

Then I found this in the istio-proxy container of the ingress controller:

[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:644] [C79846] connecting to 172.23.14.229:80
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:517] [C79846] connected
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:653] [C79846] connection in progress
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.883][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.883][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 1
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C79846] TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C79846] closing socket: 0

So it looks like there is some problem with the TLS certificate. The certs in istio-ca-secret and istio.istio-ingressgateway-service-account look correct and have not expired yet. The same goes for the internal certificates of my upstream services.
As far as I can tell, this only happens when the service pods run for a few days without being restarted or redeployed with a new version.
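
Since SDS is enabled, the certificates Envoy holds in memory may differ from what is stored in the Kubernetes secrets, so it is probably worth checking what the proxy itself reports. Below is a minimal sketch, assuming the Envoy admin endpoint `/certs` on port 15000 returns JSON with a `certificates` list whose entries contain `ca_cert` and `cert_chain` arrays with `path` and `days_until_expiration` fields. I have not verified that shape against the Envoy build shipped with Istio 1.1.7, and the helper names `fetch_certs` and `days_left` are my own.

```python
import json
import urllib.request

# Envoy admin port, the same one used for the /logging call.
ADMIN_CERTS_URL = "http://localhost:15000/certs"


def fetch_certs(url=ADMIN_CERTS_URL):
    """Fetch and parse the admin /certs output (assumed to be JSON)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def days_left(certs):
    """Return (path, days_until_expiration) for every cert Envoy reports."""
    rows = []
    for entry in certs.get("certificates", []):
        for kind in ("ca_cert", "cert_chain"):
            for cert in entry.get(kind, []):
                # The value is assumed to arrive as a string; -1 marks "missing".
                days = int(cert.get("days_until_expiration", -1))
                rows.append((cert.get("path", "<unknown>"), days))
    return rows
```

With the pod's admin port 15000 forwarded to my machine, `days_left(fetch_certs())` should list each loaded certificate and the days remaining. A value of 0 on a long-running pod would point at certificate rotation not reaching the proxy.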

I also saw another instance of the problem, but this time the logs were in the upstream service’s istio-proxy container, and the TLS error is different from the one in the ingress controller:

[2019-06-04 01:18:58.029][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 2
[2019-06-04 01:18:58.029][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 2
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 1
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C400] TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C400] closing socket: 0
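
To make sense of the numeric codes in the two `TLS error:` lines, I tried unpacking them. This is a rough sketch, assuming Envoy prints BoringSSL's packed error value, where the library number sits in the top byte and the reason code in the low 12 bits, and that reason codes of 1000 and above correspond to a TLS alert received from the peer. The function name `decode_tls_error` is my own.

```python
# Assumption: BoringSSL packs errors as (library << 24) | reason.
ERR_LIB_SSL = 16             # library number for "SSL routines"
SSL_AD_REASON_OFFSET = 1000  # reason = 1000 + alert number for received alerts


def decode_tls_error(packed):
    """Split a packed error code into library, reason, and peer alert number."""
    lib = (packed >> 24) & 0xFF
    reason = packed & 0xFFF
    if reason >= SSL_AD_REASON_OFFSET:
        alert = reason - SSL_AD_REASON_OFFSET  # TLS alert sent by the peer
    else:
        alert = None  # error raised locally, not by a received alert
    return {"lib": lib, "reason": reason, "peer_alert": alert}
```

For 268436501 (seen on the ingress) this gives library 16, reason 1045, alert 45, and TLS alert 45 is `certificate_expired`. For 268435581 (seen on the upstream) it gives library 16, reason 125, and no alert. If this reading is right, the ingress error is an alert sent back by the upstream sidecar, meaning the upstream considers the certificate presented by the ingress gateway expired, while the second error is a verification failure raised locally on the upstream side.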

I am not sure what actually happened here. The Citadel logs, node agent logs, and everything else looked normal at that point in time.

Please let me know if there are any other logs/config you need to troubleshoot the problem.

Thanks

Update 1: clarified the mTLS config