Inconsistency between Istio and Kubernetes endpoints

Hi,

We’re happy Istio users, but we occasionally hit a recurring issue where we lose connectivity between applications running in the mesh after restarting one of the services (postgres-pooler in this case).

When this happens, the sidecar appears to be directing the main container’s connections to pods that no longer exist.

Here’s an example with two application pods:

  • xgckw is working - the endpoints reported by istioctl proxy-config match those from kubectl get endpointslice.
  • dhjdj is not working - its sidecar, which gets its config from a different istiod instance, is trying to connect to non-existent pods, and its endpoint list does not match the endpoints of the relevant Kubernetes Service.

$ kubectl get endpointslice postgres-pooler-x9pvh -n prod
NAME                    ADDRESSTYPE   PORTS   ENDPOINTS                  AGE
postgres-pooler-x9pvh   IPv4          5432    10.240.6.85,10.240.3.250   667d

$ istioctl proxy-status
NAME                                CLUSTER        CDS        LDS        EDS        RDS        ISTIOD                      VERSION
deployment-7f78785784-xgckw.prod    Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     istiod-6ffd54b448-v9nck     1.13.8
deployment-788b6d8c6d-dhjdj.prod    Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     istiod-6ffd54b448-9fwv4     1.13.8

$ istioctl proxy-config endpoint -n prod deployment-7f78785784-xgckw
ENDPOINT                         STATUS      OUTLIER CHECK     CLUSTER
10.0.171.251:5432                HEALTHY     OK                outbound|5432||postgres-pooler.prod.svc.cluster.local
10.240.3.250:5432                HEALTHY     OK                outbound|5432||postgres-pooler.prod.svc.cluster.local
10.240.6.85:5432                 HEALTHY     OK                outbound|5432||postgres-pooler.prod.svc.cluster.local

$ istioctl proxy-config endpoint -n prod deployment-788b6d8c6d-dhjdj
ENDPOINT                         STATUS      OUTLIER CHECK     CLUSTER
10.0.171.251:5432                HEALTHY     OK                outbound|5432||postgres-pooler.prod.svc.cluster.local
10.240.4.229:5432                HEALTHY     OK                outbound|5432||postgres-pooler.prod.svc.cluster.local
10.240.6.152:5432                HEALTHY     OK                outbound|5432||postgres-pooler.prod.svc.cluster.local

10.240.4.229 and 10.240.6.152 are not in the EndpointSlice and correspond to pods that no longer exist.
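
For reference, this is roughly how we check which addresses are actually live: looking for the stale IPs among running pods, and dumping what the EndpointSlice currently lists (the label selector and jsonpath below assume the standard kubernetes.io/service-name labelling on EndpointSlices):

$ kubectl get pods -n prod -o wide | grep -E '10\.240\.4\.229|10\.240\.6\.152'
$ kubectl get endpointslice -n prod -l kubernetes.io/service-name=postgres-pooler \
    -o jsonpath='{.items[*].endpoints[*].addresses[*]}'

Neither of the stale addresses shows up in either command’s output.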

Does anyone have any ideas about how to debug this further or correct our setup? Any help would be greatly appreciated.
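
In case it’s relevant, the next things we were planning to look at are the per-pod diff from proxy-status (which, as far as we understand it, compares Envoy’s applied config with what its istiod would currently send) and the endpoint list filtered down to just the affected cluster:

$ istioctl proxy-status deployment-788b6d8c6d-dhjdj.prod
$ istioctl proxy-config endpoint -n prod deployment-788b6d8c6d-dhjdj \
    --cluster "outbound|5432||postgres-pooler.prod.svc.cluster.local"

If there is a better way to see what istiod-6ffd54b448-9fwv4 itself thinks the endpoints for this service are, pointers would be very welcome.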