Istio 1.6 strange behavior after upgrade (traffic disruption)

Hi all,

I would appreciate any help in understanding the problem and getting my cluster working again.

I have an on-premise K8s 1.15.7 cluster with about 120 workloads and about 10 CronJobs, each starting at an interval of 5 to 30 minutes. I have been running Istio since 0.8, periodically migrating to new versions; the last working version was 1.5.1.

I decided to migrate to 1.6.1: I removed 1.5.1 (which had been installed with Helm), dropped all Istio CRDs, and installed 1.6.1 with istioctl. 1.6.1 was installed with the default profile and these minor changes:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY
    accessLogFile: "/dev/stdout"
  components:
    pilot:
      k8s:
        replicaCount: 2
        hpaSpec:
          minReplicas: 2
    proxy:
      k8s:
        resources:
          requests:
            cpu: 10m
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        env:
          - name: ISTIO_META_ROUTER_MODE
            value: "sni-dnat"
        service:
          type: NodePort
          ports:
            - port: 15021
              targetPort: 15021
              name: status-port
            - port: 80
              targetPort: 8080
              name: http2
            - port: 443
              targetPort: 8443
              name: https
              nodePort: 31390
            - port: 15443
              targetPort: 15443
              name: tls
        hpaSpec:
          maxReplicas: 5
          minReplicas: 2

Since then I have had a problem with service-to-service connectivity. A small number of workloads cannot connect to other services. Some services start working after 4-5 minutes, but others still cannot connect after 1 hour and multiple restarts of the primary container.

For example:

On startup the service makes this call: Sending HTTP request "POST" http://platform-auth-sts.dmz.svc.cluster.local/connect/token

Then Envoy in this pod logs: "- - -" 0 UH "-" "-" 0 0 0 - "-" "-" "-" "-" "-" - - 10.105.219.37:80 10.244.4.1:40756 - -
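
For reading a line like this, here is a minimal Python sketch that pulls out the Envoy response flag and maps it to a short description. The flag table is only a small subset paraphrased from the Envoy access-log documentation, and the function names (parse_flags, explain) are hypothetical. It assumes the line has no leading timestamp, as in the paste above, so the flags are the third token; it also normalizes typographic quotes in case the line was copied through software that converts them.

```python
import shlex

# A few Envoy response flags, paraphrased from the Envoy access-log docs.
RESPONSE_FLAGS = {
    "UH": "no healthy upstream hosts in the upstream cluster",
    "UF": "upstream connection failure",
    "UO": "upstream overflow (circuit breaking)",
    "NR": "no route configured",
    "UT": "upstream request timeout",
}


def parse_flags(line):
    """Return the response flags found in one access-log line."""
    # Undo typographic quotes that forum software may have substituted.
    normalized = line.replace("\u201c", '"').replace("\u201d", '"')
    tokens = shlex.split(normalized)
    # Assumption: no leading [timestamp], so the tokens are
    # request line, response code, response flags, ...
    flags = tokens[2]
    return [] if flags == "-" else flags.split(",")


def explain(line):
    """Map each flag to a short description, or 'unknown flag'."""
    return {f: RESPONSE_FLAGS.get(f, "unknown flag") for f in parse_flags(line)}


LINE = ('"- - -" 0 UH "-" "-" 0 0 0 - "-" "-" "-" "-" "-" '
        '- - 10.105.219.37:80 10.244.4.1:40756 - -')

if __name__ == "__main__":
    print(explain(LINE))
```

For the line above this reports UH, which Envoy documents as "no healthy upstream hosts".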

Other services, and two of our CronJobs that start every 5 minutes, communicate with this endpoint without problems.

Target Service Definition and Endpoints from K8s:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2019-01-22T09:20:22Z"
  labels:
    app: platform-auth-sts
    chart: platform-auth-sts-0.4.20
    heritage: Tiller
    release: platform-auth-sts
  name: platform-auth-sts
  namespace: dmz
  resourceVersion: "132316717"
  selfLink: /api/v1/namespaces/dmz/services/platform-auth-sts
  uid: f1a80a1f-1e26-11e9-8395-000c29cb8c62
spec:
  clusterIP: 10.105.219.37
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: platform-auth-sts
    release: platform-auth-sts
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2020-06-11T10:36:26Z"
  creationTimestamp: "2019-01-22T09:20:22Z"
  labels:
    app: platform-auth-sts
    chart: platform-auth-sts-0.4.20
    heritage: Tiller
    release: platform-auth-sts
  name: platform-auth-sts
  namespace: dmz
  resourceVersion: "134288719"
  selfLink: /api/v1/namespaces/dmz/endpoints/platform-auth-sts
  uid: f1a9b57d-1e26-11e9-8395-000c29cb8c62
subsets:
- addresses:
  - ip: 10.244.4.35
    nodeName: k8s-node1.abc
    targetRef:
      kind: Pod
      name: platform-auth-sts-677fcf79db-jlzhh
      namespace: dmz
      resourceVersion: "134288717"
      uid: 8b574c2b-8dbd-4463-a8a7-a2275a05130d
  - ip: 10.244.5.54
    nodeName: k8s-node3.abc
    targetRef:
      kind: Pod
      name: platform-auth-sts-677fcf79db-cj8h7
      namespace: dmz
      resourceVersion: "134288520"
      uid: 2f599029-c0a5-42aa-93c1-87062dfa6626
  ports:
  - name: http
    port: 80
    protocol: TCP

istioctl output

Endpoints:

10.244.4.35:80  HEALTHY  OK   outbound|80||platform-auth-sts.dmz.svc.cluster.local
10.244.5.54:80  HEALTHY  OK   outbound|80||platform-auth-sts.dmz.svc.cluster.local
[
    {
        "name": "outbound|80||platform-auth-sts.dmz.svc.cluster.local",
        "addedViaApi": true,
        "hostStatuses": [
            {
                "address": {
                    "socketAddress": {
                        "address": "10.244.4.35",
                        "portValue": 80
                    }
                },
                "stats": [
                    {
                        "name": "cx_connect_fail"
                    },
                    {
                        "name": "cx_total"
                    },
                    {
                        "name": "rq_error"
                    },
                    {
                        "name": "rq_success"
                    },
                    {
                        "name": "rq_timeout"
                    },
                    {
                        "name": "rq_total"
                    },
                    {
                        "type": "GAUGE",
                        "name": "cx_active"
                    },
                    {
                        "type": "GAUGE",
                        "name": "rq_active"
                    }
                ],
                "healthStatus": {
                    "edsHealthStatus": "HEALTHY"
                },
                "weight": 1,
                "locality": {}
            },
            {
                "address": {
                    "socketAddress": {
                        "address": "10.244.5.54",
                        "portValue": 80
                    }
                },
                "stats": [
                    {
                        "name": "cx_connect_fail"
                    },
                    {
                        "name": "cx_total"
                    },
                    {
                        "name": "rq_error"
                    },
                    {
                        "name": "rq_success"
                    },
                    {
                        "name": "rq_timeout"
                    },
                    {
                        "name": "rq_total"
                    },
                    {
                        "type": "GAUGE",
                        "name": "cx_active"
                    },
                    {
                        "type": "GAUGE",
                        "name": "rq_active"
                    }
                ],
                "healthStatus": {
                    "edsHealthStatus": "HEALTHY"
                },
                "weight": 1,
                "locality": {}
            }
        ]
    }
]
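
To make a dump like this easier to scan, a small Python sketch can reduce it to one row per endpoint. The helper name summarize_hosts and the trimmed SAMPLE are hypothetical; the sketch also assumes that a stat entry without a "value" key means the counter is zero (my understanding is that zero values are omitted from this JSON, but I have not confirmed that).

```python
import json


def summarize_hosts(clusters_json):
    """Return one dict per endpoint with address, health and key counters."""
    rows = []
    for cluster in json.loads(clusters_json):
        for host in cluster.get("hostStatuses", []):
            sock = host["address"]["socketAddress"]
            # Assumption: a stat without a "value" key is zero.
            stats = {s["name"]: int(s.get("value", 0))
                     for s in host.get("stats", [])}
            health = host.get("healthStatus", {})
            rows.append({
                "cluster": cluster["name"],
                "endpoint": f'{sock["address"]}:{sock["portValue"]}',
                "health": health.get("edsHealthStatus", "UNKNOWN"),
                "cx_total": stats.get("cx_total", 0),
                "rq_total": stats.get("rq_total", 0),
            })
    return rows


# Trimmed copy of the first host from the dump above.
SAMPLE = """
[
  {
    "name": "outbound|80||platform-auth-sts.dmz.svc.cluster.local",
    "addedViaApi": true,
    "hostStatuses": [
      {
        "address": {"socketAddress": {"address": "10.244.4.35", "portValue": 80}},
        "stats": [{"name": "cx_total"}, {"name": "rq_total"}],
        "healthStatus": {"edsHealthStatus": "HEALTHY"},
        "weight": 1,
        "locality": {}
      }
    ]
  }
]
"""

if __name__ == "__main__":
    for row in summarize_hosts(SAMPLE):
        print(row)
```

If the zero-value assumption holds, cx_total and rq_total of 0 on both endpoints would suggest that this sidecar has never opened a connection through this cluster, even though both hosts are reported HEALTHY.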

Clusters:

SERVICE FQDN                             PORT      SUBSET         DIRECTION     TYPE
platform-auth-sts.dmz.svc.cluster.local  80        -              outbound      EDS