Ingress Gateway Bypassing VirtualService

I have a simple Istio setup and I have a link I don’t understand in my Kiali dashboard.

Here’s the graph, the link I don’t understand is the one from my Ingress gateway (istio-ingressgateway-carbon) to the unknown carbon.carbon.svc.cluster.local.

As you can see from the graph, nearly all the requests from the outside (~1000/s) go through the VirtualService as expected. However, a few requests (~1 req every 10 s) go to the unknown service (at least from Kiali’s point of view), and I have no idea where they come from.

I tried changing the frequency of the liveness and readiness probes, but the frequency of the requests to the unknown service did not change, so it’s (probably) not that.
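
For context, by changing the frequency I mean tuning periodSeconds on the probes, roughly like this (the values here are just illustrative):

# Raise the probe interval on the carbon deployment (default is 10s).
kubectl -n carbon patch deployment carbon --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds",  "value": 60},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds", "value": 60}
]'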

Does anybody have an idea where those requests come from, or how to find their source?

I’ll dump the relevant (shortened) manifests here; please don’t hesitate to ask for more if you think they would help.

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: carbon
  namespace: carbon
spec:
  selector:
    matchLabels:
      app: carbon
      type: api
  template:
    metadata:
      labels:
        app: carbon
        type: api
    spec:
      serviceAccountName: carbon
      containers:
      - name: carbon
        ports:
        - name: http-carbon
          containerPort: 8990
        livenessProbe:
          httpGet:
            scheme: HTTP
            path: /metrics
            port: 8990
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            scheme: HTTP
            path: /metrics
            port: 8990
          initialDelaySeconds: 30
# A few things like affinity, volumes, etc. have been removed

Service:

apiVersion: v1
kind: Service
metadata:
  name: carbon
  namespace: carbon
spec:
  type: NodePort
  selector:
    app: carbon
    type: api
  ports:
  - name: http-carbon
    protocol: TCP
    port: 8990
    targetPort: 8990

VirtualService:

kind: VirtualService
apiVersion: networking.istio.io/v1beta1
metadata:
  name: carbon
  namespace: carbon
spec:
  gateways:
    - carbon-gateway
  hosts:
    - full.fqdn.com
  http:
    - route:
        - destination:
            host: carbon
          weight: 100

Gateway:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: carbon-gateway
spec:
  selector:
    istio: ingressgateway-carbon
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: https-secret
      hosts:
        - full.fqdn.com

I have set up a custom ingress gateway using istioctl manifest generate with the following customisation:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  tag: 1.6.8
  addonComponents:
    prometheus:
      enabled: true
    kiali:
      enabled: true
    grafana:
      enabled: true
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: false
      - name: istio-ingressgateway-carbon
        enabled: true
        namespace: carbon
        label:
          app: istio-ingressgateway-carbon
          istio: ingressgateway-carbon
          release: istio
        k8s:
          service:
            loadBalancerIP: <redacted IP>
            ports:
              - name: https
                port: 443
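
For completeness, I generated and applied the manifest roughly like this (the file names are mine and purely illustrative):

# Render the operator overlay above into plain Kubernetes manifests, then apply them.
istioctl manifest generate -f istio-ingressgateway-carbon.yaml > generated-manifest.yaml
kubectl apply -f generated-manifest.yaml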

Thank you for your help!

I can’t really explain those requests to carbon.carbon.svc.cluster.local. Those sorts of terminating service nodes, labeled with the FQDN, typically indicate that there was no metadata exchange, but I thought that mainly happened for TCP requests without mTLS. I’m also surprised to see requests flowing out of PassthroughCluster; that is typically only a destination, not a source.

It may be useful to look at the underlying time series being reported by Istio. I’d like to see the results of:

istio_requests_total{destination_service_name="carbon.carbon.svc.cluster.local"}

istio_requests_total{destination_service_name=~"^.*Passthrough.*$"}
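
If the Prometheus UI isn’t exposed, something along these lines should work against the bundled add-on Prometheus (the service and namespace names assume the defaults):

# Open a tunnel to the add-on Prometheus (assumes the default svc/prometheus in istio-system);
# `istioctl dashboard prometheus` does the same thing interactively.
kubectl -n istio-system port-forward svc/prometheus 9090:9090 &

# Pull the raw time series through the HTTP API.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=istio_requests_total{destination_service_name="carbon.carbon.svc.cluster.local"}'
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=istio_requests_total{destination_service_name=~"^.*Passthrough.*$"}'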

@Pengyuan_Bian, any ideas?

Looks like protocol sniffing is the culprit. If there is no server-first protocol in your cluster, can you set the protocol sniffing timeout to some large number (--set meshConfig.protocolDetectionTimeout=3600s) and see if the problem goes away?
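
For example, reusing the overlay file from above (the file name here is just illustrative), roughly:

# Regenerate the install manifests with the mesh-wide protocol detection timeout raised,
# then re-apply them.
istioctl manifest generate -f istio-ingressgateway-carbon.yaml \
  --set meshConfig.protocolDetectionTimeout=3600s | kubectl apply -f -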

Thanks for the quick answer! I’m currently not at work; I’ll send updates on the subject when I’m back, in a week or so.

Before I can test changing the protocol detection timeout, here are the results of the two Prometheus queries you asked for:

It appears changing the protocolDetectionTimeout value to 3600s did not do anything, and I still have this weird link to carbon.carbon.svc.cluster.local.
(I changed the ConfigMap value, then restarted istiod and Istio’s Prometheus, and did a rollout restart of everything in the carbon namespace.)
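
Concretely, the restart steps were roughly the following; the deployment names assume the default add-on install:

# Restart the control plane and the add-on Prometheus so they pick up the new meshConfig,
# then restart the workloads in the carbon namespace so their sidecars are recreated.
kubectl -n istio-system rollout restart deployment/istiod
kubectl -n istio-system rollout restart deployment/prometheus
kubectl -n carbon rollout restart deployment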

New information: after upgrading Istio to version 1.7, the number of connections to carbon.carbon.svc.cluster.local increased to ~2 rps (while the number of requests to our service did not change).

@Pengyuan_Bian, the telemetry captured in these queries may be interesting to you. These are all HTTP requests in istio_requests_total.

In the first (carbon.csv) we can see that there are roughly two sets of telemetry where destination_service = destination_service_name = "carbon.carbon.svc.cluster.local", with destination_service_namespace="unknown". What may be interesting there is that some have the principal set and some don’t, with slightly different response flags.

The second set (passthrough.csv) may be odd, as there are an enormous number of time series where destination_service_name="PassthroughCluster" but the destination_service differs, holding raw IP:port addresses. Also, destination_service_namespace="carbon" for the passthrough series. Is that expected?

I don’t know if it’s normal for the passthrough requests to go this way, but the context is: in a monit namespace, there is a Prometheus pod which is configured to auto-discover pods exposing metrics, and it contacts the pod IPs directly.
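
One way to sanity-check that this monit Prometheus really is the source of the passthrough series would be to group them by source workload (reusing the port-forward from earlier):

# Group the PassthroughCluster series by source workload; if the direct pod scrapes are
# the cause, the monit Prometheus should account for most of the traffic.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (source_workload, source_workload_namespace) (istio_requests_total{destination_service_name="PassthroughCluster"})'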

For the other (carbon.carbon.svc.cluster.local), I still have no clue what’s going on.

Looks like all time series with carbon.carbon.svc.cluster.local as destination service are failed requests, for various reasons.
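
A quick way to see the failure reasons at a glance is to break those series down by response code and flags against the same Prometheus endpoint as above, for example:

# Break the orphan-edge series down by response code and response flags to see why they fail.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (response_code, response_flags) (istio_requests_total{destination_service_name="carbon.carbon.svc.cluster.local"})'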

@happn_frizlab Could you dump the istio_requests_total time series which are correctly labeled? Basically I am interested in the time series that draw the edge with 800+ qps between the ingress gateway and the carbon virtual service. Also @jshaughn, can you help me understand how Kiali correctly attributes requests_total metrics to virtual services?

There you go! Hopefully it’s what you wanted; here’s the dump for the following query: istio_requests_total{destination_service_name="carbon",source_workload="istio-ingressgateway-carbon"}
https://frostland.fr/istio/carbon_virtual_service.csv

Ah, I see what happens here. As a bit of background: when the ingress gateway sends a request to the carbon service, the two proxies also exchange workload metadata with each other, which includes information like the namespace the destination is in, the workload name, labels, etc. With the destination namespace information, the source proxy can derive the destination_service_name label (carbon in your case) from the FQDN (carbon.carbon.svc.cluster.local). For the requests that do not reach the destination (the orphan edge in your graph), all peer information, including the destination namespace, is missing, so the source proxy falls back to the FQDN for the destination_service_name label.

The solution for this is tracked in https://github.com/istio/istio/issues/24302. That issue was created for peers that are not in the mesh, but the solution would also apply here.

Awesome! Glad to see the issue has already been spotted. I don’t fully understand the workaround, but if I understand correctly the issue should be fixed in 1.8, so I’ll just wait for an update to Istio :blush: