Issues with configuring VM mesh expansion

Hey guys,

I have set up a k8s 1.12.6-gke.10 cluster in GKE with Istio 1.1.2.
I have also spun up 1 Debian VM, with the sidecar deb package installed and configured, serving a simple web page (with the name of the service and the IP of the VM) on port 80.

The VM is configured with the k8s DNS in its resolv.conf and can resolve services to their ClusterIPs.
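
For reference, this is roughly how I verify resolution from the VM (dig from dnsutils is assumed to be installed; feature6 is just an example service):

    # On the VM: resolve a cluster service via kube-dns
    dig +short feature6.default.svc.cluster.local
    # Returns the ClusterIP, e.g. 10.24.164.4
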
On the VM, I have the following sidecar.env:

# Cluster service CIDR; outbound traffic to these IPs is captured by the sidecar
ISTIO_SERVICE_CIDR=10.24.164.0/22
# Name of the service this VM joins
ISTIO_SERVICE=feature2
ISTIO_INBOUND_INTERCEPTION_MODE=REDIRECT
# Inbound ports to capture
ISTIO_INBOUND_PORTS=80
ISTIO_NAMESPACE=default
# IP address of this VM, used as the service instance address
ISTIO_SVC_IP=10.24.0.63

Originally I tried TPROXY interception mode, but that was not working: attempts to access port 80 on the VM remotely timed out. REDIRECT, however, works and lets the traffic in.
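
For what it's worth, this is how I check that the REDIRECT rules are actually in place on the VM (the ISTIO_* chain names are the ones created by the istio-iptables script; output trimmed):

    # On the VM: list the NAT rules installed by the Istio sidecar in REDIRECT mode
    sudo iptables -t nat -S | grep ISTIO
    # Port 80 should appear in the ISTIO_INBOUND chain with a redirect to Envoy
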
On the cluster side, I have applied the following service definitions:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: feature2
  namespace: default
  labels:
    app: feature2
spec:
  hosts:
    - 'feature2.default.svc.cluster.local'
  addresses:
    - 10.24.0.63/32
  location: MESH_INTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: STATIC
  endpoints:
  - address: 10.24.0.63
    labels:
      app: feature2
      version: "v1"

---
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    alpha.istio.io/kubernetes-serviceaccounts: default
  name: feature2
  namespace: default
subsets:
- addresses:
  - ip: 10.24.0.63
  ports:
  - name: http
    port: 80
    protocol: TCP

---
apiVersion: v1
kind: Service
metadata:
  annotations:
    alpha.istio.io/kubernetes-serviceaccounts: default
  name: feature2
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
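
After applying the above, this is roughly how I sanity-check that the VM endpoint is registered and visible to the pod sidecars (<sleep-pod> is a placeholder, and I'm assuming curl is available in the proxy image):

    # Confirm the Endpoints object points at the VM
    kubectl -n default get endpoints feature2 -o wide

    # From the sleep pod's sidecar, check Envoy's view of the feature2 cluster
    # via the Envoy admin API on port 15000
    kubectl -n default exec <sleep-pod> -c istio-proxy -- \
      curl -s localhost:15000/clusters | grep feature2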

In Kiali, under Services, feature2 is shown as "Missing Sidecar", and I see feature6-v1 and sleep workloads associated with it that I do not expect to be there. The service also doesn't show up under "Applications" in Kiali.

When making a curl request from the pod-based sleep service to the VM-based service, I receive the web page; however, I do not see any telemetry in Kiali other than the service entry. Here is the output of the curl request:

    curl -v feature2.default.svc.cluster.local
    * Rebuilt URL to: feature2.default.svc.cluster.local/
    *   Trying 10.24.167.140...
    * TCP_NODELAY set
    * Connected to feature2.default.svc.cluster.local (10.24.167.140) port 80 (#0)
    > GET / HTTP/1.1
    > Host: feature2.default.svc.cluster.local
    > User-Agent: curl/7.60.0
    > Accept: */*
    > 
    < HTTP/1.1 200 OK
    < server: envoy
    < date: Mon, 15 Apr 2019 14:04:11 GMT
    < content-type: text/html
    < content-length: 44
    < last-modified: Fri, 12 Apr 2019 14:16:11 GMT
    < etag: "5cb09dab-2c"
    < accept-ranges: bytes
    < x-envoy-upstream-service-time: 3
    < 
    <h1>Welcome to Feature 2 on 10.24.0.63</h1>
    * Connection #0 to host feature2.default.svc.cluster.local left intact

As shown above, the Envoy sidecar on the VM is intercepting the request: the response header reads server: envoy. With the MESH_INTERNAL ServiceEntry configured as above, the Kiali service graph shows sleep → feature2.default.svc.cluster.local.

The VM-based service only shows up because of the ServiceEntry, and the information and telemetry for it are very limited compared to a normal pod-based service. Are VM-based services expected to show up only as service entries?

The bigger issue is with VM → Pod requests. In that case, the source is shown as unknown in Kiali, and I also see that unknown is sending traffic to istio-telemetry (feature6 is a pod-based service).

Here is the curl output when accessing feature6 (pod-based) from feature2 (the VM):

    curl -v feature6.default.svc.cluster.local
    * Rebuilt URL to: feature6.default.svc.cluster.local/
    *   Trying 10.24.164.4...
    * TCP_NODELAY set
    * Connected to feature6.default.svc.cluster.local (10.24.164.4) port 80 (#0)
    > GET / HTTP/1.1
    > Host: feature6.default.svc.cluster.local
    > User-Agent: curl/7.52.1
    > Accept: */*
    > 
    < HTTP/1.1 200 OK
    < server: envoy
    < date: Mon, 15 Apr 2019 14:14:41 GMT
    < content-type: text/html
    < content-length: 612
    < last-modified: Tue, 09 Apr 2019 11:20:51 GMT
    < etag: "5cac8013-264"
    < accept-ranges: bytes
    < x-envoy-upstream-service-time: 2
    < x-envoy-upstream-healthchecked-cluster: feature6.default
    < 
    <h1>Welcome to feature 6!</h1>
    * Curl_http_done: called premature == 0
    * Connection #0 to host feature6.default.svc.cluster.local left intact
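
For completeness, here is the kind of check I can run on the VM to see whether its sidecar is reporting to Mixer at all (Envoy admin API on port 15000, assuming the default sidecar config):

    # On the VM: look for mixer check/report activity in Envoy's stats
    curl -s localhost:15000/stats | grep -i mixer | head -20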

I have also defined an additional Prometheus scrape config to scrape the VM-based endpoints, in the hope of seeing more telemetry for the VM-based service, but with no luck. The scrape config is the following:
# Scrape config for VM envoy stats
- job_name: 'envoy-vm-stats'
  # Override the global default and scrape targets from this job every 5 seconds.
  scrape_interval: 5s
  metrics_path: /stats/prometheus

  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default

  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
  - source_labels: [__address__]
    separator:     ':'
    regex:         '(.*):(\d+)'
    target_label:  '__address__'
    replacement:   '${1}:15090'
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
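
To confirm that the VM's Envoy actually exposes the stats this scrape config points at (port 15090 and /stats/prometheus, per the relabel rule above), a quick check on the VM looks like:

    # On the VM: Envoy's built-in Prometheus endpoint targeted by the scrape config above
    curl -s http://localhost:15090/stats/prometheus | head -20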

In Prometheus, for the istio_requests_total query, for sleep → feature6 (pod to pod), I can see:

    istio_requests_total{connection_security_policy="none",destination_app="feature6",destination_principal="unknown",destination_service="feature6.default.svc.cluster.local",destination_service_name="feature6",destination_service_namespace="default",destination_version="v1",destination_workload="feature6-v1",destination_workload_namespace="default",instance="10.24.176.8:42422",job="istio-mesh",permissive_response_code="none",permissive_response_policyid="none",reporter="destination",request_protocol="http",response_code="200",response_flags="-",source_app="sleep",source_principal="unknown",source_version="unknown",source_workload="sleep",source_workload_namespace="default"}

But for feature2 → feature6 (VM to pod), I see the source as unknown:

    istio_requests_total{connection_security_policy="none",destination_app="feature6",destination_principal="unknown",destination_service="feature6.default.svc.cluster.local",destination_service_name="feature6",destination_service_namespace="default",destination_version="v1",destination_workload="feature6-v1",destination_workload_namespace="default",instance="10.24.176.8:42422",job="istio-mesh",permissive_response_code="none",permissive_response_policyid="none",reporter="destination",request_protocol="http",response_code="200",response_flags="-",source_app="unknown",source_principal="unknown",source_version="unknown",source_workload="unknown",source_workload_namespace="unknown"}

The instance IP (instance="10.24.176.8:42422") belongs to istio-telemetry.

What am I missing? Why is the VM not identified as the source?
Any ideas why this is not working as expected?

Thanks,
Vivo

This is a known bug in Istio telemetry. The fundamental problem is that Mixer does not take mesh expansion service discovery into account.

@douglas-reid has a pending PR to fix the issue: https://github.com/istio/istio/pull/12816