Migrate from GKE Istio add-on to open source Istio

Hello

We are using GKE Istio add-on version 1.6.14-gke.1.

We are considering migrating to the open source version of Istio. Google's documentation mentions that this is possible, but not how to do it. If we go ahead with the migration, would it be possible to follow the canary upgrade path to Istio version 1.7.8?

Does anyone have experience with Anthos Service Mesh? I'm just wondering: if you install Anthos Service Mesh in one cluster, do you then have one Anthos Service Mesh client? Or is an Anthos Service Mesh client something else?

Thanks

Hi @BoHenriksen, I’ve been playing with this and found your post while searching for tips. I’ll share what I’ve found so far.

To answer the second question first: I was unimpressed with Anthos Service Mesh. My major turn-offs were:

  1. Lots of manual work required on top of managing the mesh - including upgrades. (improvements to this were released yesterday, but I haven’t dug in)
  2. Lots of untrustworthy setup steps: “curl | kubectl” isn’t cool.
  3. Substantially more complex than istio itself, both from the operations and purchasing points of view.
  4. After being burned by Istio on GKE, it’s hard to trust another Google-managed version of this again.

Since IMO ASM isn’t a good option, I’ve been working on the canary upgrade from GKE’s 1.6 operator deployment to 1.7. It isn’t documented anywhere; I had to assemble this from Google’s 1.4 → 1.6 upgrade script plus various Istio sources. The Istio docs leave a lot to be desired, especially when you need details on an old release, and GKE’s IstioOperator config is a mystery to me, so for some of this I’m just hacking my way through. There may be mistakes.

Here are my notes; they’re a bit rough. At a high level, I’m following the instructions at Istioldie 1.8 / Managing Gateways with Multiple Revisions [experimental] to do canary upgrades with a separate gateway control plane. This reduces surprise gateway upgrades during the process - it looks more complicated, but I found it simpler to execute during testing.

  1. Deploy a new cluster with the Istio add-on, then disable it (nothing special about this cluster - it’s just a simple hack to make deploying a new test cluster easy; feel free to use your own):

    NUM=${NUM:-0}
    gcloud beta container clusters create cluster-$((NUM=NUM+1)) \
           --zone us-central1-a \
           --no-enable-basic-auth \
           --cluster-version 1.18.17-gke.1900 \
           --release-channel stable \
           --machine-type e2-standard-2 \
           --image-type COS \
           --disk-type pd-standard \
           --disk-size 100 \
           --metadata disable-legacy-endpoints=true \
           --scopes https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
           --max-pods-per-node 110 \
           --preemptible \
           --num-nodes 3 \
           --enable-stackdriver-kubernetes \
           --enable-ip-alias \
           --network projects/YOUR-PROJECT/global/networks/default \
           --subnetwork projects/YOUR-PROJECT/regions/us-central1/subnetworks/default \
           --no-enable-intra-node-visibility \
           --default-max-pods-per-node 110 \
           --enable-network-policy \
           --no-enable-master-authorized-networks \
           --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver,Istio \
           --enable-autoupgrade \
           --enable-autorepair \
           --max-surge-upgrade 1 \
           --max-unavailable-upgrade 0 \
           --maintenance-window-start 2021-06-15T07:00:00Z \
           --maintenance-window-end 2021-06-16T07:00:00Z \
           --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU" \
           --enable-shielded-nodes \
           --shielded-secure-boot \
           --node-locations us-central1-a \
           --istio-config auth=MTLS_STRICT
    
    # wait, wait, wait
    
    gcloud beta container clusters update cluster-$NUM \
           --zone=us-central1-a \
           --update-addons=Istio=DISABLED
    
  2. Do the last few steps of the 1.4 → 1.6 upgrade, since the GKE add-on doesn’t do them:

    # disable galley webhook from 1.4
    kubectl patch clusterrole -n istio-system istio-galley-istio-system --type='json' -p='[{"op": "replace", "path": "/rules/2/verbs/0", "value": "get"}]'
    kubectl delete ValidatingWebhookConfiguration istio-galley
    
    # migrate ingressgateway to new istio
    operator_cr_name=$(kubectl get istiooperators -n istio-system --no-headers | awk '{print $1}')
    kubectl patch istiooperator -n istio-system "${operator_cr_name}" --type='json' -p='[{"op": "replace", "path": "/spec/components/ingressGateways/0/enabled", "value": true}]'
    
  3. Label the default namespace for injection:

    kubectl label namespace default istio.io/rev=istio-1611
    
  4. Deploy an example nginx app

    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: example
      namespace: istio-system
    spec:
      selector:
        istio: ingressgateway
      servers:
        - hosts:
            - '*.example.com'
          port:
            name: http
            number: 80
            protocol: HTTP
    
    ---
    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: example
    spec:
      gateways:
        - istio-system/example
      hosts:
        - example.example.com
      http:
        - match:
            - uri:
                prefix: /
          rewrite:
            uri: /
          route:
            - destination:
                host: example.default.svc.cluster.local
                port:
                  number: 80
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: example
      name: example
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: example
      template:
        metadata:
          labels:
            app: example
        spec:
          containers:
            - image: nginx:latest
              name: nginx
              resources:
                requests:
                  cpu: 128m
                  memory: 128Mi
                limits:
                  cpu: 128m
                  memory: 128Mi
              ports:
                - containerPort: 80
                  protocol: TCP
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: example
      name: example
    spec:
      ports:
        - name: http
          port: 80
          protocol: TCP
          targetPort: 80
      selector:
        app: example
      type: ClusterIP
    
  5. Get the ingress gateway IP, set up /etc/hosts, and validate (a quick curl check is sketched below):

    ip=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[*].ip}')
    echo $ip
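
A quick way to validate without touching /etc/hosts is to resolve the example hostname to the gateway IP on the curl command line; the hostname comes from the example Gateway/VirtualService above.

    # hedged sketch: hit the gateway with the example host, no /etc/hosts edit needed
    curl -sv --resolve example.example.com:80:$ip -o /dev/null http://example.example.com/
    # expect a 200 and the nginx Server header once the route is working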
    
  6. Deploy a new Istio operator 1.7.8; it will try to start istiod using the 1.6 IstioOperator but fail due to unsupported options (a way to watch this is sketched below):

    istioctl-1.7.8 operator init -r 1-7-8
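
To confirm the new operator is running and watch the failure this step mentions, something like the following works; the deployment name and the istio-operator namespace are assumptions based on the default naming from `operator init -r 1-7-8`.

    kubectl -n istio-operator get pods
    kubectl -n istio-operator logs deployment/istio-operator-1-7-8 | tail -n 20
    # the GKE-managed istiod should still be the only healthy control plane at this point
    kubectl -n istio-system get pods -l app=istiod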
    
  7. create new control planes using the new operator revision:

    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: primary-1-7-8
      namespace: istio-system
    spec:
      profile: minimal
      revision: 1-7-8
    
    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: gateways-1-7-8
      namespace: istio-system
    spec:
      profile: empty
      revision: 1-7-8
      components:
        ingressGateways:
          - enabled: false          # to be enabled later
            name: istio-ingressgateway
    
  8. Upgrade the workloads to the primary-1-7-8 control plane

    kubectl label namespace default istio.io/rev=1-7-8 --overwrite
    
  9. Restart the deployment and validate (a version check is sketched below):

    kubectl rollout restart deployment/example
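
To validate, it’s worth confirming the restarted pods actually picked up the 1.7.8 proxy; a minimal check, assuming the example app from step 4:

    # the proxy image on the example pods should now be 1.7.8, not 1.6.14-gke.1
    kubectl get pods -l app=example -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
    istioctl-1.7.8 proxy-status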
    
  10. Upgrade the ingress gateway deployment, validate

    kubectl patch istiooperator -n istio-system gateways-1-7-8 --type='json' -p='[{"op": "replace", "path": "/spec/components/ingressGateways/0/enabled", "value": true}]'
    istioctl-1.7.8 proxy-status
    
  11. Remove the GKE Istio operator. Don’t panic: this nukes the operator namespace, but you just need to reinstall the 1-7-8 revision operator:

    istioctl-1.6.14 operator remove
    istioctl-1.7.8 operator init -r 1-7-8
    
  12. Scale the 1.4 control plane to zero and clean up miscellaneous leftovers:

    kubectl scale deploy -n istio-system --replicas=0 istio-citadel istio-galley istio-pilot istio-policy istio-sidecar-injector istio-telemetry promsd
    kubectl -n istio-system delete job istio-security-post-install-1.4.10-gke.8
    
  13. POSSIBLE PROBLEM: the ingress gateway was still using the 1.6.14-gke.1 container image. Eventually something kicked it - maybe installing the istio operator for 1.8? A check like the one sketched below shows which image the gateway is actually running.
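
A minimal check, using the gateway deployment name from the GKE add-on install:

    # if this still prints a 1.6.14-gke.1 image, the gateway hasn't rolled over yet
    kubectl -n istio-system get deploy istio-ingressgateway -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'
    istioctl-1.7.8 proxy-status | grep ingressgateway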

  14. Upgrade to 1.8: roll over the workloads, roll over the gateway, then remove 1.7 (a rough command sketch follows the manifests below):

    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: primary-1-8-6
      namespace: istio-system
    spec:
      profile: minimal
      revision: 1-8-6
    
    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: gateways-1-8-6
      namespace: istio-system
    spec:
      components:
        ingressGateways:
          - enabled: true
            name: istio-ingressgateway
      profile: empty
      revision: 1-8-6
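
For the rollover itself, I’d expect the same sequence as steps 6-12 with the new revision; a rough sketch, assuming the manifests above are saved locally as primary-1-8-6.yaml and gateways-1-8-6.yaml and that istioctl-1.8.6 follows the binary naming used earlier. The same pattern then repeats for 1.9 and 1.10 in the next two steps.

    istioctl-1.8.6 operator init -r 1-8-6
    kubectl apply -f primary-1-8-6.yaml -f gateways-1-8-6.yaml
    kubectl label namespace default istio.io/rev=1-8-6 --overwrite
    kubectl rollout restart deployment/example
    istioctl-1.8.6 proxy-status
    # once everything reports 1-8-6, the 1-7-8 IstioOperator CRs and operator deployment
    # can go - but watch that the shared istio-ingressgateway isn't pruned along with them
    kubectl -n istio-system delete istiooperator primary-1-7-8 gateways-1-7-8
    kubectl -n istio-operator delete deployment istio-operator-1-7-8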
    
  15. upgrade to 1.9

    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: primary-1-9-5
      namespace: istio-system
    spec:
      profile: minimal
      revision: 1-9-5
    
    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: gateways-1-9-5
      namespace: istio-system
    spec:
      components:
        ingressGateways:
          - enabled: true
            name: istio-ingressgateway
      profile: empty
      revision: 1-9-5
    
  16. upgrade to 1.10

    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: primary-1-10-1
      namespace: istio-system
    spec:
      profile: minimal
      revision: 1-10-1
    
    ---
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: gateways-1-10-1
      namespace: istio-system
    spec:
      components:
        ingressGateways:
          - enabled: true
            name: istio-ingressgateway
      profile: empty
      revision: 1-10-1
    
  17. Review cluster resources - the istio operator didn’t correctly clean up my old revisions; I had to delete the old istiod deployments, services, HPAs, and webhook configs myself (a cleanup sketch is below).
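
For that cleanup, listing by revision label first and then deleting is the least surprising approach; a sketch, assuming the leftovers carry the usual istio.io/rev label (1-7-8 used as the example - check the get output before deleting anything):

    # see what's left over from an old revision
    kubectl -n istio-system get deploy,svc,hpa -l istio.io/rev=1-7-8
    kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -l istio.io/rev=1-7-8
    # then remove the leftovers once nothing references that revision
    kubectl -n istio-system delete deploy,svc,hpa -l istio.io/rev=1-7-8
    kubectl delete mutatingwebhookconfigurations -l istio.io/rev=1-7-8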

I haven’t done this on a production cluster yet, but probably will soonish.

Hi @rvandegrift

Thanks for your answer. That’s awesome.

I’ll look into it next week and probably test it on a test cluster.

I really appreciate your detailed answer. Thanks :slight_smile:

hi @BoHenriksen - I tried running this on a real cluster and ran into some problems.

After the canary deployment of istio operator 1.7.8, I changed a namespace to point to the new 1-7-8 rev.

The service came back fine, but it’s still getting the 1.6.14-gke.1 sidecar injection. The istio-proxy’s CA_ADDR and discoveryAddress point to the 1-7-8 control plane. So I started to dig in.

The istiod-1-7-8 deployment is labeled operator.istio.io/version=1.6.14-gke.1 and uses the discovery image gcr.io/gke-release/istio/pilot:1.6.14-gke.1. I’m not clear why - everything else in the deployment points to rev 1-7-8.

The mutatingwebhookconfig for 1-7-8 is also labeled operator.istio.io/version=1.6.14-gke.1. Not sure what happened there - again, everything else reflects that it should be 1-7-8.
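
For anyone retracing this, the digging looked roughly like the following; the resource names are assumptions based on the 1-7-8 revision naming:

    # istiod deployment: check the operator.istio.io/version label and the discovery image
    kubectl -n istio-system get deploy istiod-1-7-8 --show-labels
    kubectl -n istio-system get deploy istiod-1-7-8 -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'
    # the sidecar injector webhook for the revision, and its labels
    kubectl get mutatingwebhookconfigurations --show-labels | grep 1-7-8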

Since mixing versions seems like it’ll be bad, I’ve rolled the namespace back to istio-1611. I’ll probably delete 1-7-8 and try again.

One issue that jumps out: istioctl 1.6.14 doesn’t appear to have revision support yet, so it could be that the 1.6 → 1.7 upgrade has to be done in-place?

Hi @rvandegrift

I have done the upgrade once on a test cluster following your approach, and I had the same issue you describe.

What I did to solve it was to delete the istio-operator. It was then recreated and started a new pod using the image for version 1-7-8. After that I also had to restart the application workloads, and then everything worked.
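
In kubectl terms that was roughly the following; the pod name and the istio-operator namespace are assumptions based on the default revisioned operator install:

    kubectl -n istio-operator get pods
    kubectl -n istio-operator delete pod <istio-operator-1-7-8-pod>   # it came back on the 1-7-8 image
    kubectl -n <your-app-namespace> rollout restart deployment        # then bounce the workloads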

Going from 1-7-8 to 1.8.6 I also had some issues. The first was that the primary-1-8-6 IstioOperator reconciled with an error. To solve that, I deleted the primary-1-8-6 IstioOperator and redeployed it; it then reconciled with a healthy status.

The second issue was that the istio-ingressgateway was never updated to version 1-8-6. I guess it’s because of this issue. To solve it I did this (roughly the commands sketched after the list):

  1. scale down istio-operator-1-7-8 and istiod-1-7-8 to 0.
  2. Delete the pod istio-operator-1-8-6.
  3. Delete the pod istio-ingressgateway.

The ingressgateway then started up with version 1.8.6.
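
For reference, those three steps translate to roughly the following; the deployment names come from the revisioned installs above, and the istio-operator namespace is an assumption:

    # 1. scale the old operator and control plane to zero
    kubectl -n istio-operator scale deploy istio-operator-1-7-8 --replicas=0
    kubectl -n istio-system scale deploy istiod-1-7-8 --replicas=0
    # 2. bounce the 1-8-6 operator pod so it reconciles again
    kubectl -n istio-operator get pods
    kubectl -n istio-operator delete pod <istio-operator-1-8-6-pod>
    # 3. bounce the gateway pods so they come up on 1.8.6
    kubectl -n istio-system delete pod -l app=istio-ingressgateway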

Upgrading from 1.8.6 to 1.10.2 was a breeze, with no problems.

As you mention I think the upgrade from 1.6 to 1.7 has to be done in-place.

We have also talked about uninstalling Istio and then installing the latest version from scratch. I have tried that once and it worked out fine: basically follow the uninstall process (you still have to delete the clusterroles and clusterrolebindings manually), then install Istio again.
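
For that wipe-and-reinstall path, a minimal sketch, assuming a recent istioctl (the uninstall command has moved out of the experimental group in newer releases, so adjust for your version):

    istioctl x uninstall --purge          # `istioctl uninstall --purge` on newer istioctl
    kubectl delete namespace istio-system
    # the clusterroles/clusterrolebindings mentioned above have to go manually
    kubectl get clusterrole,clusterrolebinding -o name | grep -i istio
    # delete whatever that turns up, then reinstall and re-enable injection
    istioctl install --set profile=default
    kubectl label namespace default istio-injection=enabled --overwrite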

We still have to test both methods to be more confident that they work every time, and decide which method to go with, but for now the holiday season is coming and this task has been put on pause.

Hi @BoHenriksen and @rvandegrift,

I work on Istio at Google Cloud, and I’d like to encourage you to give Anthos Service Mesh another look.

  1. Lots of manual work required on top of managing the mesh - including upgrades. (improvements to this were released yesterday, but I haven’t dug in)

Please do! It’s going to look very familiar to you if you’ve been following the canary control plane deployment model and revision/tag-based upgrades.

  2. Lots of untrustworthy setup steps: “curl | kubectl” isn’t cool.

The setup is currently done with a script that you’re encouraged to read and understand before blindly running!

You’ll be pleased to know we’re currently testing a new and improved installation experience, which we hope to release very soon.

  3. Substantially more complex than istio itself, both from the operations and purchasing points of view.

ASM manages the control plane and certificate authority for you, so it’s substantially simpler than Istio.

Purchasing is also trivial: enable the API! You don’t need to be an Anthos subscriber (not obvious from the name, that’s true.) It’s even free until October 2021.

  4. After being burned by Istio on GKE, it’s hard to trust another Google-managed version of this again.

Mea culpa. We have been focused on improving the Istio open-source installation experience and on our managed product. “Istio on GKE” sat in the middle of those two options, tied to GKE releases, so we are now working to move people to whichever option suits them best. If that’s open source Istio, as discussed in the rest of the thread, that’s great. But we’d love it if you’d give ASM a look.

Please ping me if you have any questions!

Hi @craigbox

Thanks for your input.

I’m just looking at the pricing of Anthos Service Mesh. It includes 100 Anthos Service Mesh clients on a cluster. Is an Anthos Service Mesh client the same as a sidecar?

Yes; I guess the language is vague because there could conceivably be support for other XDS clients like gRPC endpoints. :slightly_smiling_face: