Install

In this chapter, we will install Kubeflow on an Amazon EKS cluster. If you don’t have an EKS cluster, follow the instructions in the getting started guide, then launch your EKS cluster using the eksctl chapter.

Increase cluster size

The Kubeflow chapters need more resources than the default cluster provides, so let’s increase the size of our cluster. First, list the nodegroups:

eksctl get nodegroups --cluster eksworkshop-eksctl

Take note of your nodegroup name and use it to scale your cluster. For example, here is my nodegroup:

CLUSTER                 NODEGROUP       CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY    INSTANCE TYPE   IMAGE ID
eksworkshop-eksctl      ng-52c1fb5e     2019-11-05T16:22:50Z    3               3               3          m

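My cluster currently has 3 nodes in the ng-52c1fb5e nodegroup, which is also its maximum size. To scale it to 6 nodes, I use these commands (if your eksctl version refuses to exceed the maximum size, also pass --nodes-max 6):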

export NODEGROUP_NAME=ng-52c1fb5e
eksctl scale nodegroup --cluster eksworkshop-eksctl --name $NODEGROUP_NAME --nodes 6
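
Instead of copying the nodegroup name by hand, it can be read from the table output. The following is a minimal sketch, not part of the original workshop: the helper name nodegroup_names is hypothetical, and it assumes the output keeps the column layout shown in the listing above (header row starting with CLUSTER, nodegroup name in the second column). The sample text is copied from that listing so the snippet runs without a cluster.

```shell
# Sketch: print the NODEGROUP column (2nd field) of every data row of
# `eksctl get nodegroups` table output read from stdin.
# Assumes the header row starts with CLUSTER, as in the listing above.
nodegroup_names() {
  awk '$1 != "CLUSTER" && NF >= 2 { print $2 }'
}

# Sample input copied from the listing above (data row truncated as in the original).
sample='CLUSTER                 NODEGROUP       CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY    INSTANCE TYPE   IMAGE ID
eksworkshop-eksctl      ng-52c1fb5e     2019-11-05T16:22:50Z    3               3               3          m'

# Take the first nodegroup listed.
NODEGROUP_NAME=$(printf '%s\n' "$sample" | nodegroup_names | head -n 1)
echo "$NODEGROUP_NAME"
```

Against a live cluster, the same helper would be fed by the command shown earlier: eksctl get nodegroups --cluster eksworkshop-eksctl | nodegroup_names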

Scaling the nodegroup takes 2 to 3 minutes.

Install Kubeflow on Amazon EKS

Download the v0.7.0 release of kfctl. This binary lets you install Kubeflow on Amazon EKS:

curl --silent --location "https://github.com/kubeflow/kubeflow/releases/download/v0.7.0/kfctl_v0.7.0_linux.tar.gz" | tar xz -C /tmp
sudo mv -v /tmp/kfctl /usr/local/bin

Export the URI of the Kubeflow configuration file:

export CONFIG_URI=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.0.yaml

Customize your configuration

Set an environment variable for your AWS cluster name, and give the Kubeflow deployment the same name as the cluster.

export AWS_CLUSTER_NAME=eksworkshop-eksctl
export KF_NAME=${AWS_CLUSTER_NAME}

Set the path to the base directory where you want to store Kubeflow deployments. Then set the Kubeflow application directory for this deployment.

export BASE_DIR=~/environment
export KF_DIR=${BASE_DIR}/${KF_NAME}

Until https://github.com/kubeflow/kubeflow/issues/3827 is fixed, install aws-iam-authenticator:

curl -o aws-iam-authenticator https://amazon-eks.s3-us-west-2.amazonaws.com/1.13.7/2019-06-11/bin/linux/amd64/aws-iam-authenticator
chmod +x aws-iam-authenticator
sudo mv aws-iam-authenticator /usr/local/bin
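
Before moving on to the build step, it can help to confirm that the binaries installed so far are actually on the PATH. This is a small sketch under that assumption; the function name check_tools is hypothetical and not part of kfctl or the workshop.

```shell
# Sketch: report which of the given commands are missing from PATH.
# Prints one line per missing command and returns non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Check the two binaries installed in this chapter.
check_tools kfctl aws-iam-authenticator || echo "install the missing tools before continuing"
```

The same helper can be extended with kubectl and eksctl, which the earlier chapters are expected to have installed.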

Run the kfctl build command to set up your configuration:

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl build -V -f ${CONFIG_URI}

Set an environment variable pointing to your local configuration file:

export CONFIG_FILE=${KF_DIR}/kfctl_aws.0.7.0.yaml
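
The file name used in CONFIG_FILE is the last path component of CONFIG_URI. As a sketch, the path can be derived rather than typed, assuming (as the path above implies) that kfctl build leaves the configuration file under that same name in the deployment directory. The KF_DIR fallback value below only mirrors the earlier exports so the snippet runs on its own.

```shell
CONFIG_URI=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.0.yaml
# Fall back to the workshop's directory layout if KF_DIR is not set yet.
KF_DIR=${KF_DIR:-$HOME/environment/eksworkshop-eksctl}

# basename strips everything up to the last "/" of the URI.
CONFIG_FILE=${KF_DIR}/$(basename "$CONFIG_URI")
echo "$CONFIG_FILE"

# Warn early if the build step did not leave the file where expected.
[ -f "$CONFIG_FILE" ] || echo "warning: $CONFIG_FILE not found; re-check the kfctl build step"
```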

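Replace the EKS cluster name, AWS region, and IAM role in your ${CONFIG_FILE}. These commands assume ROLE_NAME and AWS_REGION are already set in your shell; neither is defined in this chapter: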

sed -i "s@eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@$ROLE_NAME@" ${CONFIG_FILE}
sed -i -e 's/kubeflow-aws/'"$AWS_CLUSTER_NAME"'/' ${CONFIG_FILE}
sed -i "s@us-west-2@$AWS_REGION@" ${CONFIG_FILE}
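
To see what these substitutions do before touching the real file, they can be tried on a throwaway copy. The sketch below makes several assumptions: the file content is a hypothetical stand-in that contains only the three placeholder strings targeted above (it is not the real kfctl_aws.0.7.0.yaml, and the field names are illustrative), the ROLE_NAME and AWS_REGION values are made-up examples, and sed -i without a suffix is GNU sed syntax.

```shell
# Hypothetical example values; in the workshop these come from your environment.
ROLE_NAME=example-node-instance-role
AWS_REGION=eu-west-1
AWS_CLUSTER_NAME=eksworkshop-eksctl

# Throwaway stand-in for ${CONFIG_FILE}; field names are illustrative only.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
clusterName: kubeflow-aws
region: us-west-2
roles:
- eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx
EOF

# Stop early if a variable is empty: sed would otherwise replace the
# placeholder with an empty string.
: "${ROLE_NAME:?ROLE_NAME must be set}" "${AWS_REGION:?AWS_REGION must be set}"

# The same three substitutions as above, applied to the stand-in file.
sed -i "s@eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@$ROLE_NAME@" "$demo_file"
sed -i -e 's/kubeflow-aws/'"$AWS_CLUSTER_NAME"'/' "$demo_file"
sed -i "s@us-west-2@$AWS_REGION@" "$demo_file"

# Show the result and clean up.
result=$(cat "$demo_file")
rm -f "$demo_file"
printf '%s\n' "$result"
```

After running the real commands, a quick grep for the original placeholder strings in ${CONFIG_FILE} shows whether any substitution failed to match.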

Deploy Kubeflow

Apply configuration and deploy Kubeflow on your cluster:

rm -rf kustomize
kfctl apply -V -f ${CONFIG_FILE}

Wait for all pods to be in Running state (this can take a few minutes):

kubectl get pods -n kubeflow

You should see output similar to the following:

NAME                                                           READY   STATUS    RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0                     1/1     Running   0          5m19s
admission-webhook-deployment-78d899bf68-bszdj                  1/1     Running   0          4m20s
alb-ingress-controller-6868b86fbf-dwjvm                        1/1     Running   0          5m13s
application-controller-stateful-set-0                          1/1     Running   0          5m20s
argo-ui-55b859f7d7-q5t45                                       1/1     Running   0          5m20s
centraldashboard-75474d6f94-w4smp                              1/1     Running   0          5m19s
jupyter-web-app-deployment-6c8f4c8997-kjwx7                    1/1     Running   0          5m19s
katib-controller-7ddd4c8b8c-ddbmd                              1/1     Running   1          5m16s
katib-db-7b679f6f8c-hlxdn                                      1/1     Running   0          5m16s
katib-manager-84c4fb876b-g758b                                 1/1     Running   0          5m16s
katib-ui-5d454c75c7-ghmh2                                      1/1     Running   0          5m16s
metacontroller-0                                               1/1     Running   0          5m20s
metadata-db-5dd459cc-64tm6                                     1/1     Running   0          5m18s
metadata-deployment-b745d8bcf-jfq8l                            1/1     Running   0          5m18s
metadata-deployment-b745d8bcf-kwn9r                            1/1     Running   0          5m18s
metadata-envoy-deployment-7ccf5c4f74-kl99k                     1/1     Running   0          5m18s
metadata-grpc-deployment-6496f66c8c-clbnq                      1/1     Running   5          5m18s
metadata-grpc-deployment-6496f66c8c-p6vhb                      1/1     Running   5          5m18s
metadata-ui-78f5b59b56-mdvmv                                   1/1     Running   0          5m18s
minio-6f48db9cc4-tvmjc                                         1/1     Running   0          5m16s
ml-pipeline-844645fd-sj8sc                                     1/1     Running   0          5m16s
ml-pipeline-ml-pipeline-visualizationserver-865894f5f7-bv8mk   1/1     Running   0          5m14s
ml-pipeline-persistenceagent-66f89b56d9-4s862                  1/1     Running   0          5m15s
ml-pipeline-scheduledworkflow-57445ddf88-b6np4                 1/1     Running   0          5m15s
ml-pipeline-ui-5c64b6c666-pczbk                                1/1     Running   0          5m15s
ml-pipeline-viewer-controller-deployment-7cc8d77468-l8qdz      1/1     Running   0          5m15s
mpi-operator-5bf8b566b7-92b6n                                  1/1     Running   0          5m13s
mysql-749f87bff5-zk26s                                         1/1     Running   0          5m15s
notebook-controller-deployment-6c887454f7-xr5gx                1/1     Running   0          5m17s
nvidia-device-plugin-daemonset-bhjwh                           1/1     Running   0          5m15s
nvidia-device-plugin-daemonset-ftcdr                           1/1     Running   0          5m15s
nvidia-device-plugin-daemonset-fzd8c                           1/1     Running   0          5m15s
profiles-deployment-67655ddbdd-68h6z                           2/2     Running   0          5m14s
pytorch-operator-84c58df794-xvdg2                              1/1     Running   0          5m17s
seldon-operator-controller-manager-0                           1/1     Running   1          5m16s
spartakus-volunteer-64cb78bbc5-4kb4f                           1/1     Running   0          5m17s
tensorboard-6544748d94-rpvd5                                   1/1     Running   0          5m17s
tf-job-operator-db676465c-vl6vh                                1/1     Running   0          5m17s
workflow-controller-676484d796-t8vjc                           1/1     Running   0          5m19s
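
With this many pods, scanning the STATUS column by eye is error-prone. The following sketch filters the table for pods that are not yet Running. The helper name pods_not_running is hypothetical, and it assumes the column layout shown above (STATUS in the third column). Two sample rows are copied from the listing; the Pending row is invented for illustration.

```shell
# Sketch: list pods whose STATUS column (3rd field) is not "Running".
# Reads `kubectl get pods` table output on stdin and skips the header row.
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Two rows copied from the listing above; the Pending row is hypothetical.
sample='NAME                                                           READY   STATUS    RESTARTS   AGE
centraldashboard-75474d6f94-w4smp                              1/1     Running   0          5m19s
minio-6f48db9cc4-tvmjc                                         1/1     Running   0          5m16s
example-pod-0                                                  0/1     Pending   0          10s'

printf '%s\n' "$sample" | pods_not_running
```

Against the cluster, the helper would be used as: kubectl get pods -n kubeflow | pods_not_running. It prints nothing once every pod is Running.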