Model training

While a Jupyter notebook is good for interactive model training, you may want to package the training code as a Docker image and run it in an Amazon EKS cluster.

This chapter explains how to train a model on the Fashion-MNIST dataset using TensorFlow and Keras on Amazon EKS. The dataset contains 70,000 grayscale images in 10 categories and is meant to be a drop-in replacement for MNIST.

Docker image

You can use the pre-built Docker image seedjeffwan/mnist_tensorflow_keras:1.13.1, which uses tensorflow/tensorflow:1.13.1 as its base image. The image contains the training code, downloads the training and test data sets, and stores the generated model in an S3 bucket.

Alternatively, you can build the image yourself from the Dockerfile:

docker build -t <dockerhub_username>/<repo_name>:<tag_name> .

Create S3 bucket

Create an S3 bucket where the trained model will be saved:

export S3_BUCKET=eks-ml-data
aws s3 mb s3://$S3_BUCKET --region $AWS_REGION

This bucket name is used in the pod specification later. The same bucket is also used for serving the model.

If you want to use an existing bucket in a different region, make sure to set the AWS_REGION environment variable in mnist-training.yaml to that bucket's region.
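S3 bucket names are shared across all AWS accounts, so a literal name such as eks-ml-data may already be taken. The following is a minimal sketch of one way to pick a name that is unlikely to collide before running aws s3 mb. The random suffix and the validation pattern are illustrative assumptions, not part of the workshop; the pattern only checks the basic length and character rules.

```shell
# Hypothetical helper: append 8 random hex characters to the base name
# so that the bucket name is unlikely to collide with an existing one.
SUFFIX=$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')
export S3_BUCKET="eks-ml-data-$SUFFIX"

# Basic sanity check: 3-63 characters, lowercase letters, digits, dots or
# hyphens, starting and ending with a letter or digit.
if printf '%s' "$S3_BUCKET" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'; then
  echo "using bucket name: $S3_BUCKET"
else
  echo "invalid bucket name: $S3_BUCKET" >&2
fi
```

Whatever name you choose, keep it exported as S3_BUCKET so that the later steps pick up the same value.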

Set up AWS credentials in EKS cluster

AWS credentials are required to save the model to the S3 bucket. These credentials are stored in the EKS cluster as a Kubernetes secret.

Get your AWS access key ID and secret access key.

Replace AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the following commands with the values specific to your environment:

export AWS_ACCESS_KEY_ID_VALUE=$(echo -n 'AWS_ACCESS_KEY_ID' | base64)
export AWS_SECRET_ACCESS_KEY_VALUE=$(echo -n 'AWS_SECRET_ACCESS_KEY' | base64)
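Kubernetes expects the values under data in a Secret to be base64-encoded, which is why the commands above pipe each credential through base64. As a quick sanity check, the sketch below round-trips a value through encoding and decoding. SAMPLE_VALUE is a made-up placeholder, not a real credential, and printf is used here only because it never appends a trailing newline.

```shell
# Placeholder only; not a real credential.
SAMPLE_VALUE='EXAMPLE_ACCESS_KEY_ID'

# Encode the raw string without a trailing newline, as the commands above do.
ENCODED=$(printf '%s' "$SAMPLE_VALUE" | base64)

# Decode it again to confirm that nothing was lost or added.
DECODED=$(printf '%s' "$ENCODED" | base64 --decode)

echo "encoded: $ENCODED"
echo "decoded: $DECODED"

if [ "$DECODED" = "$SAMPLE_VALUE" ]; then
  echo "round trip OK"
else
  echo "round trip FAILED" >&2
fi
```

If the decoded value differs from the input, the most likely cause is a stray newline or a quoting problem in the string that was encoded.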

Apply the secret to the EKS cluster:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
type: Opaque
data:
  AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID_VALUE
  AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY_VALUE
EOF

Run training using a pod

Create the pod:

curl -LO https://eksworkshop.com/kubeflow/kubeflow.files/mnist-training.yaml
envsubst <mnist-training.yaml | kubectl create -f -
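envsubst replaces references to environment variables in the manifest, and a variable that is not set is silently replaced with an empty string. A small guard before the substitution can catch that early. This is a sketch only: require_vars is a hypothetical helper, and S3_BUCKET and AWS_REGION are assumed to be the variables the manifest references, based on the earlier steps rather than on the file's contents.

```shell
# Hypothetical helper: fail if any named variable is missing from the
# environment or is empty. printenv only sees exported variables, which
# is also what envsubst sees.
require_vars() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "error: $name is not exported or is empty" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (variable names are assumptions, see above):
# require_vars S3_BUCKET AWS_REGION && envsubst <mnist-training.yaml | kubectl create -f -
```

Because the function returns a non-zero status when something is missing, chaining it with && prevents the pod from being created from a half-filled manifest.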

This starts a pod that runs the training and saves the generated model in the S3 bucket. Check the status:

kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
mnist-training    1/1     Running   0          2m45s

The last line of the pod logs (view them with kubectl logs mnist-training) shows that the exported model is saved to the S3 bucket.
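Rather than re-running kubectl get pods by hand, you could poll the pod's phase until training finishes. The sketch below assumes the pod is named mnist-training, as in the output above, and that its restart policy lets it reach the Succeeded phase once the container exits cleanly; that has not been checked against mnist-training.yaml. wait_for_pod is a hypothetical helper name.

```shell
# Poll a pod until it reaches a terminal phase (Succeeded or Failed).
# Usage: wait_for_pod <pod-name> [interval-seconds] [max-attempts]
wait_for_pod() {
  pod="$1"
  interval="${2:-10}"
  attempts="${3:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # .status.phase is one of Pending, Running, Succeeded, Failed, Unknown.
    phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')
    case "$phase" in
      Succeeded) echo "$pod finished successfully"; return 0 ;;
      Failed)    echo "$pod failed; inspect with: kubectl logs $pod" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $pod (last phase: $phase)" >&2
  return 2
}

# Example usage: wait up to 10 minutes, then print the last log line.
# wait_for_pod mnist-training 10 60 && kubectl logs mnist-training | tail -n 1
```

The function only defines the polling loop; nothing contacts the cluster until you call it.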