Training a ResNet-50 ImageNet Model using PyTorch on a Single AWS g4 or p3 Instance

Introduction

This document is a companion to the blog post “Tutorial: Getting started with a ML training model using AWS & PyTorch”, which helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU instances (including g4, p3, and p3dn instances). That guide, intended for everyone from beginners to skilled practitioners, focuses on choosing the right platform for the machine learning model that you want to deploy.

The tutorial below is for readers who have determined that a single-node AWS g4 or p3 instance is right for their machine learning workload.

Prepping the Model

Now that you have chosen to start with the g4 or p3 instance, it is time to prepare your model for training. In this tutorial, you will learn how to set up a training model for image recognition using ResNet-50. 

Please note that the setup process is exactly the same for a g4 or a p3 instance. Simply choose your instance type and move forward. (In this case you will select one of the following single-node instances: g4dn.xlarge, g4dn.2xlarge, g4dn.4xlarge, g4dn.8xlarge, g4dn.16xlarge, p3.2xlarge.)

At this point, your technology stack will include the following:

  • Model to train: ResNet-50
  • Machine learning framework: TensorFlow
  • Distributed training framework: Horovod
  • Single node – Single/Multi GPU
  • Instance: p3.2xlarge or greater
  • AMI: Deep Learning AMI (v33 – aLinux)

The AWS provided Deep Learning AMI is the best ready-to-use AMI for getting started with AI/ML on AWS. It is already packaged with ML and HPC software, examples, and drivers. https://aws.amazon.com/machine-learning/amis/

 

If you want to use our prebaked AMI for this tutorial, use the following AMI ID in step 3 c): ami-0e22bababb010e6c5 (us-east-1), and skip step 5.

Step 1 - Sign In to the AWS Management Console.

Step 2 - Open the EC2 console

Search for “EC2” in “Find Services” and click on it.

Step 3 - Launch a new instance

a) Find the “Launch instance” box and click the “Launch instance” button, then click “Launch instance” again.

b) Search “deep learning” in the search box and press Enter.

(If you do not want to learn how to make an AMI and how to download and prepare a dataset, use our public AMI for this tutorial, ami-0e22bababb010e6c5 (us-east-1), and skip ahead to step 6.)

c) Select the latest version of the Deep Learning AMI: “Deep Learning AMI (Amazon Linux 2) Version 35.0”. (The version number may have changed.)

There are many AMIs with similar names, so double-check that you have selected the correct one.

d) Filter by GPU instances and choose a p3.2xlarge instance.

Then, press “Next: Configure Instance Details”.

e) Configure Instance Details

Only change the Subnet option to: “default in us-east-1a.”

Then, press “Next: Add Storage”.

f) Add Storage

Set Size (GiB): 1024.

Then, press “Next: Add Tags”.

g) Add Tags

Click on “Add Tag” and create a Tag with:
Key: Name
Value: p3 – Node 1

Then, press “Next: Configure Security Group”.

h) Configure Security Group

Select: Create a new security group

Add the name and description.

In “Source”, select “My IP”.

Finally, press “Review and Launch”.

i) Review Instance and press “Launch”

j) Create a New Key pair

Enter a name for the key pair.

Press “Download Key Pair”. (Store it in a secure and accessible location; you cannot download it again.)

k) Launch Status

Click “View Instance” to see your instance status.

l) Find your instance’s IPv4 Public IP and copy it.

Step 4 - Connect to your instance

a) Open a terminal

b) Move to the directory where you downloaded the key pair (*.pem). Always replace the placeholders in angle brackets with your own information.

cd <key_pair_directory>

c) Change the permissions in your .pem file

chmod 400 <your .pem filename>

d) Connect to your instance using SSH.

ssh -i <your .pem filename> ec2-user@<your instance IPv4 Public IP>
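
SSH typically refuses a private key that other users can read, which is why step c) sets the mode to 400. As a minimal sketch, the check below confirms the mode before connecting. The function name key_is_private and the file name my-key.pem are hypothetical, and the fallback assumes either GNU or BSD stat is available.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the key file's mode is 400 or 600
# (owner-only access), which OpenSSH accepts for private keys.
key_is_private() {
  local key="$1" mode
  # GNU stat uses -c, BSD/macOS stat uses -f; try both.
  mode=$(stat -c '%a' "$key" 2>/dev/null || stat -f '%Lp' "$key" 2>/dev/null) || return 1
  [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

# Example usage (my-key.pem is a placeholder file name):
# key_is_private my-key.pem || chmod 400 my-key.pem
```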

Step 5 - Download the ImageNet dataset

a) Once inside the instance, create two directories in your home directory, data (with a tf-imagenet subfolder) and dataset, and then move into dataset.

mkdir -p ~/data/tf-imagenet/
mkdir ~/dataset
cd ~/dataset
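
Before downloading, it can be worth confirming that the volume has room for the archives used in this step (138GB and 6GB) plus their extracted and preprocessed copies. The sketch below is one way to check. The helper name has_free_gib and the threshold of 300 are illustrative assumptions, not measured requirements.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the filesystem holding PATH has at least
# NEEDED_GIB gibibytes available. Uses POSIX `df -Pk` (1024-byte blocks).
has_free_gib() {
  local path="$1" needed_gib="$2" avail_kib
  # Line 2 of `df -P` output describes the filesystem; column 4 is "Available".
  avail_kib=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge $((needed_gib * 1024 * 1024)) ]
}

# Example usage (300 is an illustrative threshold):
# has_free_gib ~/dataset 300 || echo "Not enough free space in ~/dataset"
```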

b) Download the training file (138GB).

Create train folder and extract the file with the following command:

mkdir train
tar xvf ILSVRC2012_img_train.tar -C train

Create the file untar.sh with the following data inside.

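#!/bin/bash
# Extract each per-class archive into its own directory under train/.
for filename in *.tar; do
  # Strip the .tar extension to get the directory name.
  DIRNAME="${filename%.*}"
  echo "$DIRNAME"
  mkdir -p "/home/ec2-user/dataset/train/$DIRNAME"
  tar xvf "$filename" -C "/home/ec2-user/dataset/train/$DIRNAME"
done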

Move to the train folder and untar all the files with the script.

cd train

bash ../untar.sh

Remove the *.tar files and return to the dataset folder.

rm *.tar

cd ..
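
After extraction, the train folder should hold one directory per class. Assuming the standard ILSVRC2012 training set with 1000 classes, a quick count can catch an interrupted extraction. The helper name count_class_dirs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print how many immediate subdirectories DIR contains.
count_class_dirs() {
  local dir="$1"
  find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Example usage (1000 assumes the standard ILSVRC2012 class count):
# [ "$(count_class_dirs ~/dataset/train)" -eq 1000 ] || echo "Unexpected number of class directories"
```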

c) Download the validation files (6GB).

Create validation folder and extract the file with the following command:

mkdir validation
tar xvf ILSVRC2012_img_val.tar -C validation

d) Download the labels file.

e) Use the image preprocessing script to generate a TFRecord format dataset from the raw ImageNet dataset.

cd ~/examples/horovod/tensorflow/utils

nohup python preprocess_imagenet.py --raw_data_dir=/home/ec2-user/dataset/ --local_scratch_dir=/home/ec2-user/data/ &
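
Because the preprocessing command runs in the background, the next step should not start until it has finished. The sketch below polls a process ID until the process exits. The helper name wait_for_pid and the 30-second default interval are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: block until the process with the given PID has exited,
# polling every INTERVAL seconds (default 30).
wait_for_pid() {
  local pid="$1" interval="${2:-30}"
  # `kill -0` sends no signal; it only tests whether the process exists.
  while kill -0 "$pid" 2>/dev/null; do
    sleep "$interval"
  done
}

# Example usage, right after launching the preprocessing command with `&`:
# pid=$!
# wait_for_pid "$pid" && echo "Preprocessing finished"
# Progress can be followed in the meantime with: tail -f nohup.out
```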

f) Use the image resizing script.

cd ~/examples/horovod/tensorflow/utils

nohup python tensorflow_image_resizer.py -d imagenet -i /home/ec2-user/data/train -o /home/ec2-user/data/tf-imagenet/ --subset_name train &

g) Move the validation data generated by the preprocessing script into ~/data/tf-imagenet

mv ~/data/validation/* ~/data/tf-imagenet

h) Finally, the directory ~/data/tf-imagenet must contain all the files (train-* and validation-*), with no subdirectories.
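
The layout requirement in step h) can be checked with a short script rather than by eye. This is a minimal sketch. The helper name check_tfrecord_dir is hypothetical, and it only inspects file names, not file contents.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if DIR has no subdirectories, every entry
# is named train-* or validation-*, and at least one file of each kind exists.
check_tfrecord_dir() {
  local dir="$1"
  # Fail if any entry is a directory or has an unexpected name.
  if find "$dir" -mindepth 1 -maxdepth 1 \
      \( -type d -o ! \( -name 'train-*' -o -name 'validation-*' \) \) |
      grep -q .; then
    return 1
  fi
  # Require at least one file of each kind.
  find "$dir" -maxdepth 1 -type f -name 'train-*' | grep -q . &&
    find "$dir" -maxdepth 1 -type f -name 'validation-*' | grep -q .
}

# Example usage:
# check_tfrecord_dir ~/data/tf-imagenet && echo "Layout looks right"
```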

Step 6 - Train the model

a) Move to the following folder:

cd ~/examples/horovod/tensorflow

b) Use vim to edit the hosts file

vim hosts

The file must be:
localhost slots=1
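
If you prefer not to use an interactive editor, the same one-line file can be written from the shell. In this sketch the helper name write_hosts_file is hypothetical, and the slots argument is exposed only for illustration; this tutorial uses 1.

```shell
#!/bin/bash
# Hypothetical helper: write a one-line hosts file for a single node.
# SLOTS defaults to 1, matching the file shown in this step.
write_hosts_file() {
  local file="$1" slots="${2:-1}"
  printf 'localhost slots=%s\n' "$slots" > "$file"
}

# Example usage, from ~/examples/horovod/tensorflow:
# write_hosts_file hosts 1
```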

c) Now, run the script to start training the model.

./train.sh 1

After a few seconds you will see the results.

Avg speed: 130
Avg speed: 130
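
To summarize a longer run, the reported speeds can be averaged from captured output. The sketch below assumes each relevant line begins with "Avg speed:" as in the sample above. The helper name mean_avg_speed and the file name train.log are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read training output on stdin and print the mean of
# all "Avg speed: <number>" lines, or nothing if none are found.
mean_avg_speed() {
  awk -F': *' '/^Avg speed:/ { sum += $2; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Example usage (train.log is a hypothetical file holding captured output):
# mean_avg_speed < train.log
```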

d) When training has finished or been canceled, stop the instance.

      1. Go to the EC2 Dashboard in the AWS Console.
      2. Right-click the instance, go to Instance State, and click Stop.
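
The instance can also be stopped from a terminal with the AWS CLI, assuming the CLI is installed and configured with credentials for your account. This is a sketch only. The helper name stop_instance, the DRY_RUN switch, and the instance ID shown are hypothetical, and the default region follows the us-east-1 subnet chosen earlier.

```shell
#!/bin/bash
# Hypothetical helper: stop an EC2 instance with the AWS CLI.
# With DRY_RUN=1 it only prints the command instead of running it.
stop_instance() {
  local instance_id="$1" region="${2:-us-east-1}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "aws ec2 stop-instances --instance-ids $instance_id --region $region"
  else
    aws ec2 stop-instances --instance-ids "$instance_id" --region "$region"
  fi
}

# Example usage (i-0123456789abcdef0 is a placeholder instance ID):
# DRY_RUN=1 stop_instance i-0123456789abcdef0
```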

Getting Help

Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.


About Author: Matthew Brucker
