Training a ResNet-50 ImageNet Model using PyTorch on a Single AWS g4 or p3 Instance
Introduction
This document is a companion to the blog post “Tutorial: Getting started with a ML training model using AWS & PyTorch”, which helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, intended for everyone from beginners to skilled practitioners, focuses on choosing the right platform for the machine learning model that you want to deploy.
The tutorial below is for readers who have determined that a single-node AWS g4 or p3 instance is right for their machine learning workload.
Prepping the Model
Now that you have chosen to start with the g4 or p3 instance, it is time to prepare your model for training. In this tutorial, you will learn how to set up a training model for image recognition using ResNet-50.
Please note that the setup process is exactly the same for a g4 or a p3 instance. Simply choose your instance type and move forward. (In this case you will select one of the following single-node instances: g4dn.xlarge, g4dn.2xlarge, g4dn.4xlarge, g4dn.8xlarge, g4dn.16xlarge, p3.2xlarge.)
At this point you will need:
- An AWS Account with console and programmatic access
- The AWS Command Line Interface (CLI), installed and configured with your credentials
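Before launching anything, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the original tutorial: the function name check_aws_cli is hypothetical, while aws sts get-caller-identity is the standard AWS CLI call that reports the identity behind the current credentials.

```shell
# Hypothetical helper: checks that the AWS CLI is installed and that
# the configured credentials are accepted by AWS.
check_aws_cli() {
  if ! command -v aws >/dev/null 2>&1; then
    echo "AWS CLI not found; install it before continuing" >&2
    return 1
  fi
  # This call fails when no valid credentials are configured.
  if ! aws sts get-caller-identity >/dev/null; then
    echo "AWS credentials are missing or invalid; run 'aws configure'" >&2
    return 1
  fi
  echo "AWS CLI and credentials look OK"
}
```

Run check_aws_cli in your terminal; if it reports a problem, fix it before moving on to Step 1.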
Your technology stack will include the following:
- Model to train: ResNet-50
- Machine learning framework: TensorFlow
- Distributed training framework: Horovod
- Single node – Single/Multi GPU
- Instance: p3.2xlarge or greater
- AMI: Deep Learning AMI (v33 – aLinux)
The AWS provided Deep Learning AMI is the best ready-to-use AMI for getting started with AI/ML on AWS. It is already packaged with ML and HPC software, examples, and drivers. https://aws.amazon.com/machine-learning/amis/
If you want to use our prebaked AMI for this tutorial, use the following AMI ID in step 3.c: ami-0e22bababb010e6c5 (us-east-1), and skip Step 5.
Step 1 - Sign In to the AWS Management Console.
Step 2 - Open the EC2 console
Search for “EC2” in “Find Services” and click on it.

Step 3 - Launch a new instance
a) Find the “Launch instance” box and click the “Launch instance” button. Then click “Launch instance” again.

b) Search “deep learning” in the search box and press Enter.
(If you do not want to learn how to make an AMI or how to download and prepare a dataset, use our public AMI for this tutorial, ami-0e22bababb010e6c5 (us-east-1), and skip ahead to Step 6.)

c) Select the latest version of the Deep Learning AMI, for example “Deep Learning AMI (Amazon Linux 2) Version 35.0”. (The version number may change.)
There are many AMIs with similar names, so double-check that you have selected the correct one.

d) Filter by GPU instances and choose a p3.2xlarge instance.
Then, press “Next: Configure Instance Details”.

e) Configure Instance Details
Only change the Subnet option to: “default in us-east-1a.”
Then, press “Next: Add Storage”.

f) Add Storage
Set Size (GiB): 1024.
Then, press “Next: Add Tags”.

g) Add Tags
Click on “Add Tag” and create a Tag with:
Key: Name
Value: p3 – Node 1
Then, press “Next: Configure Security Group”.
h) Configure Security Group
Select: Create a new security group
Add the name and description.
In “Source”, select “My IP”.
Finally, press “Review and Launch”.

i) Review Instance and press “Launch”

j) Create a New Key pair
Enter a name for the key pair.
Press “Download Key Pair”. (Store the file in a secure, accessible location; you cannot download it again.)

k) Launch Status
Click “View Instance” to see your instance status.

l) Find your instance’s IPv4 Public IP and copy it.
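The console steps above can also be approximated from the AWS CLI. The sketch below is an assumption-laden alternative, not the method the tutorial uses: the function names are hypothetical, the key pair and security group must already exist, the root device name /dev/xvda is assumed for the Amazon Linux 2 AMI, and the tag value p3-node-1 is a simplified stand-in for the Name tag set in step g.

```shell
# Hypothetical helper: launches one p3.2xlarge instance in us-east-1a
# with a 1024 GiB root volume, mirroring the console choices above.
# Arguments: AMI ID, existing key pair name, existing security group ID.
launch_instance() {
  local ami_id="$1" key_name="$2" sg_id="$3"
  # Assumption: /dev/xvda is the root device of the chosen AMI.
  aws ec2 run-instances \
    --region us-east-1 \
    --image-id "$ami_id" \
    --instance-type p3.2xlarge \
    --count 1 \
    --key-name "$key_name" \
    --security-group-ids "$sg_id" \
    --placement 'AvailabilityZone=us-east-1a' \
    --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=1024}' \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=p3-node-1}]'
}

# Hypothetical helper: prints the public IPv4 address of an instance.
get_public_ip() {
  local instance_id="$1"
  aws ec2 describe-instances \
    --region us-east-1 \
    --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' \
    --output text
}
```

Call launch_instance with your AMI ID, key pair name, and security group ID, then pass the instance ID it reports to get_public_ip. The console remains the easier route if you have not created a key pair or security group yet.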

Step 4 - Connect to your instance
a) Open a terminal
b) Move to the directory where you downloaded the key pair (*.pem). In every command, replace the placeholders in angle brackets with your own information.
cd <key_pair_directory>
c) Change the permissions in your .pem file
chmod 400 <your .pem filename>
d) Connect to your instance using SSH.
ssh -i <your .pem filename> ec2-user@<your instance IPv4 Public IP>
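If you reconnect often, steps c and d can be wrapped in one small function. This is only a convenience sketch; the name connect_to_instance is hypothetical, and it assumes the ec2-user login used above.

```shell
# Hypothetical helper: restricts the key file to owner read-only
# (SSH refuses keys that others can read), then opens the session.
# Arguments: path to the .pem file, instance public IPv4 address.
connect_to_instance() {
  local key_file="$1" public_ip="$2"
  chmod 400 "$key_file" || return 1
  ssh -i "$key_file" "ec2-user@$public_ip"
}
```

For example, connect_to_instance followed by the path to your .pem file and the IPv4 address you copied in step 3.l.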
Step 5 - Download the ImageNet dataset
a) Once inside the instance, create two directories in the home directory: data (with a tf-imagenet subfolder) and dataset. Then move into dataset.
mkdir -p ~/data/tf-imagenet/
mkdir ~/dataset
cd ~/dataset
b) Download the training file (138GB).
Create the train folder and extract the archive into it with the following commands:
mkdir train
tar xvf ILSVRC2012_img_train.tar -C train
In ~/dataset, create the file untar.sh with the following contents:
#!/bin/bash
# Extract each per-class archive into its own subdirectory of train/.
for filename in *.tar; do
  DIRNAME="${filename%.*}"
  echo "$DIRNAME"
  mkdir -p "/home/ec2-user/dataset/train/$DIRNAME"
  tar xvf "$filename" -C "/home/ec2-user/dataset/train/$DIRNAME"
done
Move to the train folder and untar all the files with the script:
cd train
bash ../untar.sh
Remove the *.tar files and return to the dataset folder:
rm *.tar
cd ..
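Before deleting anything else or moving on, a quick sanity check on the extracted training set can save hours later. The ILSVRC2012 training set has 1000 classes, so train should contain 1000 subdirectories. The helper names below are hypothetical; this is a sketch, not part of the original workflow.

```shell
# Hypothetical helper: counts the immediate subdirectories of a folder.
count_class_dirs() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Hypothetical helper: succeeds only if the folder holds exactly
# 1000 class directories, as expected for ILSVRC2012 training data.
check_train_dir() {
  local n
  n=$(count_class_dirs "$1")
  if [ "$n" -eq 1000 ]; then
    echo "OK: 1000 class directories"
  else
    echo "Unexpected: $n class directories (expected 1000)" >&2
    return 1
  fi
}
```

Running check_train_dir ~/dataset/train after the extraction should report OK; any other count suggests an interrupted download or extraction.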
c) Download the validation files (6GB).
Create the validation folder and extract the archive into it with the following commands:
mkdir validation
tar xvf ILSVRC2012_img_val.tar -C validation
d) Download the labels file.
e) Use the image preprocessing script to generate a TFRecord format dataset from the raw ImageNet dataset.
cd ~/examples/horovod/tensorflow/utils
nohup python preprocess_imagenet.py --raw_data_dir=/home/ec2-user/dataset/ --local_scratch_dir=/home/ec2-user/data/ &
f) Use the image resizing script.
cd ~/examples/horovod/tensorflow/utils
nohup python tensorflow_image_resizer.py -d imagenet -i /home/ec2-user/data/train -o /home/ec2-user/data/tf-imagenet/ --subset_name train &
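Both scripts in steps e and f run in the background under nohup, and the resizer reads the output of the preprocessing script, so it matters that each job has finished before you start the next step. One way to keep track is sketched below; run_logged and JOB_PID are hypothetical names, and the log file path is your choice.

```shell
# Hypothetical helper: starts a command under nohup, sends its output
# to a log file, and records the background process ID in JOB_PID.
# Arguments: log file path, then the command and its arguments.
run_logged() {
  local log_file="$1"
  shift
  nohup "$@" > "$log_file" 2>&1 &
  JOB_PID=$!
}
```

For example, start a script with run_logged ~/preprocess.log followed by the python command, then use wait "$JOB_PID" to block until it completes. wait only works in the shell session that started the job; after reconnecting over SSH, tail -f on the log file is the simpler way to watch progress.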
g) Move the validation data generated by the preprocessing script into ~/data/tf-imagenet
mv ~/data/validation/* ~/data/tf-imagenet
h) Finally, the directory ~/data/tf-imagenet must have all the files (train-* and validation-*), without subdirectories. It should look like the following:
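The layout requirement can be checked with a short script instead of by eye. This is a sketch with a hypothetical function name; it only verifies that train-* and validation-* files exist and that no subdirectories remain, not that the files are complete.

```shell
# Hypothetical helper: reports file counts for a dataset directory and
# succeeds only if it is flat and contains both kinds of files.
check_flat_dataset() {
  local dir="$1" subdirs train_files val_files
  subdirs=$(find "$dir" -mindepth 1 -type d | wc -l | tr -d ' ')
  train_files=$(find "$dir" -maxdepth 1 -type f -name 'train-*' | wc -l | tr -d ' ')
  val_files=$(find "$dir" -maxdepth 1 -type f -name 'validation-*' | wc -l | tr -d ' ')
  echo "train files: $train_files, validation files: $val_files, subdirectories: $subdirs"
  [ "$subdirs" -eq 0 ] && [ "$train_files" -gt 0 ] && [ "$val_files" -gt 0 ]
}
```

Run check_flat_dataset ~/data/tf-imagenet; a non-zero exit status means a subdirectory is still present or one of the two file groups is missing.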

Step 6 - Train the model
a) Move to the following folder:
cd ~/examples/horovod/tensorflow
b) Use vim to edit the hosts file
vim hosts
The file must be:
localhost slots=1
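If you prefer not to open an editor, the same file can be written from the shell. The sketch below uses a hypothetical function name. It also assumes that slots is meant to match the number of GPUs on the instance, which the single-GPU p3.2xlarge example suggests but this tutorial does not state outright.

```shell
# Hypothetical helper: writes a one-line hosts file for a single node.
# Arguments: path of the hosts file, number of slots.
# Assumption: slots should equal the number of GPUs to use.
write_hosts_file() {
  local hosts_file="$1" gpu_count="$2"
  printf 'localhost slots=%s\n' "$gpu_count" > "$hosts_file"
}
```

For the p3.2xlarge used here, write_hosts_file hosts 1 produces the file shown above. On a multi-GPU instance, nvidia-smi -L lists one GPU per line, so counting its output lines gives a candidate value.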
c) Now, run the script to start training the model.
./train.sh 1
After a few seconds you will see the results.

d) When finished or canceled, stop the instance.
- Go to EC2 Dashboard in AWS Console.
- Right-click the instance, go to Instance State, and click Stop.
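The instance can also be stopped from your local terminal with the AWS CLI. This is a sketch with a hypothetical function name; it assumes your CLI's region is set to the one where the instance runs, and that you know the instance ID shown in the EC2 console.

```shell
# Hypothetical helper: stops an instance and blocks until EC2 reports
# it as stopped. Argument: the instance ID.
stop_instance() {
  local instance_id="$1"
  aws ec2 stop-instances --instance-ids "$instance_id" &&
    aws ec2 wait instance-stopped --instance-ids "$instance_id"
}
```

A stopped instance no longer accrues compute charges, but its EBS volume is still billed until you delete it.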

Getting Help
Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.