Cloud INSIDER


Training a ResNet-50 ImageNet Model using PyTorch on multiple AWS g4 or p3 Instances

Introduction

This document follows on from a blog post titled "Tutorial: Getting started with a ML training model using AWS & PyTorch", a tutorial that helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, intended for everyone from beginners just starting out to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.

The tutorial below is for people who have determined that a multi-node AWS g4 or p3 instance is right for their machine learning workload.

Prepping the Model

As explained earlier in this series, increasing your GPU node count helps speed up results, which is where the multi-node g4 and p3 instances come in.

The setup process for a multi-node g4 or p3 instance is the same as for a single node: simply choose your instance type and move forward. (In this case you will be selecting one of the following multi-node instances: g4dn.12xlarge, g4dn.metal, p3.8xlarge, p3.16xlarge, or p3dn.24xlarge.)

 

As with the single-node setup, your technology stack will include the following:

  • Model to train: ResNet-50
  • Machine learning framework: TensorFlow
  • Distributed training framework: Horovod
  • Configuration: multi-node, single or multiple GPUs per node
  • Instance: p3.2xlarge or greater
  • AMI: Deep Learning AMI (v33 – Amazon Linux)

Step 1 - Create an AMI image from the Single GPU use case

If you chose to use our pre-staged AMI from the previous training (Tutorial 1), ami-0e22bababb010e6c5 (us-east-1), skip ahead to Step 2 and launch a second instance with the same AMI ID.

a) Go to EC2 Dashboard in AWS Console

b) Right-click on the instance, go to Instance State, and click Stop

c) Enter an image name and click "Create Image"
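For readers who prefer the command line, the console clicks above map roughly to the following AWS CLI calls. This is a sketch: the instance ID and image name are placeholders you must replace with your own values.

```shell
# Hypothetical instance ID -- substitute the ID of your single-GPU instance.
INSTANCE_ID="i-0123456789abcdef0"
IMAGE_NAME="p3-single-gpu-base"

# Stop the instance before imaging, and wait until it is fully stopped.
aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"

# Create the AMI from the stopped instance.
aws ec2 create-image --instance-id "$INSTANCE_ID" --name "$IMAGE_NAME"
```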

Step 2 - Create a new instance from the AMI image you created

a) Click on “Launch Instance”

b) Go to My AMIs and select the image you created (or the pre-baked one)

c) Choose Instance Type: p3.2xlarge instance

d) Configure Instance: Select the default subnet in us-east-1

e) Add storage: Default

f) Add new Tag with:

Key: Name

Value: p3 – Node 2

g) Security Group: Select the security group created previously (in Tutorial 1)

h) Review and Launch the instance

i) Select the existing key pair
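The same launch can be scripted with the AWS CLI. The sketch below assumes the pre-staged AMI from Tutorial 1; the key pair, security group, and subnet IDs are placeholders.

```shell
AMI_ID="ami-0e22bababb010e6c5"        # pre-staged AMI from Tutorial 1 (us-east-1)
KEY_NAME="my-key-pair"                # hypothetical key pair name
SG_ID="sg-0123456789abcdef0"          # hypothetical security group from Tutorial 1
SUBNET_ID="subnet-0123456789abcdef0"  # hypothetical default subnet in us-east-1

# Launch Node 2 from the AMI, tagged so it is easy to find in the console.
aws ec2 run-instances \
  --image-id "$AMI_ID" \
  --instance-type p3.2xlarge \
  --key-name "$KEY_NAME" \
  --security-group-ids "$SG_ID" \
  --subnet-id "$SUBNET_ID" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=p3-node-2}]'
```

The `Name` tag value here uses hyphens instead of the "p3 – Node 2" shown above only to keep the shorthand syntax simple; any name works.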

Step 3 - Create a New Security Group

a) In EC2 Dashboard, go to “Security Groups” and click “Create security group”

b) Add a security group name and description, and set inbound and outbound rules to "All traffic" (convenient for this test; consider tighter rules in production)

c) Press “Create Security Group”

d) Attach this security group to both nodes, p3 – Node 1 and p3 – Node 2

e) Add the new security group without removing the previous one
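Step 3 can also be done from the CLI. A sketch, with a placeholder VPC ID; the group name and description are illustrative:

```shell
VPC_ID="vpc-0123456789abcdef0"   # hypothetical -- replace with your VPC

# Create the group and capture its ID.
SG_ID=$(aws ec2 create-security-group \
  --group-name horovod-cluster \
  --description "All traffic between training nodes" \
  --vpc-id "$VPC_ID" \
  --query GroupId --output text)

# Allow all inbound traffic (IpProtocol=-1 means every protocol and port);
# outbound traffic is allowed by default in a new security group.
aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
```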

Step 4 - Run both instances: p3 - Node 1 and p3 - Node 2

a) Copy the IPv4 Public IP from Node 1

b) And copy the Private IP from Node 2

Step 5 - From your local device, use SCP to copy the .pem certificate created in the previous tutorial to p3 - Node 1

a) Move to the directory where you downloaded the key pairs (*.pem). Throughout, replace the angle-bracketed placeholders with your own information.

cd <key_pair_directory>

b) Copy the key pair to your instance using SCP

scp -i <your .pem filename> <your .pem filename> ec2-user@<your instance IPv4 Public IP>:/home/ec2-user/examples/horovod/tensorflow/

Step 6 - Connect to your first instance (p3 - Node 1)

ssh -i <your .pem filename> ec2-user@<your instance IPv4 Public IP from Node 1>

Step 7 - Train the model

a) Move to the following folder:

cd ~/examples/horovod/tensorflow

b) Use vim to edit the hosts file

vim hosts

The file must contain one line per node, each with its number of GPU slots:

localhost slots=1

<Private IP from Node 2> slots=1
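If you prefer not to edit the file by hand, the same hosts file can be generated from the shell. The Node 2 IP below is a placeholder:

```shell
# Hypothetical private IP of Node 2 -- replace with your own.
NODE2_IP="172.31.10.20"

# Write one line per node: the host address and its number of GPU slots.
cat > hosts <<EOF
localhost slots=1
${NODE2_IP} slots=1
EOF

cat hosts
```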

c) Add the SSH key used by the member instances to the ssh-agent

eval `ssh-agent -s`
ssh-add <your .pem filename>

d) Now, run the script to start training the model

./train.sh 2

e) After a few seconds, you will see results similar to:

Avg Speed: 200

Step 8 - When finished or canceled, stop or terminate the instance

If you want to try more nodes and/or GPUs, modify the hosts file with the number of slots on each node. Then, when you run the script, pass the total number of GPUs

(./train.sh <num of GPUs>)
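The total GPU count passed to the script should equal the sum of the slots in the hosts file. A quick sanity check, assuming the hosts file format shown above (the IP is a placeholder):

```shell
# Example hosts file for 2 nodes with 8 GPUs each.
cat > hosts <<EOF
localhost slots=8
172.31.10.20 slots=8
EOF

# Sum the slots= values; this is the number to pass to train.sh.
TOTAL_GPUS=$(awk -F'slots=' '{sum += $2} END {print sum}' hosts)
echo "$TOTAL_GPUS"    # 16
# ./train.sh "$TOTAL_GPUS"   # commented out: requires the running cluster
```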

Results | Training #1 - ResNet-50 ImageNet Model on Multiple GPUs

Before starting, it is important to note from the NCCL debug output that for instance types other than p3dn.24xlarge, the EFA provider is not supported.
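To see which network provider NCCL selects, enable its debug logging before launching the training run. NCCL_DEBUG is a standard NCCL environment variable; the exact wording of the provider line in the log is an assumption.

```shell
# Standard NCCL knob: print informational debug output during initialization.
export NCCL_DEBUG=INFO

# On a p3dn.24xlarge cluster with EFA configured, the initialization log
# should mention the EFA provider; on other instance types NCCL falls back
# to TCP sockets.
# ./train.sh 16 2>&1 | grep -i "provider"   # commented out: needs the cluster
echo "$NCCL_DEBUG"
```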

Results Test 1

Type: Multi Node – Multi GPU

Number of instances: 2

Instance: p3.8xlarge

  • GPUs: 4× NVIDIA Tesla V100
  • GPU Memory: 64 GiB
  • Network Bandwidth: 10 Gbps

Result: Speed / 50 Steps: ~770

Results Test 2

Type: Single Node – Multi GPU

Number of Instances: 1

Instance: p3.16xlarge

  • GPUs: 8× NVIDIA Tesla V100
  • GPU Memory: 128 GiB
  • Network Bandwidth: 25 Gbps

Result: Speed / 50 Steps: ~910

Results Test 3

Type: Multi Node – Multi GPU

Number of Instances: 2

Instance: p3.16xlarge

  • GPUs: 8× NVIDIA Tesla V100
  • GPU Memory: 128 GiB
  • Network Bandwidth: 25 Gbps

Result: Speed / 50 Steps: ~11500

Results Test 4

Type: Multi Node – Multi GPU

Number of Instances: 4

Instance: p3.16xlarge

  • GPUs: 8× NVIDIA Tesla V100
  • GPU Memory: 128 GiB
  • Network Bandwidth: 25 Gbps

Result: Speed / 50 Steps: ~22500

Results Test 5

Type: Single Node – Multi GPU

Number of Instances: 1

Instance: p3dn.24xlarge

  • GPUs: 8× NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Result: Speed / 50 Steps: ~700

Results Test 6

Type: Multi Node – Multi GPU

Number of Instances: 2

Instance: p3dn.24xlarge

  • GPUs: 8× NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Result: Speed / 50 Steps: ~11180

Results Test 7

Type: Multi Node – Multi GPU

Number of Instances: 2

Instance: p3dn.24xlarge

  • GPUs: 8× NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Now that we are using p3dn instances, let us look at the NCCL debug output to confirm that the EFA provider is enabled:

Result: Speed / 50 Steps: ~11450

Results Test 8

Type: Multi Node – Multi GPU

Number of Instances: 4

Instance: p3dn.24xlarge

  • GPUs: 8× NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Result: Speed / 50 Steps: ~22200

Getting Help

Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.


About the Author: Matthew Brucker
