
Object Detection Training Using Mask R-CNN on AWS p3dn Instances

Introduction

This document follows on from the blog post “Tutorial: Getting started with a ML training model using AWS & PyTorch,” a tutorial that helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, which is intended for everyone from beginners just starting out to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.

The tutorial below is for people who have determined that an AWS p3dn instance is right for their machine learning workload. This document focuses on object detection training using Mask R-CNN.

Prepping the Model

In this tutorial, you will be setting up for object detection training using the AWS p3dn instance (p3dn.24xlarge). 

As in the previous sections, you will need a few components in place before you can start training.

Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation.

To train Mask R-CNN, we are going to use the following stack:

  • ParallelCluster 2.8.1 with an Amazon Linux 2 base AMI
  • CUDA 10.1 
  • MXNet
  • Horovod
  • COCO dataset

If you want to use our prebaked AMI to run the training, please skip ahead to step 6. 

Step 1 - Launch an instance with the AMI you saved in step 6 of tutorial 3, or use our prebaked AMI (ami-03d5c1f8c6b62def5 | us-east-1), and connect to it using SSH.
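If you prefer the command line to the console, a minimal AWS CLI sketch looks like the following; the key pair, security group, and subnet IDs are placeholders that you need to replace with your own values:

# Launch a p3dn.24xlarge from the prebaked AMI in us-east-1 (placeholder IDs)
aws ec2 run-instances \
    --region us-east-1 \
    --image-id ami-03d5c1f8c6b62def5 \
    --instance-type p3dn.24xlarge \
    --key-name my-key \
    --security-group-ids sg-xxxxxxxx \
    --subnet-id subnet-xxxxxxxx

# Connect once the instance is running (use its public IP or DNS name)
ssh -i my-key.pem ec2-user@<instance-public-ip>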

Step 2 - Install MXNet, Horovod, GluonCV, and the packages needed to get the COCO dataset

sudo yum install python3 python3-devel
pip3 install virtualenv --user
cd $HOME/src
virtualenv mask-r-cnn
source mask-r-cnn/bin/activate

pip install mxnet-cu101mkl
export PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/opt/amazon/efa/lib64:/usr/local/cuda/lib64:/home/ec2-user/src/nccl/build/lib:/home/ec2-user/src/aws-ofi-nccl/out/lib:$LD_LIBRARY_PATH

HOROVOD_WITH_MPI=1 HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 \
HOROVOD_WITH_MXNET=1 HOROVOD_GPU_OPERATIONS=NCCL \
HOROVOD_NCCL_INCLUDE=/home/ec2-user/src/nccl/build/include \
HOROVOD_NCCL_LIB=/home/ec2-user/src/nccl/build/lib \
pip install horovod --no-cache-dir

pip install pycocotools
pip install pybind11==2.4.3
git clone --recursive https://github.com/NVIDIA/cocoapi.git
cd cocoapi/PythonAPI/
python setup.py build_ext install
pip uninstall pycocotools

pip install gluoncv
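Before moving on, a quick sanity check can confirm that MXNet and Horovod import cleanly inside the virtualenv; the Horovod call below runs a single-process init, so a size of 1 is expected:

# Verify the installs from inside the mask-r-cnn virtualenv
python -c "import mxnet; print(mxnet.__version__)"
python -c "import horovod.mxnet as hvd; hvd.init(); print(hvd.size())"
nvidia-smi   # confirm the GPUs are visible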

Step 3 - Get the COCO dataset

cd /home/ec2-user/src
git clone --recursive https://github.com/dmlc/gluon-cv.git gluoncv
cd gluoncv/scripts/datasets
python mscoco.py
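The COCO download is large (tens of GB) and can take a while. Assuming the script's default download location, you can confirm the dataset is in place with:

# GluonCV's mscoco.py downloads to ~/.mxnet/datasets/coco by default
ls ~/.mxnet/datasets/coco
# expect to see the annotations, train2017, and val2017 directories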

Step 4 - Create the run files

Run the following commands to create the files needed to run the model:

cat > /home/ec2-user/run_mask.sh << 'EOF'

#!/bin/bash

source /home/ec2-user/mask-r-cnn/bin/activate

INSTANCE_TYPE=`curl http://169.254.169.254/latest/meta-data/instance-type`

# p3dn.24xlarge has 32 GB GPUs, so the batch size (WORLD_SIZE) is doubled

if [ "$INSTANCE_TYPE" == "p3dn.24xlarge" ]; then WORLD_SIZE=$(($WORLD_SIZE * 2)); fi

export PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:/usr/local/cuda/bin:$PATH

export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/opt/amazon/efa/lib64:/usr/local/cuda/lib64:/home/ec2-user/src/nccl/build/lib:/home/ec2-user/src/aws-ofi-nccl/out/lib:$LD_LIBRARY_PATH

python -u /home/ec2-user/src/gluoncv/scripts/instance/mask_rcnn/train_mask_rcnn.py \
--horovod --amp --lr-decay-epoch 8,10 --epochs 12 --log-interval 100 \
--val-interval 1 --batch-size $WORLD_SIZE --use-fpn --lr 0.02 \
--lr-warmup-factor 0.03 --lr-warmup 1000 --static-alloc \
--clip-gradient 1.5 --use-ext
EOF

chmod +x /home/ec2-user/run_mask.sh
cat > /home/ec2-user/mask.slurm << 'EOF'
#!/bin/bash
#SBATCH --job-name=ml-mask-r-cnn
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
SECONDS=0

export HOROVOD_FUSION_THRESHOLD=134217728
export HOROVOD_NUM_STREAMS=2
export MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD=999
export MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD=25
export OMP_NUM_THREADS=2

export HOROVOD_CYCLE_TIME=0.1
export HOROVOD_HIERARCHICAL_ALLREDUCE=0
export HOROVOD_CACHE_CAPACITY=0

export NCCL_MIN_NRINGS=1
export NCCL_TREE_THRESHOLD=4294967296
export NCCL_NSOCKS_PERTHREAD=8
export NCCL_SOCKET_NTHREADS=2
export NCCL_BUFFSIZE=16777216
export HOROVOD_NUM_NCCL_STREAMS=2

export NCCL_NET_GDR_READ=1
export HOROVOD_TWO_STAGE_LOOP=1
export HOROVOD_ALLREDUCE_MODE=1
export HOROVOD_FIXED_PAYLOAD=161
export HOROVOD_MPI_THREADS_DISABLE=1
export MXNET_USE_FUSION=0
export NCCL_DEBUG=INFO

export WORLD_SIZE=$(($SLURM_NNODES*$SLURM_NTASKS_PER_NODE))

echo $WORLD_SIZE

export PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:/usr/local/cuda/bin:$PATH

export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/opt/amazon/efa/lib64:/usr/local/cuda/lib64:/home/ec2-user/src/nccl/build/lib:/home/ec2-user/src/aws-ofi-nccl/out/lib:$LD_LIBRARY_PATH

printf '\n%s:  %s\n\n' "$(date +%T)" "Begin execution"

mpirun  --allow-run-as-root \
        -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
        /home/ec2-user/run_mask.sh

duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
printf '%s:  %s\n\n' "$(date +%T)" "End execution"
EOF
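After running both here-documents, it is worth checking that the two files were written as expected and that run_mask.sh is executable:

ls -l /home/ec2-user/run_mask.sh /home/ec2-user/mask.slurm
head -n 5 /home/ec2-user/run_mask.sh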

Step 5 - Create a pcluster AMI with MXNet and COCO

Once you have finished with the previous step, go to the EC2 console and stop your instance.
When the instance changes its state to 'stopped', select your instance and click on 'Actions -> Image -> Create image'. Set a name for your AMI and click on 'Create image'.
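The same can be done from the AWS CLI; a short sketch, where the instance id is a placeholder for your own:

# Stop the instance, wait for it to stop, then create the AMI
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "mask-rcnn-mxnet-coco"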

Step 6 - Launch a new cluster using ParallelCluster

Assuming you completed the previous parts of this tutorial, you already have AWS ParallelCluster installed.

Modify your pcluster config file, located at '$HOME/.parallelcluster/config', and set 'custom_ami' to your new AMI id.

If you want to use our prebaked AMI for this tutorial, use the following AMI id:

ami-04017adfe9f340cae (us-east-1)
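For reference, the relevant part of a ParallelCluster 2.x config could look something like the snippet below. Only custom_ami and the instance types matter for this tutorial; the other values are illustrative and should match the config you created earlier.

[cluster default]
base_os = alinux2
scheduler = slurm
custom_ami = ami-04017adfe9f340cae
master_instance_type = c5n.18xlarge
compute_instance_type = p3dn.24xlarge

Then create (or recreate) the cluster with 'pcluster create <cluster-name>'.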

Step 7 - Connect to your cluster using ssh
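With ParallelCluster 2.x, the pcluster CLI can open the SSH session for you; the cluster name is whatever you chose when you created the cluster:

pcluster ssh <cluster-name> -i /path/to/your-key.pem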

Step 8 - Run the mask-r-cnn model

To run the model, open the script 'mask.slurm' and adjust the following values to match your cluster properties:

to change the number of GPUs used per node, change the flag --ntasks-per-node=8
to change the number of nodes, change --nodes=2

For reference, these are the possible values for --ntasks-per-node:

p3.2xlarge (1 Tesla V100 16 GB)     --ntasks-per-node=1
p3.8xlarge (4 Tesla V100 16 GB)     --ntasks-per-node=4
p3.16xlarge (8 Tesla V100 16 GB)    --ntasks-per-node=8
p3dn.24xlarge (8 Tesla V100 32 GB)  --ntasks-per-node=8

Then just run:

$ sbatch mask.slurm

If you want to see the output of the log:

$ tail -f slurm-2.out

(replace 2 with the correct Slurm job id; by default sbatch writes its output to slurm-<jobid>.out)
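You can also check the state of the job and find its id with the standard Slurm tools:

# List queued and running jobs with their ids
squeue
# Show accounting information for a finished job (if accounting is enabled)
sacct -j <jobid>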

Step 9 - Using p3dn instances with EFA

To enable EFA in your cluster, edit the pcluster config file again and add the following settings:

placement_group = DYNAMIC
enable_efa = compute
master_instance_type = c5n.18xlarge
compute_instance_type = p3dn.24xlarge

Once you have launched your new cluster with EFA enabled, you can run a few tests to confirm that EFA is working:

    $ /opt/amazon/efa/bin/fi_info -p efa
    $ cd /home/ec2-user/src/nccl-tests/
    $ LD_LIBRARY_PATH=/home/ec2-user/src/aws-ofi-nccl/out/lib/:/home/ec2-user/src/nccl/build/lib/:/usr/local/cuda/lib64:$LD_LIBRARY_PATH NCCL_DEBUG=info ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

The last step is to add the EFA flags to the mask.slurm file:

    export FI_PROVIDER="efa"
    export FI_EFA_TX_MIN_CREDITS=64

The complete mask.slurm file now looks like this:

#!/bin/bash
#SBATCH --job-name=ml-mask-r-cnn
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
SECONDS=0

export HOROVOD_FUSION_THRESHOLD=134217728
export HOROVOD_NUM_STREAMS=2
export MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD=999
export MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD=25
export OMP_NUM_THREADS=2

export HOROVOD_CYCLE_TIME=0.1
export HOROVOD_HIERARCHICAL_ALLREDUCE=0
export HOROVOD_CACHE_CAPACITY=0

export NCCL_MIN_NRINGS=1
export NCCL_TREE_THRESHOLD=4294967296
export NCCL_NSOCKS_PERTHREAD=8
export NCCL_SOCKET_NTHREADS=2
export NCCL_BUFFSIZE=16777216
export HOROVOD_NUM_NCCL_STREAMS=2

export NCCL_NET_GDR_READ=1
export HOROVOD_TWO_STAGE_LOOP=1
export HOROVOD_ALLREDUCE_MODE=1
export HOROVOD_FIXED_PAYLOAD=161
export HOROVOD_MPI_THREADS_DISABLE=1
export MXNET_USE_FUSION=0
export NCCL_DEBUG=INFO

export WORLD_SIZE=$(($SLURM_NNODES*$SLURM_NTASKS_PER_NODE))

echo $WORLD_SIZE

export PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:/usr/local/cuda/bin:$PATH

export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/opt/amazon/efa/lib64:/usr/local/cuda/lib64:/home/ec2-user/src/nccl/build/lib:/home/ec2-user/src/aws-ofi-nccl/out/lib:$LD_LIBRARY_PATH

export FI_PROVIDER="efa"
export FI_EFA_TX_MIN_CREDITS=64

printf '\n%s:  %s\n\n' "$(date +%T)" "Begin execution"

mpirun  --allow-run-as-root \
        -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
        /home/ec2-user/run_mask.sh

duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
printf '%s:  %s\n\n' "$(date +%T)" "End execution"

Step 10 - The Results

These are the samples per second measured in different scenarios:

2 x p3.8xlarge (8 GPU)     batch_size=8   ->  33.000 samples/s

4 x p3.16xlarge (32 GPU)   batch_size=32   ->  120.000 samples/s

2 x p3dn.24xlarge (16 GPU) batch_size=16   ->  91.000 samples/s

2 x p3dn.24xlarge (16 GPU) batch_size=32   ->  111.000 samples/s

4 x p3dn.24xlarge (32 GPU) batch_size=64   ->  210.000 samples/s

We can see that two p3dn.24xlarge instances, making use of the enhanced 100 Gbps EFA adapter, get almost the same samples per second (111k) as four p3.16xlarge instances (120k).

Given that a p3dn.24xlarge is about 30% more expensive than a p3.16xlarge, using p3dn instances for Mask R-CNN training looks like the right choice whether you are optimizing for time or for cost.
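As a quick back-of-the-envelope check using the 30% premium quoted above: if a p3.16xlarge costs P per hour, a p3dn.24xlarge costs roughly 1.3P. Four p3.16xlarge instances then cost 4P per hour, while two p3dn.24xlarge instances cost 2 x 1.3P = 2.6P, about 35% less for nearly the same throughput (111k vs 120k samples/s).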

Getting Help

Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.


About the Author: Matthew Brucker
