Training a BERT Fine Tuning Model using PyTorch on multiple AWS p3 Instances

Introduction

This document follows on from a blog post titled “Tutorial: Getting started with a ML training model using AWS & PyTorch”, a tutorial that helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, intended for everyone from beginners just starting out to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.

The tutorial below is for readers who have determined that a multi-node cluster of AWS p3 instances is right for their machine learning workload. In this tutorial, we will prepare a BERT fine tuning model.

Prepping the Model

In this tutorial we will focus on a large model and make use of Amazon EFA (Elastic Fabric Adapter) to accelerate distributed training. We will use a cluster of p3dn instances to take advantage of EFA and the 8 GPUs per instance.

We will be using RoBERTa, which iterates on BERT’s pretraining procedure: training the model longer, with bigger batches, over more data; removing the next-sentence-prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.

To pretrain RoBERTa we are going to use the following stack:

  • ParallelCluster 2.8.1 Amazon Linux 2 base AMI
  • CUDA 10.1
  • PyTorch
  • Fairseq
  • WikiText-103 dataset

Note: If you do not want to bake your own AMI and would rather use our prebaked AMI, skip ahead to Step 11.

Step 1 - Launch a p3.2xlarge instance with the ParallelCluster 2.8.1 Amazon Linux 2 AMI (ami-0aa1704150b2ea203 in us-east-1)
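
If you prefer the command line, the equivalent launch with the AWS CLI looks roughly like the sketch below; the key pair, security group, and subnet IDs are placeholders you must replace with your own:

# launch one p3.2xlarge from the ParallelCluster 2.8.1 Amazon Linux 2 AMI
aws ec2 run-instances \
  --region us-east-1 \
  --image-id ami-0aa1704150b2ea203 \
  --instance-type p3.2xlarge \
  --count 1 \
  --key-name your-key-pair \
  --security-group-ids sg-xxxxxxxx \
  --subnet-id subnet-xxxxxxxx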

Step 2 - Connect to the instance through ssh
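
For example (the key file name and public IP are placeholders; Amazon Linux AMIs use the ec2-user account):

ssh -i ~/.ssh/your-key-pair.pem ec2-user@<instance-public-ip>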

Step 3 - Install CUDA 10.1

mkdir -p $HOME/src
cd $HOME/src
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
chmod +x cuda_10.1.243_418.87.00_linux.run
sudo ./cuda_10.1.243_418.87.00_linux.run --silent --override --toolkit --samples --toolkitpath=/usr/local/cuda-10.1 --samplespath=/usr/local/cuda
sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda
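
As a quick sanity check before moving on, you can verify the toolkit and driver:

/usr/local/cuda/bin/nvcc --version   # should report CUDA release 10.1
nvidia-smi                           # should list the instance's V100 GPU(s)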

Step 4 - Install cuDNN

cd $HOME/src
wget http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/libcudnn7-7.6.5.32-1.cuda10.1.x86_64.rpm
wget http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/libcudnn7-devel-7.6.5.32-1.cuda10.1.x86_64.rpm
sudo yum install ./libcudnn7-7.6.5.32-1.cuda10.1.x86_64.rpm
sudo yum install ./libcudnn7-devel-7.6.5.32-1.cuda10.1.x86_64.rpm
sudo cp /usr/include/cudnn.h /usr/local/cuda/include
sudo cp /usr/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
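
You can confirm the cuDNN version that was copied into the CUDA tree (cuDNN 7.x keeps its version macros in cudnn.h):

grep -A 2 '#define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h   # expect 7 / 6 / 5 for cuDNN 7.6.5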

Step 5 - Install nccl, nccl-tests, aws-ofi-nccl

cd $HOME/src
git clone https://github.com/NVIDIA/nccl.git
cd nccl/
git checkout 3701130b3c1bcdb01c14b3cb70fe52498c1e82b7
sudo make -j$(nproc) src.build CUDA_HOME=/usr/local/cuda/
sudo make install
export NCCL_HOME=$(pwd)/build
sudo yum -y install rpm-build rpmdevtools
sudo make pkg.redhat.build

 

cd $HOME/src
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
sudo make
LD_LIBRARY_PATH=/home/ec2-user/src/nccl/build/lib/:/usr/local/cuda/lib64:$LD_LIBRARY_PATH NCCL_DEBUG=info ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

cd $HOME/src
git clone https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
git checkout 48a17d3e1e7a2b22c8c32c864541126ff4a995c7
mkdir -p out
./autogen.sh
echo $NCCL_HOME
./configure --prefix=$(pwd)/out --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda/ --with-nccl=$NCCL_HOME --with-mpi=/opt/amazon/openmpi/
sudo make -j$(nproc)
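
EFA itself is only present on EFA-capable instance types (such as p3dn.24xlarge), so the following check will not succeed on the p3.2xlarge build instance; once your cluster is up, you can verify that libfabric sees the EFA provider:

/opt/amazon/efa/bin/fi_info -p efa   # should list the 'efa' fabric provider on EFA-enabled nodes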

Step 6 - Save AMI

Once you have finished step 5, go to the EC2 console and follow the steps to create an AMI. Your SSH connection will be closed; wait a few seconds, then connect to your instance again.

Select your instance, click on ‘Actions -> Image -> Create image’, set a name for your AMI, and click on ‘Create image’.

You have now set up the basics to use most frameworks alongside AWS ParallelCluster, CUDA 10.1, NCCL and the AWS EFA NCCL plugin.

We will use this as a checkpoint; the next tutorial will use this AMI as its starting point.

Step 7 - Installing packages (PyTorch, Fairseq, Apex, PyArrow)

sudo yum install python3 python3-devel
pip3 install virtualenv --user
mkdir -p $HOME/src
cd $HOME/src
virtualenv ml
source ml/bin/activate
pip install torch==1.4.0
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ..
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
pip install pyarrow
cd $HOME
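
Before moving on, it is worth checking that the virtualenv's PyTorch build can see the GPUs:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"   # expect: 1.4.0 True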

Step 8 - Getting the dataset

mkdir -p $HOME/data
cd $HOME/data
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
unzip wikitext-103-v1.zip

cd $HOME/src/fairseq/examples/language_model/

chmod +x preprocess_wiki.sh
./preprocess_wiki.sh

# After this script runs, you should have a 'data-bin' folder in $HOME/data
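
If you want to confirm the binarization worked, the destination directory should contain a dictionary plus binarized splits; the exact path depends on where preprocess_wiki.sh writes its output:

ls data-bin/wikitext-103   # expect dict.txt plus train/valid/test .bin and .idx files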

Step 9 - Run Files

cat > roberta_large.sh <<'EOF'
#!/bin/bash

source /home/ec2-user/src/ml/bin/activate

export NCCL_DEBUG=INFO
export FI_PROVIDER=efa
export FI_OFI_RXR_RX_COPY_UNEXP=1
export FI_OFI_RXR_RX_COPY_OOO=1
export FI_EFA_MR_CACHE_ENABLE=1
export FI_OFI_RXR_INLINE_MR_ENABLE=1
export NCCL_TREE_THRESHOLD=0
#export

# WORLD_SIZE represents the total number of GPUs in your cluster. For two p3.8xlarge instances (4 GPUs each), the value is 8.
WORLD_SIZE=8
DIST_PORT=12234
BUCKET_CAP_MB=200
DATABIN=data-bin/wikitext-103
OUTDIR=out
TOTAL_UPDATE=500000

# per-GPU batch settings, sized for 16 GB V100s (e.g. p3.2xlarge)
MAX_SENTENCES=8
UPDATE_FREQ=1
TOKENS_PER_SAMPLE=128

mkdir -p $OUTDIR   # tee needs the output directory to exist before training starts

python /home/ec2-user/src/fairseq/train.py $DATABIN --save-dir $OUTDIR \
  --memory-efficient-fp16 \
  --fast-stat-sync \
  --num-workers 2 \
  --task masked_lm \
  --criterion masked_lm \
  --arch roberta_large \
  --sample-break-mode complete \
  --tokens-per-sample $TOKENS_PER_SAMPLE \
  --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 \
  --lr 0.0006 --lr-scheduler polynomial_decay --warmup-updates 24000 \
  --total-num-update $TOTAL_UPDATE \
  --max-update $TOTAL_UPDATE \
  --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-sentences $MAX_SENTENCES \
  --update-freq $UPDATE_FREQ \
  --skip-invalid-size-inputs-valid-test \
  --seed 1 \
  --log-format json --log-interval 25 \
  --distributed-world-size $WORLD_SIZE --distributed-port $DIST_PORT --bucket-cap-mb $BUCKET_CAP_MB 2>&1 | tee $OUTDIR/train.${SLURM_NODEID}.log
EOF

chmod +x roberta_large.sh

# Edit --nodes if you launch a cluster with more nodes. In this example, ntasks-per-node is one.
cat > roberta.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=roberta_large
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
SECONDS=0

printf '\n%s:  %s\n\n' "$(date +%T)" "Begin execution"
srun ./roberta_large.sh

duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
EOF

Step 10 - pcluster AMI for RoBERTa

Once you have finished step 9, go to the EC2 console and stop your instance.

When the instance changes its state to ‘stopped’, select your instance, click on ‘Actions -> Image -> Create image’, set a name for your AMI, and click on ‘Create image’.

Step 11 - Launch a new cluster using ParallelCluster

Assuming you completed the previous parts of this tutorial, you already have AWS ParallelCluster installed.

Modify your pcluster config file, located at ‘$HOME/.parallelcluster/config’, and replace ‘custom_ami’ with your new AMI ID.

To enable EFA in your cluster, you must edit your pcluster config file again and add the following settings:

placement_group=DYNAMIC
enable_efa=compute
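
Putting it together, the cluster section of a ParallelCluster 2.8.1 config might look like the following sketch; the AMI ID and key name are placeholders for your own values, and your existing vpc_settings section stays as it is:

[cluster default]
scheduler = slurm
base_os = alinux2
custom_ami = ami-xxxxxxxxxxxxxxxxx   # your AMI from Step 10
key_name = your-key-pair
master_instance_type = c5.xlarge
compute_instance_type = p3dn.24xlarge
initial_queue_size = 2
max_queue_size = 2
placement_group = DYNAMIC
enable_efa = compute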

If you want to use our prebaked AMI for this tutorial, use the following AMI id:

ami-03d5c1f8c6b62def5 (us-east-1)

Step 12 - Connect to your cluster’s master node using ssh
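
The pcluster CLI will resolve the master node's address for you; the cluster name and key file below are placeholders:

pcluster ssh yourcluster -i ~/.ssh/your-key-pair.pem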

Step 13 - Run the RoBERTa model

To adapt the scripts to the cluster you launched, consider which instance types you used and how many GPUs they have.

Open the script ‘roberta.slurm’ and set the following directive to match your cluster:

To change the number of nodes: --nodes=2

Open the script ‘roberta_large.sh’ and set the following variable to match your cluster properties (the values below are per node):

p3.2xlarge (1 Tesla V100 16 GB)     WORLD_SIZE=1

p3.8xlarge (4 Tesla V100 16 GB)     WORLD_SIZE=4

p3.16xlarge (8 Tesla V100 16 GB)    WORLD_SIZE=8

p3dn.24xlarge (8 Tesla V100 32 GB)  WORLD_SIZE=8

That per-node number should be multiplied by the number of nodes: WORLD_SIZE represents the total number of GPUs in your cluster. For example, two p3dn.24xlarge nodes give WORLD_SIZE = 2 x 8 = 16.

Then just run:

$ sbatch roberta.slurm

If you want to follow the log output:

$ tail -f slurm-2.out (replace 2 with the correct Slurm job ID)
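
While the job runs, you can also check its state with standard Slurm commands:

$ squeue                  # list pending and running jobs and their node allocations
$ scontrol show job 2     # detailed information for one job (replace 2 with your job ID)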

Step 14 - Results

Here are the words per second (wps) measured in different scenarios:

2 x p3.8xlarge     (8 GPUs)  tokens_per_sample=128 update_freq=1   -> 25,500 wps

4 x p3.16xlarge   (32 GPUs)  tokens_per_sample=256 update_freq=16  -> 198,000 wps

2 x p3dn.24xlarge (16 GPUs)  tokens_per_sample=512 update_freq=16  -> 156,000 wps

4 x p3dn.24xlarge (32 GPUs)  tokens_per_sample=512 update_freq=16  -> 310,000 wps

Conclusion

We can see that the 4 p3dn.24xlarge instances, making use of the enhanced 100 Gbps EFA adapter, achieve roughly a 57% increase in wps over 4 p3.16xlarge instances (310,000 vs. 198,000 wps) for only about a 30% increase in cost. The gap, already visible at 4 nodes, grows as the node count increases: the extra inter-node communication hits the p3.16xlarge instances, which do not have an EFA adapter, the hardest.

Getting Help

Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model, or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.


About the Author: Matthew Brucker
