
Training a BERT Fine Tuning Model using PyTorch on multiple AWS p3 Instances


This document follows on from a blog post titled "Tutorial: Getting started with a ML training model using AWS & PyTorch", a tutorial that helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, intended for everyone from beginners just starting out to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.

The tutorial below is for people who have determined that a multi-node AWS p3 instance cluster is right for their machine learning workload. In this tutorial, we will prepare a BERT fine-tuning model.

Prepping the Model

In this tutorial we will be focusing on a large model, and we will be making use of Amazon EFA (Elastic Fabric Adapter) to accelerate distributed training. We will use a cluster of p3dn instances to take advantage of EFA and of the 8 GPUs per instance.

We will be using RoBERTa. RoBERTa iterates on BERT's pretraining procedure: training the model longer, with bigger batches, over more data; removing the next-sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.

As with the previous sections, you will need the prerequisites described there. To pretrain RoBERTa, we are going to use the following stack:

  • ParallelCluster 2.8.1 Amazon Linux 2 base AMI
  • CUDA 10.1
  • PyTorch
  • Fairseq
  • WikiText-103 dataset

Note: If you do not want to bake your own AMI and would rather use our prebaked AMI, skip ahead to step 11.

Step 1 - Launch a p3.2xlarge instance with the ParallelCluster 2.8.1 Amazon Linux 2 AMI, ami-0aa1704150b2ea203 for us-east-1

Step 2 - Connect to the instance through ssh

Step 3 - Install CUDA 10.1

mkdir -p $HOME/src
cd $HOME/src
chmod +x
sudo ./ --silent --override --toolkit --samples --toolkitpath=/usr/local/cuda-10.1 --samplespath=/usr/local/cuda
sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda

Step 4 - Install cuDNN

cd $HOME/src
sudo yum install ./libcudnn7-
sudo yum install ./libcudnn7-devel-
sudo cp /usr/include/cudnn.h /usr/local/cuda/include
sudo cp /usr/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Step 5 - Install nccl, nccl-tests, aws-ofi-nccl

cd $HOME/src
git clone
cd nccl/
git checkout 3701130b3c1bcdb01c14b3cb70fe52498c1e82b7
sudo make -j$(nproc) CUDA_HOME=/usr/local/cuda/
sudo make install
export NCCL_HOME=$(pwd)/build
sudo yum -y install rpm-build rpmdevtools
sudo make


cd $HOME/src
git clone
cd nccl-tests/
sudo make
LD_LIBRARY_PATH=/home/ec2-user/src/nccl/build/lib/:/usr/local/cuda/lib64:$LD_LIBRARY_PATH NCCL_DEBUG=info ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

cd $HOME/src
git clone
cd aws-ofi-nccl
git checkout 48a17d3e1e7a2b22c8c32c864541126ff4a995c7
mkdir -p out
./configure --prefix=$(pwd)/out --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda/ --with-nccl=$NCCL_HOME --with-mpi=/opt/amazon/openmpi/
sudo make -j$(nproc)

Step 6 - Save AMI

Once you have finished step 5, go to the EC2 console and follow the steps to create an AMI. Your SSH connection will be closed; wait a few seconds before reconnecting to your instance.

Select your instance, then click 'Actions -> Image -> Create image'. Set a name for your AMI and click 'Create image'.
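The same AMI can be created from the AWS CLI instead of the console. A minimal sketch, where the instance ID and AMI name are placeholders you must replace with your own; the command is echoed for review rather than executed, so it can be checked before running:

```shell
# Hypothetical values -- replace with your actual instance ID and a name of your choice.
INSTANCE_ID="i-0123456789abcdef0"
AMI_NAME="pcluster-cuda101-nccl-efa"

# Build the create-image command and print it for review;
# paste the printed command into your shell to actually run it.
CMD="aws ec2 create-image --instance-id $INSTANCE_ID --name $AMI_NAME --description 'ParallelCluster + CUDA 10.1 + NCCL + EFA plugin'"
echo "$CMD"
```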

You have now set up the basics to use most frameworks alongside AWS ParallelCluster, CUDA 10.1, NCCL and the AWS EFA NCCL plugin.

We will use this as a checkpoint. The next tutorial will use this AMI as its starting point.

Step 7 - Installing packages (PyTorch, Fairseq, Apex, PyArrow)

sudo yum install python3 python3-devel
pip3 install virtualenv --user
mkdir -p $HOME/src
cd $HOME/src
virtualenv ml
source ml/bin/activate
pip install torch==1.4.0
git clone
cd fairseq
pip install --editable ./
cd ..
git clone
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
pip install pyarrow
cd $HOME

Step 8 - Getting the dataset

mkdir -p data

cd $HOME/src/fairseq/examples/language_model/

chmod +x

# After this script runs, you should have a 'data-bin' folder in $HOME/data

Step 9 - Run Files

cat > <<EOF

source /home/ec2-user/src/ml/bin/activate

export FI_PROVIDER=efa

# WORLD_SIZE represents the total number of GPUs in your cluster. For 2 p3.2xlarge instances (1 GPU each), the value is 2.

# for P3.2xlarge

python /home/ec2-user/src/fairseq/ $DATABIN --save-dir $OUTDIR \
  --memory-efficient-fp16 \
  --fast-stat-sync \
  --num-workers 2 \
  --task masked_lm \
  --criterion masked_lm \
  --arch roberta_large \
  --sample-break-mode complete \
  --tokens-per-sample $TOKENS_PER_SAMPLE \
  --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 \
  --lr 0.0006 --lr-scheduler polynomial_decay --warmup-updates 24000 \
  --total-num-update $TOTAL_UPDATE \
  --max-update $TOTAL_UPDATE \
  --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-sentences $MAX_SENTENCES \
  --update-freq $UPDATE_FREQ \
  --skip-invalid-size-inputs-valid-test \
  --seed 1 \
  --log-format json --log-interval 25 \
  --distributed-world-size $WORLD_SIZE --distributed-port $DIST_PORT --bucket-cap-mb $BUCKET_CAP_MB 2>&1 | tee $OUTDIR/train.${SLURM_NODEID}.log

chmod +x

# Edit the number of nodes if you launch a cluster with more nodes. In this example, ntasks-per-node is 1.
cat > roberta.slurm <<EOF
#SBATCH --job-name=roberta_large
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

printf '\n%s:  %s\n\n' "$(date +%T)" "Begin execution"

echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
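The 'duration' variable used by the echo above is not defined in the snippet. A minimal sketch of how the elapsed time is presumably measured, with a sleep standing in for the actual srun training command:

```shell
# Record wall-clock seconds before and after the workload.
start=$(date +%s)
sleep 1   # placeholder for the actual srun/training command
end=$(date +%s)
duration=$((end - start))
echo "$((duration / 60)) minutes and $((duration % 60)) seconds elapsed."
```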

Step 10 - pcluster AMI for RoBERTa

Once you have finished with step 9, go to the EC2 console and stop your instance.

When the instance's state changes to 'stopped', select your instance and click 'Actions -> Image -> Create image'. Set a name for your AMI and click 'Create image'.

Step 11 - Launch a new cluster using ParallelCluster. Assuming you completed the previous parts of this tutorial, you already have AWS ParallelCluster installed.

Modify your pcluster config file, located at '$HOME/.parallelcluster/config', and replace 'custom_ami' with your new AMI ID.

To enable EFA in your cluster, you must edit your pcluster config file again and add the following settings:
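As an illustration, a ParallelCluster 2.x cluster section with EFA enabled might look like the sketch below. The AMI ID and section name are placeholders; 'enable_efa = compute' and a placement group are the settings that actually turn EFA on:

```ini
[cluster default]
base_os = alinux2
custom_ami = ami-xxxxxxxxxxxxxxxxx
compute_instance_type = p3dn.24xlarge
placement_group = DYNAMIC
enable_efa = compute
```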


If you want to use our prebaked AMI for this tutorial, use the following AMI id:

ami-03d5c1f8c6b62def5 (us-east-1)

Step 12 - Connect to your cluster’s master node using ssh

Step 13 - Run the RoBERTa model

To adapt the scripts to the cluster you launched, we must consider which instance types you used and how many GPUs they have.

Open the script 'roberta.slurm' and set the following variable to match your cluster properties:

To change the number of nodes: --nodes=2

Open the script '' and set the following variable to match your cluster properties:

p3.2xlarge (1 Tesla V100 16 GB)     WORLD_SIZE=1

p3.8xlarge (4 Tesla V100 16 GB)     WORLD_SIZE=4

p3.16xlarge (8 Tesla V100 16 GB)    WORLD_SIZE=8

p3dn.24xlarge (8 Tesla V100 32 GB)  WORLD_SIZE=8

Multiply that per-node number by the number of nodes; WORLD_SIZE represents the total number of GPUs in your cluster.
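For example, the arithmetic for a two-node p3dn.24xlarge cluster:

```shell
# WORLD_SIZE = GPUs per node x number of nodes
NODES=2
GPUS_PER_NODE=8   # p3dn.24xlarge has 8 V100 GPUs
WORLD_SIZE=$((NODES * GPUS_PER_NODE))
echo "WORLD_SIZE=$WORLD_SIZE"   # WORLD_SIZE=16
```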

Then just run:

$ sbatch roberta.slurm

If you want to see the log output:

$ tail -f slurm-2.log (replace 2 with the correct Slurm job ID)

Step 14 - Results

Here are the words per second (wps) measured in different scenarios:

2 x p3.8xlarge     (8 GPUs)  tokens_per_sample=128 update_freq=1   -> 25,500  wps

4 x p3.16xlarge   (32 GPUs)  tokens_per_sample=256 update_freq=16  -> 198,000 wps

2 x p3dn.24xlarge (16 GPUs)  tokens_per_sample=512 update_freq=16  -> 156,000 wps

4 x p3dn.24xlarge (32 GPUs)  tokens_per_sample=512 update_freq=16  -> 310,000 wps
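A quick back-of-the-envelope normalization of the two 32-GPU runs above to per-GPU throughput (integer division):

```shell
# Words per second divided by GPU count for the two 32-GPU configurations.
echo "p3.16xlarge:   $((198000 / 32)) wps/GPU"
echo "p3dn.24xlarge: $((310000 / 32)) wps/GPU"
```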

Step 15 - Analysis

We can see that 4 p3dn.24xlarge instances, making use of the enhanced 100 Gbps EFA adapter, achieve over 50% more WPS (310,000 vs. 198,000) than 4 p3.16xlarge instances, with only about a 30% increase in cost. The difference we see with only 4 nodes becomes more pronounced as the node count increases: growing inter-node communication hurts the p3.16xlarge instances, which lack an EFA adapter, the most.

Getting Help

Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model, or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at

About Author: Matthew Brucker
