Training a BERT Fine Tuning Model using PyTorch on a single AWS p3 Instance
Introduction
This document follows on from the blog post "Tutorial: Getting started with a ML training model using AWS & PyTorch," a tutorial that helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, which is intended for everyone from beginners just starting out to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.
This tutorial is for readers who have determined that a single-node AWS p3 instance is right for their machine learning workload. In it, we will prepare a BERT fine-tuning model.
Prepping the Model
In the previous tutorials we used ResNet-50, which lends itself to smaller instance types. With BERT, the starting point is a single-node p3 instance, scaling up to multiple nodes as the need for faster results grows.
As with the previous sections, you will need the following:
- An AWS Account with console and programmatic access
- An AWS Command Line Interface (CLI) with credentials set up
Your technology stack will include the following:
- Model to train: BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- Machine learning framework: TensorFlow
- Distributed training framework: Horovod
- Compute: CPU and GPU (multi-GPU, multi-node)
- Instance: p3.2xlarge or greater.
- AWS ParallelCluster.
- AMI: we will bake our own.
In this case, instead of using a pre-staged AMI, we will bake our own so we can use the AWS HPC tool, AWS ParallelCluster. If you want to learn how to build the AMI, follow the next section. If you want to use our public prebaked AMI (ami-0f1d56be258ac95e7, us-east-1), you can skip the following section.
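If you plan to use the public AMI, you can confirm it is visible from your account before going further. The check below is only a sketch using the standard AWS CLI describe-images call; the AMI ID is the one listed above.
# Confirm the prebaked AMI exists and is available in us-east-1
aws ec2 describe-images \
  --image-ids ami-0f1d56be258ac95e7 \
  --region us-east-1 \
  --query 'Images[0].{Name:Name,State:State}'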
AMI BAKING
We will repeat the steps from training 1 to launch a new EC2 instance (p3.2xlarge), but instead of selecting the Deep Learning AMI we will use this one:

aws-parallelcluster-2.8.1-amzn2-hvm-x86_64-202008022100 – ami-0aa1704150b2ea203 – us-east-1
When the instance is ready, connect to it as we did in training 1:
ssh -i <your .pem filename> ec2-user@<your instance IPv4 Public IP>
This time we will need to install our software stack:
- CUDA 10.0
- cuDNN for CUDA 10.0
- NCCL for CUDA 10.0
- aws-ofi-nccl (EFA)
- Python env for CPU training
- Python env for GPU training
Step 1 - CUDA 10.0
This ParallelCluster AMI already has the NVIDIA drivers installed, but it comes with CUDA 11.0. Since we will be using tensorflow-gpu 1.14.0, we need to install CUDA 10.0.
mkdir -p ~/src
cd ~/src
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
chmod +x cuda_10.0.130_410.48_linux
sudo ./cuda_10.0.130_410.48_linux --silent --override --toolkit --samples --toolkitpath=/usr/local/cuda-10.0 --samplespath=/usr/local/cuda --no-opengl-lib
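Once the installer finishes, a quick sanity check (a sketch, assuming the toolkit path used above) confirms that the 10.0 toolkit is the one being picked up:
export PATH=/usr/local/cuda-10.0/bin:$PATH
# Should report "Cuda compilation tools, release 10.0"
nvcc --version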
Step 2 - cuDNN for CUDA 10.0
cd ~/src
wget http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/libcudnn7-7.6.5.32-1.cuda10.0.x86_64.rpm
wget http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/libcudnn7-devel-7.6.5.32-1.cuda10.0.x86_64.rpm
sudo yum install ./libcudnn7-7.6.5.32-1.cuda10.0.x86_64.rpm
sudo yum install ./libcudnn7-devel-7.6.5.32-1.cuda10.0.x86_64.rpm
sudo cp /usr/include/cudnn.h /usr/local/cuda-10.0/include
sudo cp /usr/lib64/libcudnn* /usr/local/cuda-10.0/lib64
sudo chmod a+r /usr/local/cuda-10.0/include/cudnn.h /usr/local/cuda-10.0/lib64/libcudnn*
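As an optional check (assuming the header was copied to the path above), the cuDNN version macros should report 7.6.5:
# Prints CUDNN_MAJOR / CUDNN_MINOR / CUDNN_PATCHLEVEL (expected 7 / 6 / 5)
grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda-10.0/include/cudnn.h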
Step 3 - NCCL for CUDA 10.0
cd ~/src
git clone https://github.com/NVIDIA/nccl.git
cd nccl
#checkout to nccl v2.6.4-1 (cuda 10.0)
git checkout b221128ecacf4ce1b3054172b9f30163307042c5
sudo make -j$(nproc) src.build CUDA_HOME=/usr/local/cuda-10.0/
sudo make install
export NCCL_HOME=$(pwd)/build
# Install tools to create rpm packages
sudo yum -y install rpm-build rpmdevtools
# Build NCCL rpm package
sudo make pkg.redhat.build
ls build/pkg/rpm/
Step 4 - (OPTIONAL) NCCL tests
cd ~/src
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
sudo make CUDA_HOME=/usr/local/cuda-10.0 NCCL_HOME=$NCCL_HOME
LD_LIBRARY_PATH=/home/ec2-user/src/nccl/build/lib/:/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH NCCL_DEBUG=info ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
Step 5 - aws-ofi-nccl
cd /home/ec2-user/src
git clone https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
git checkout aws
mkdir -p out
./autogen.sh
./configure --prefix=$(pwd)/out --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda-10.0/ --with-nccl=$NCCL_HOME --with-mpi=/opt/amazon/openmpi/
cd tests/
make
# If building the tests fails, remove them from the Makefile in the root folder
cd ..
sudo make -j$(nproc)
sudo make install
Step 6 - Python env for CPU training
cd ~
pip3 install virtualenv
virtualenv cpu-ml -p python3
source cpu-ml/bin/activate
pip install tensorflow==1.14.0
export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH
HOROVOD_WITH_MPI=1 \
HOROVOD_WITHOUT_GLOO=1 \
HOROVOD_WITH_TENSORFLOW=1 \
HOROVOD_WITHOUT_PYTORCH=1 \
HOROVOD_WITHOUT_MXNET=1 \
pip install horovod --no-cache-dir
deactivate
Step 7 - Python env for GPU training
virtualenv gpu-ml -p python3
source gpu-ml/bin/activate
pip install tensorflow-gpu==1.14.0
export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH
HOROVOD_WITH_MPI=1 \
HOROVOD_WITHOUT_GLOO=1 \
HOROVOD_WITH_TENSORFLOW=1 \
HOROVOD_WITHOUT_PYTORCH=1 \
HOROVOD_WITHOUT_MXNET=1 \
HOROVOD_GPU_OPERATIONS=NCCL \
HOROVOD_NCCL_INCLUDE=/home/ec2-user/src/nccl/build/include \
HOROVOD_NCCL_LIB=/home/ec2-user/src/nccl/build/lib \
pip install horovod --no-cache-dir
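Before baking the AMI, it is worth confirming that this Horovod build picked up MPI, TensorFlow, and NCCL. Horovod's built-in check prints what it was compiled with:
# Lists the frameworks, controllers, and tensor operations Horovod was built with
horovodrun --check-build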
When you finish installing the software, you will have to bake your AMI:
- Go to the AWS console.
- Go to EC2.
- Select the instance you used to install your software.
- Click "Actions" at the top.
- Select "Image", then "Create Image".
- Enter your AMI name and select "Create Image".
- In the EC2 console, go to the AMIs tab and wait until your AMI has finished being created.
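If you prefer the CLI, the same AMI can be created from your local machine. This is only a sketch: the instance ID, AMI name, and resulting AMI ID are placeholders you must substitute.
# Create an AMI from the instance you just configured
aws ec2 create-image --instance-id <your-instance-id> --name <your-ami-name> --region us-east-1
# Block until the new AMI reaches the "available" state
aws ec2 wait image-available --image-ids <your-new-ami-id> --region us-east-1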
STARTING YOUR CLUSTER
Now that the AMI you have created is ready, we will proceed to create a cluster.
(If you choose to use our prebaked AMI, set custom_ami to the AMI ID noted in the comment in the following script.)
On your local machine, run:
pip3 install aws-parallelcluster
mkdir -p ~/.parallelcluster
cat > ~/.parallelcluster/config <<EOF
[aws]
aws_region_name = us-east-1
[global]
cluster_template = default
update_check = true
sanity_check = true
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[cluster default]
key_name = ML-Benchmark
base_os = alinux
scheduler = slurm
master_instance_type = c5.2xlarge
compute_instance_type = p3.2xlarge
initial_queue_size = 2
maintain_initial_size = true
max_queue_size = 2
vpc_settings = default
# prebaked alternative: ami-0f1d56be258ac95e7
custom_ami = <your-custom-ami-from-previous-step>
master_root_volume_size=1024
compute_root_volume_size=1024
[vpc default]
vpc_id = <your-default-vpc-id in us-east-1>
master_subnet_id = <one-of-your-default-subnets-id in us-east-1>
EOF
We have installed AWS ParallelCluster and set up a config file ready to launch a cluster to run our BERT model. In that config file you must fill in information specific to your account:
- the AMI ID you created above (or the prebaked AMI ID)
- the VPC ID of your default VPC in us-east-1
- any subnet ID of your default VPC in us-east-1
If you decide to launch a bigger cluster (more instances, or bigger instances), you only need to change the following (an illustrative fragment follows this list):
- compute_instance_type
- initial_queue_size and max_queue_size
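For example, a larger cluster for the same workload might use the overrides below in the [cluster default] section. This is only an illustration; the instance type and queue sizes are placeholders, not a recommendation.
# Illustrative [cluster default] overrides for a larger cluster
compute_instance_type = p3.16xlarge
initial_queue_size = 4
max_queue_size = 4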
Now, to start your cluster just run:
pcluster create <your-cluster-name>
When that process finishes, it will show the public IP of your master instance. The cluster has three nodes in total: one master node and two compute (worker) nodes. To connect to your cluster, you connect to the master node first.
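Cluster creation takes several minutes. The same aws-parallelcluster package provides commands to watch progress and to connect; the ssh alias defined in the config above forwards extra arguments to ssh, so passing your key file as shown here is an assumption about how you manage your key:
# Watch the CloudFormation status of the cluster
pcluster status <your-cluster-name>
# Connect to the master node once the cluster is ready
pcluster ssh <your-cluster-name> -i <your .pem filename>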
Because you created your cluster with ParallelCluster, SLURM is already installed and configured. SLURM is the HPC scheduler we will use to run our scripts across multiple nodes.
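If SLURM is new to you, two commands cover most of what this tutorial needs, both available on the master node:
# List the compute nodes that ParallelCluster registered with SLURM
sinfo
# List queued and running jobs
squeue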
If you are using the prebaked AMI, skip to step 5.
Once you have connected to the master node, we need to set up BERT:
Step 1 - Download the source code and the glue dataset:
git clone https://github.com/abditag2/bert
cd bert
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
pip3 install requests
python3 download_glue_data.py --data_dir glue_data --tasks all
Step 2 - Create run_bert_gpu.sh
cd
cat > run_bert_gpu.sh << 'EOF'
#!/bin/bash
source /home/ec2-user/gpu-ml/bin/activate
export BERT_BASE_DIR=/home/ec2-user/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/home/ec2-user/bert/glue_data
export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH
python /home/ec2-user/bert/run_classifier.py \
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/MRPC \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=4.0 \
--output_dir=/shared/mrpc_output/ \
--use_multi_gpu=true
EOF
chmod +x run_bert_gpu.sh
Step 3 - Create bert.slurm
cat > bert.slurm << 'EOF'
#!/bin/bash
#SBATCH --job-name=ml-bert
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
SECONDS=0
printf '\n%s: %s\n\n' "$(date +%T)" "Begin execution"
export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH
mpirun -np 2 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
/home/ec2-user/run_bert_gpu.sh
duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
printf '%s: %s\n\n' "$(date +%T)" "End execution"
EOF
Step 4 - Create run_bert_cpu.sh
cat > run_bert_cpu.sh << 'EOF'
#!/bin/bash
source /home/ec2-user/cpu-ml/bin/activate
export BERT_BASE_DIR=/home/ec2-user/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/home/ec2-user/bert/glue_data
export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH
python /home/ec2-user/bert/run_classifier.py \
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/MRPC \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=4.0 \
--output_dir=/shared/mrpc_output/
EOF
chmod +x run_bert_cpu.sh
Step 5 - On the Master Node:
nohup ~/run_bert_cpu.sh &
tail -f nohup.out
To run BERT on GPUs across 2 nodes (2 GPUs in total), submit the SLURM job:
sbatch bert.slurm
Changing parameters in bert.slurm:
- To change the number of GPUs used per node, change --ntasks-per-node=1.
- To change the number of nodes, change --nodes=2.
- The -np value passed to mpirun must equal --nodes multiplied by --ntasks-per-node (see the sketch below).
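For example, if each compute node had 8 GPUs (such as p3dn.24xlarge), a 2-node job would use the header and launch line sketched below; the rest of bert.slurm stays the same. This is only an illustration of how the numbers relate.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
# -np must equal nodes * ntasks-per-node = 2 * 8 = 16
mpirun -np 16 -bind-to none -map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -mca pml ob1 -mca btl ^openib \
  /home/ec2-user/run_bert_gpu.sh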
To check the output of sbatch bert.slurm, tail the SLURM log file; the number in the file name is the job ID that sbatch reported:
tail -f slurm-2.out
USING AWS EFA
For our model training to take advantage of EFA, we need to launch a cluster with AWS ParallelCluster using EFA-enabled instances. For this we will use p3dn.24xlarge instances as the worker nodes and a c5n.18xlarge as the master node.
cat > ~/.parallelcluster/config <<EOF
[aws]
aws_region_name = us-east-1
[global]
cluster_template = default
update_check = true
sanity_check = true
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[cluster default]
key_name = ML-Benchmark
base_os = alinux
scheduler = slurm
master_instance_type = c5n.18xlarge
compute_instance_type = p3dn.24xlarge
initial_queue_size = 2
maintain_initial_size = true
max_queue_size = 2
vpc_settings = default
custom_ami = <your-custom-ami-from-previous-step>
master_root_volume_size=1024
compute_root_volume_size=1024
enable_efa = compute
placement_group = DYNAMIC
[vpc default]
vpc_id = <your-default-vpc-id in us-east-1>
master_subnet_id = <one-of-your-default-subnets-id in us-east-1>
EOF
Now, to launch your EFA-enabled cluster, run:
pcluster create efa-cluster
Once you have access to your master node, you can check that EFA is enabled by running:
/opt/amazon/efa/bin/fi_info -p efa
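To have the BERT training job itself pick up EFA, the mpirun line in bert.slurm can additionally export libfabric's provider selection. This is a sketch, not the verified configuration: it assumes the aws-ofi-nccl plugin built earlier is on LD_LIBRARY_PATH, and FI_PROVIDER=efa is libfabric's standard way of requesting the EFA provider.
mpirun -np 2 \
  -bind-to none -map-by slot \
  -x FI_PROVIDER=efa \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -mca pml ob1 -mca btl ^openib \
  /home/ec2-user/run_bert_gpu.sh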
Getting Help
Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.