
Training a BERT Fine Tuning Model using PyTorch on a single AWS p3 Instance

Introduction

This document builds on a blog post titled "Tutorial: Getting started with a ML training model using AWS & PyTorch," a tutorial that helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, which is intended for everyone from beginners to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.

This tutorial is for readers who have determined that a single-node AWS p3 instance is right for their machine learning workload. In it, we will prepare a BERT fine-tuning model.

Prepping the Model

In the previous tutorials we used ResNet-50, which lends itself to smaller instance types. With BERT, the starting point is a single-node p3 instance, scaling up to multiple nodes in proportion to how quickly you want results.

As in the previous sections, your technology stack will include the following:

  • Model to train: BERT-Base, Uncased (12 layers, 768 hidden units, 12 heads, 110M parameters)
  • Machine learning framework: TensorFlow
  • Distributed training framework: Horovod
  • Hardware: CPU and GPU (multi-GPU, multi-node)
  • Instance: p3.2xlarge or larger
  • AWS ParallelCluster
  • AMI: we will bake our own

In this case, instead of using a pre-staged AMI, we will bake our own so we can use the AWS HPC tool AWS ParallelCluster. If you want to learn how to build the AMI, follow the next section. If you want to use our public baked AMI (ami-0f1d56be258ac95e7 – us-east-1), you can skip the following section.

AMI BAKING

We will need to repeat the steps from Training 1 to launch a new EC2 instance (p3.2xlarge), but instead of selecting the Deep Learning AMI, we will use this one:

aws-parallelcluster-2.8.1-amzn2-hvm-x86_64-202008022100 – ami-0aa1704150b2ea203 – us-east-1

When the instance is ready, connect to it as we did in Training 1:

ssh -i <your .pem filename> ec2-user@<your instance IPv4 Public IP>

This time we will need to install our software stack:

  • CUDA 10.0
  • cuDNN for CUDA 10.0
  • NCCL for CUDA 10.0
  • AWS-ofi-nccl (EFA)
  • Python env for CPU training
  • Python env for GPU training

Step 1 - CUDA 10.0

This ParallelCluster AMI already has the NVIDIA drivers installed but comes with CUDA 11.0. Since we will be using tensorflow-gpu 1.14.0, we need to install CUDA 10.0.

mkdir -p ~/src

cd ~/src

wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux

chmod +x cuda_10.0.130_410.48_linux

sudo ./cuda_10.0.130_410.48_linux --silent --override --toolkit --samples --toolkitpath=/usr/local/cuda-10.0 --samplespath=/usr/local/cuda --no-opengl-lib
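(Optional) To confirm the toolkit landed where the later steps expect it, a quick sanity check is to ask nvcc for its version (it should report release 10.0) and make sure the driver still sees the GPU:

/usr/local/cuda-10.0/bin/nvcc --version

nvidia-smi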

Step 2 - cuDNN for CUDA 10.0

cd ~/src

wget http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/libcudnn7-7.6.5.32-1.cuda10.0.x86_64.rpm

wget http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/libcudnn7-devel-7.6.5.32-1.cuda10.0.x86_64.rpm

sudo yum install ./libcudnn7-7.6.5.32-1.cuda10.0.x86_64.rpm

sudo yum install ./libcudnn7-devel-7.6.5.32-1.cuda10.0.x86_64.rpm

sudo cp /usr/include/cudnn.h /usr/local/cuda-10.0/include

sudo cp /usr/lib64/libcudnn* /usr/local/cuda-10.0/lib64

sudo chmod a+r /usr/local/cuda-10.0/include/cudnn.h /usr/local/cuda-10.0/lib64/libcudnn*
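(Optional) To verify that the cuDNN headers were copied correctly, you can grep the version macros out of cudnn.h; for this package they should report version 7.6.5:

grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda-10.0/include/cudnn.h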

Step 3 - NCCL for CUDA 10.0

cd ~/src

git clone https://github.com/NVIDIA/nccl.git

cd nccl

# Check out NCCL v2.6.4-1 (CUDA 10.0)

git checkout b221128ecacf4ce1b3054172b9f30163307042c5

sudo make -j$(nproc) src.build CUDA_HOME=/usr/local/cuda-10.0/

sudo make install

export NCCL_HOME=$(pwd)/build

# Install tools to create rpm packages

sudo yum -y install rpm-build rpmdevtools

# Build NCCL rpm package

sudo make pkg.redhat.build

ls build/pkg/rpm/
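(Optional) The libraries were already installed under /usr/local by the make install above; if you also want NCCL installed system-wide from the rpm you just built, something like the following should work (the exact subdirectory and filename under build/pkg/rpm/ depend on the NCCL version, so adjust to what the ls above shows):

sudo rpm -ivh build/pkg/rpm/x86_64/libnccl*.rpm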

Step 4 - (OPTIONAL) NCCL tests

cd ~/src

git clone https://github.com/NVIDIA/nccl-tests.git

cd nccl-tests

sudo make

LD_LIBRARY_PATH=/home/ec2-user/src/nccl/build/lib/:/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH NCCL_DEBUG=info ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

Step 5 - aws-ofi-nccl

cd ~/src

git clone https://github.com/aws/aws-ofi-nccl.git

cd aws-ofi-nccl

git checkout aws

mkdir -p out

./autogen.sh

./configure --prefix=$(pwd)/out --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda-10.0/ --with-nccl=$NCCL_HOME --with-mpi=/opt/amazon/openmpi/

cd tests/

make

# If building the tests fails, remove them from the root folder Makefile

cd ..

sudo make -j$(nproc)

sudo make install
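(Optional) The plugin is installed into the prefix we passed to configure; a quick check that the NCCL network plugin library is in place (libnccl-net.so is the name current aws-ofi-nccl builds produce, so treat the exact filename as an assumption):

ls ~/src/aws-ofi-nccl/out/lib/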

Step 6 - Python env for CPU training

pip3 install virtualenv

virtualenv cpu-ml -p python3
source cpu-ml/bin/activate

pip install tensorflow==1.14.0

export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH

export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH

HOROVOD_WITH_MPI=1 \
HOROVOD_WITHOUT_GLOO=1 \
HOROVOD_WITH_TENSORFLOW=1 \
HOROVOD_WITHOUT_PYTORCH=1 \
HOROVOD_WITHOUT_MXNET=1 \
pip install horovod --no-cache-dir

deactivate
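(Optional) If you want to confirm the Horovod build before moving on, re-activate the environment and run Horovod's build check, which prints the frameworks and controllers it was compiled with (horovodrun ships with the horovod package):

source cpu-ml/bin/activate

horovodrun --check-build

deactivate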

Step 7 - Python env for GPU training

virtualenv gpu-ml -p python3

source gpu-ml/bin/activate

pip install tensorflow-gpu==1.14.0    

export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH

export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH

HOROVOD_WITH_MPI=1 \
HOROVOD_WITHOUT_GLOO=1 \
HOROVOD_WITH_TENSORFLOW=1 \
HOROVOD_WITHOUT_PYTORCH=1 \
HOROVOD_WITHOUT_MXNET=1 \
HOROVOD_GPU_OPERATIONS=NCCL \
HOROVOD_NCCL_INCLUDE=/home/ec2-user/src/nccl/build/include \
HOROVOD_NCCL_LIB=/home/ec2-user/src/nccl/build/lib \
pip install horovod --no-cache-dir
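(Optional) In the still-active gpu-ml environment, a quick way to confirm that tensorflow-gpu can see the GPU and that Horovod initializes on top of it (tf.test.is_gpu_available() is the TensorFlow 1.x API; hvd.init() and hvd.size() are standard Horovod calls):

python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"

python -c "import horovod.tensorflow as hvd; hvd.init(); print('horovod size:', hvd.size())"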

When you finish installing the software, you will have to bake your AMI (a CLI alternative is shown after the list):

  1. Go to the AWS console.

  2. Go to EC2.

  3. Select the instance you used to install your software.

  4. Click "Actions" at the top.

  5. Select "Image".

  6. Select "Create image".

  7. Enter your AMI name and select "Create image".

  8. In the EC2 console, go to the AMIs tab and wait until your AMI has been created.
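If you prefer the CLI, the same thing can be done with a single aws ec2 create-image call; the instance ID and AMI name below are placeholders you would replace with your own:

aws ec2 create-image --instance-id <your-instance-id> --name "bert-parallelcluster-ami" --description "ParallelCluster AMI with CUDA 10.0, cuDNN, NCCL, aws-ofi-nccl and Horovod"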

STARTING YOUR CLUSTER

Now that the AMI you have created is ready, we will proceed to create a cluster.

(If you choose to use our prebaked AMI, use its ID, shown as a comment on the custom_ami line below, in place of your own.)

On your local machine, run:

pip3 install aws-parallelcluster

mkdir -p ~/.parallelcluster

cat > ~/.parallelcluster/config <<EOF

[aws]

aws_region_name = us-east-1

[global]

cluster_template = default

update_check = true

sanity_check = true

[aliases]

ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]

key_name = ML-Benchmark

base_os = alinux

scheduler = slurm

master_instance_type = c5.2xlarge

compute_instance_type = p3.2xlarge

initial_queue_size = 2

maintain_initial_size = true

max_queue_size = 2

vpc_settings = default

custom_ami = <your-custom-ami-from-previous-step> #ami-0f1d56be258ac95e7

master_root_volume_size=1024

compute_root_volume_size=1024

[vpc default]

vpc_id = <your-default-vpc-id in us-east-1>

master_subnet_id = <one-of-your-default-subnets-id in us-east-1>

EOF

We have installed AWS ParallelCluster and set up a config file ready to launch a cluster to run our BERT model. In that config file you must fill in information specific to your account:

  • the AMI ID created in the previous step

  • the VPC ID of your default VPC in us-east-1

  • any subnet ID from your default VPC in us-east-1

If you decide to launch a bigger cluster (more instances, or bigger instances), simply change the following (an example follows the list):

  • compute_instance_type

  • initial_queue_size and max_queue_size
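For example, a minimal sketch of the lines you would change for four p3.8xlarge workers (illustrative values, not a sizing recommendation):

compute_instance_type = p3.8xlarge
initial_queue_size = 4
max_queue_size = 4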

Now, to start your cluster just run:

pcluster create <your-cluster-name>

When that process finishes, it will print the public IP of your master instance. We have created three nodes: one master node and two worker nodes. To connect to your cluster, you first connect to the master node.

Because you created your cluster with ParallelCluster, Slurm is already installed and configured. Slurm is the HPC scheduler we will use to run our scripts across multiple nodes.
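For example, with ParallelCluster 2.x you can open the SSH session and confirm that both compute nodes have registered with Slurm; pcluster ssh and sinfo are both part of the tooling we have already installed:

pcluster ssh <your-cluster-name> -i <your .pem filename>

sinfo   # both p3.2xlarge compute nodes should be listed, typically in the idle state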

If you are using the prebaked AMI, skip to Step 5.

Once you have connected to the master node, we need to set up BERT:

Step 1 - Download the source code and the glue dataset:

git clone https://github.com/abditag2/bert

cd bert

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

unzip uncased_L-12_H-768_A-12.zip

wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py

pip3 install requests

python3 download_glue_data.py --data_dir glue_data --tasks all

Step 2 - Create run_bert_gpu.sh

cd
cat > run_bert_gpu.sh << 'EOF'
#!/bin/bash
source /home/ec2-user/gpu-ml/bin/activate

export BERT_BASE_DIR=/home/ec2-user/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/home/ec2-user/bert/glue_data
export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH

python /home/ec2-user/bert/run_classifier.py \
    --task_name=MRPC \
    --do_train=true \
    --do_eval=true \
    --data_dir=$GLUE_DIR/MRPC \
    --vocab_file=$BERT_BASE_DIR/vocab.txt \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
    --max_seq_length=128 \
    --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=4.0 \
    --output_dir=/shared/mrpc_output/ \
    --use_multi_gpu=true
EOF
chmod +x run_bert_gpu.sh

Step 3 - Create bert.slurm

cat > bert.slurm << 'EOF'
#!/bin/bash
#SBATCH --job-name=ml-bert
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

SECONDS=0
printf '\n%s:  %s\n\n' "$(date +%T)" "Begin execution"

export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH

mpirun -np 2 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    /home/ec2-user/run_bert_gpu.sh

duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
printf '%s:  %s\n\n' "$(date +%T)" "End execution"
EOF

Step 4 - Create run_bert_cpu.sh

cat > run_bert_cpu.sh << 'EOF'
#!/bin/bash
source /home/ec2-user/cpu-ml/bin/activate

export BERT_BASE_DIR=/home/ec2-user/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/home/ec2-user/bert/glue_data
export PATH=/opt/amazon/openmpi/bin/:/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda-10.0/lib64:/home/ec2-user/src/nccl/build/lib:$LD_LIBRARY_PATH

python /home/ec2-user/bert/run_classifier.py \
    --task_name=MRPC \
    --do_train=true \
    --do_eval=true \
    --data_dir=$GLUE_DIR/MRPC \
    --vocab_file=$BERT_BASE_DIR/vocab.txt \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
    --max_seq_length=128 \
    --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=4.0 \
    --output_dir=/shared/mrpc_output/
EOF
chmod +x run_bert_cpu.sh

Step 5 - On the Master Node

To run BERT on CPU on the master node:

nohup ~/run_bert_cpu.sh &

tail -f nohup.out

To run BERT on GPU across 2 nodes (2 GPUs in total), submit the Slurm job:

sbatch bert.slurm

Change parameters in bert.slurm 

To change the number of GPUs used per node, change the flag --ntasks-per-node.

To change the number of nodes, change --nodes.

The -np value passed to mpirun must equal --nodes multiplied by --ntasks-per-node; an example follows.
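For example, to run on two nodes with four GPUs each (a p3.8xlarge has four GPUs), the relevant lines in bert.slurm would become (illustrative values):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

mpirun -np 8 \   # 2 nodes x 4 tasks per node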

To check the results of sbatch bert.slurm, tail the job's output file (named slurm-<job-id>.out), for example:

tail -f slurm-2.out
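You can also watch the job itself with the standard Slurm commands that ParallelCluster set up:

squeue                       # queued and running jobs

scontrol show job <job-id>   # details for a specific job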

Making Use of AWS EFA

For our model training to take advantage of EFA, we need to launch a cluster with AWS ParallelCluster using EFA-enabled instances. For this we will use p3dn.24xlarge instances as the worker nodes and a c5n.18xlarge as the master node.

cat > ~/.parallelcluster/config <<EOF

[aws]

aws_region_name = us-east-1

[global]

cluster_template = default

update_check = true

sanity_check = true

[aliases]

ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]

key_name = ML-Benchmark

base_os = alinux

scheduler = slurm

master_instance_type = c5n.18xlarge

compute_instance_type = p3dn.24xlarge

initial_queue_size = 2

maintain_initial_size = true

max_queue_size = 2

vpc_settings = default

custom_ami = <your-custom-ami-from-previous-step>

master_root_volume_size=1024

compute_root_volume_size=1024
enable_efa = compute
placement_group = DYNAMIC


[vpc default]

vpc_id = <your-default-vpc-id in us-east-1>

master_subnet_id = <one-of-your-default-subnets-id in us-east-1>

EOF

Now, to launch your EFA-enabled cluster, just run:

pcluster create efa-cluster

Once you have access to your master node, you can test whether EFA is available by running:

/opt/amazon/efa/bin/fi_info -p efa
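Since enable_efa = compute attaches the EFA device to the worker nodes rather than the master, it is worth running the same check on the compute fleet as well; a minimal sketch using srun (part of the Slurm install) is:

srun -N 2 /opt/amazon/efa/bin/fi_info -p efa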

Getting Help

Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.


About the Author: Matthew Brucker
