Multi-node LLM Training on AMD GPUs

TL;DR

  • Pushing the limits of training LLMs at enterprise data scale.
  • Multi-node training enables data parallelism, which speeds up training across multiple nodes and GPUs.
  • Can't wait to make your training 1000x or even 10,000x faster? Please contact us at info@lamini.ai
  • We're hiring! Join us: https://jobs.lever.co/laminiai

Training an LLM on an enterprise's private data can take months or years.

You are an executive at a Fortune 500 company or a government agency. Recently, one of your competitors released an LLM, while another announced its LLM roadmap. You can't fall behind, so you hired a Head of AI and assembled a top-notch engineering team to develop your own LLM.

After your team spent weeks learning the ins and outs of LLM development and curating a massive training dataset comprising millions or even billions of data points, it was time to train.

However, after one week, the training progress had only reached 2%, far behind the projected timeline. Your team worked tirelessly but couldn't make the training go any faster. Will it ever finish? You worry about losing ground and jeopardizing your business.

We have the answer—training 1000x or even 10,000x faster with multi-node training. In this blog post, we’ll go through the challenges and process of setting up multi-node training on AMD GPUs.

Training on multiple nodes (each has 8 GPUs), showing the MPI ranks

Enabling Data Parallelism with multi-node training, scaling from 1 to 1000s of GPUs

Data parallelism is a paradigm that accelerates the training of large models by sharding the data across multiple compute nodes and GPUs.

We use multi-node training to enable data parallelism. Each GPU hosts its own replica of the model, which is fed a different shard of the data. During the forward and backward passes, the samples in each batch are processed independently by the respective GPU. However, before updating the weights and optimizer states, the gradients and losses computed by each GPU need to be synchronized across all GPUs.
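
As a concrete illustration, here is a minimal sketch of one data-parallel rank in PyTorch. The names train_dataset, model, optimizer, rank, world_size, and per_gpu_batch_size are placeholders, and allreduce_gradients is sketched later in this post.

from torch.utils.data import DataLoader, DistributedSampler

# Each rank iterates over its own disjoint shard of the dataset.
sampler = DistributedSampler(train_dataset, num_replicas=world_size,
                             rank=rank, shuffle=True)
loader = DataLoader(train_dataset, batch_size=per_gpu_batch_size, sampler=sampler)

for batch in loader:
    loss = model(**batch).loss     # forward pass on this rank's shard
    loss.backward()                # backward pass computes local gradients
    allreduce_gradients(model)     # synchronize gradients across all GPUs (see below)
    optimizer.step()
    optimizer.zero_grad()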

For this synchronization step, you can use popular collective algorithms like AllReduce. In our setup at Lamini, we leverage the Ring AllReduce algorithm via our own collective communication library built on top of rCCL and MPI to efficiently synchronize the gradients and losses across multiple nodes and GPUs.

How can a ring be parallel?  Aren’t trees more efficient?  AllReduce is a collective operation in MPI, which has been optimized on supercomputers and researched intensely for decades.  MPI_Reduce computes a reduction, e.g., a SUM or a MEAN, across values stored on each MPI rank. An MPI_AllReduce computes the reduction and also broadcasts the result to every MPI rank.  

A simple AllReduce algorithm starts with each rank sending its data to the main rank 0, which performs the reduction. This doesn't scale well because all of the data traverses the link connecting to rank 0, creating a bottleneck that gets worse as more ranks are added. In a Ring AllReduce, each rank sends data to its neighbor, which reduces it with its local copy. Once the data has been sent all the way around the ring twice, each rank has a fully reduced copy. The data crosses each link only twice, instead of N times for N ranks in the naive algorithm. The figure shows the ring algorithm, with all of the links being used simultaneously. Ironically, a ring, which seems sequential, uses every link in parallel. It's link-level parallelism instead of GPU parallelism.
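
To make the two trips around the ring concrete, here is a small pure-Python simulation of a ring AllReduce. It is illustrative only; libraries such as rCCL perform the same chunked reduce-scatter and allgather, but with all transfers overlapped on the hardware links.

def ring_allreduce(buffers):
    """Simulated ring AllReduce (sum). buffers[r] is rank r's data, split
    into n chunks (one number per chunk here, for simplicity)."""
    n = len(buffers)
    # Phase 1, reduce-scatter: at step s, rank r passes chunk (r - s) % n to its
    # neighbor, which adds it to its own copy. After n - 1 steps, rank r holds
    # the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for rank in range(n):
            c = (rank - step) % n
            buffers[(rank + 1) % n][c] += buffers[rank][c]
    # Phase 2, allgather: each rank keeps forwarding its freshest chunk, so after
    # another n - 1 steps every rank holds every fully reduced chunk.
    for step in range(n - 1):
        for rank in range(n):
            c = (rank + 1 - step) % n
            buffers[(rank + 1) % n][c] = buffers[rank][c]
    return buffers

# 4 ranks, each starting with its rank number in every chunk:
print(ring_allreduce([[r] * 4 for r in range(4)]))   # every rank ends with [6, 6, 6, 6]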

Challenges of setting up multi-node training on AMD GPUs in production

As those who have stayed up all night for three days straight staring at MPI init logs know, enabling multi-node data-parallel training on AMD GPUs has some unique challenges. None of this complexity is inherent to the problem. Data parallelism is conceptually simple: the LLM is replicated on each GPU, and the replicas exchange their gradients after each training step. The gradient exchange can be implemented by a single operation, AllReduce (an MPI collective), as shown in the pseudo-code below:

for param in model.parameters():
    allreduce(param.grad.data)
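
In practice, that allreduce maps onto a library call. Here is a minimal sketch of the same loop using torch.distributed, whose nccl backend is backed by rCCL on ROCm builds of PyTorch; it fills in the allreduce_gradients placeholder used earlier and assumes the process group has already been initialized.

import torch.distributed as dist

def allreduce_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient across all ranks, then average it.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size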

While data parallelism may seem straightforward, we have found that existing frameworks such as DeepSpeed, Megatron-LM, and PyTorch Distributed abstract away the core collective operations and network software stack, complicating performance optimization and debugging.

In this blog post, we peel back the abstractions to explain how to build a simple, high-performance, and fully functional data-parallel training stack on AMD GPUs.

Lamini’s Technical Stack for Multi-node Training on AMD GPUs

An example of the stack: 2 nodes, 4 GPUs each

Although the AllReduce operation is conceptually simple, the software stack is highly complex. The above figure shows some of the layers used to implement data parallelism.  At Lamini, we have needed to modify every single layer to ensure that distributed training works reliably and with high performance.  Here, we describe several of the important layers of the stack.

AMD GPUs

For training, we use a cluster of AMD Instinct MI250 or MI300X nodes. Consider the MI250. Each node has 8 GPUs, each with 64 GB of HBM. The GPUs within a node are interconnected by Infinity Fabric links running at 100 GB/s. The intra-node topology supports two ring networks running at 100 GB/s bidirectional. This enables an AllReduce of a 7B-parameter LLM in fp32 within the node in less than 100 ms. The GPUs perform the bulk of the forward and backward propagation for model training using kernels in rocBLAS. AMD also provides an optimized collective library, rCCL, that is API-compatible with NCCL.
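
As a quick sanity check of this topology, a ROCm build of PyTorch exposes the node's GPUs through the familiar torch.cuda API. A sketch (the reported device names and memory depend on the ROCm and PyTorch versions):

import torch

# On ROCm builds of PyTorch, the AMD GPUs appear as torch.cuda devices.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB HBM")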

SLURM

Lamini uses SLURM for our training-job scheduling and resource allocation. SLURM (the Simple Linux Utility for Resource Management - definitely not the fictional intergalactic beverage from Futurama) is a popular open-source job scheduling system used in high-performance computing (HPC) environments. It is designed to manage and allocate resources such as GPUs, CPUs, memory, and storage to multiple jobs running on a cluster of computers.

SLURM provides a simple and flexible interface for job submission, monitoring, and management. It allows users to submit jobs in batch mode, where jobs are queued and executed in the order they are received, or in interactive mode, where users can submit and monitor jobs interactively.

SLURM automatically partitions the tasks across the available nodes in the cluster and manages the communication between the tasks using MPI. This allows you to run large-scale MPI jobs on a cluster of computers, taking advantage of the parallel processing power to speed up the computation.

SLURM supports a wide range of features, including job dependency management, resource allocation, job prioritization, monitoring, and cancellation.
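
For example, a training process launched with srun can derive its place in the job from SLURM's environment and then initialize the collective backend. A minimal sketch, assuming MASTER_ADDR and MASTER_PORT are exported by the batch script:

import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks in the job
local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl",       # rCCL on ROCm builds of PyTorch
                        rank=rank, world_size=world_size)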

Lamini Scheduler

Jobs that are submitted to Lamini are sized appropriately for the target machine. Training a small model on a small amount of data can be completed quickly on a single GPU, so Lamini will push a small job into the SLURM queue. However, a very large model trained on a large amount of data may take weeks or months to complete on a single GPU, so Lamini will push a large job with many nodes working together into the SLURM queue. The Lamini scheduler also provides quality-of-service (QoS) guarantees so that customers who have dedicated GPU resources get access to them quickly.
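
For illustration only, a hypothetical sizing heuristic in this spirit might look like the following. The formula, thresholds, and limits are invented for the example and are not Lamini's actual policy.

def size_job(params_billions, tokens_billions, gpus_per_node=8, max_nodes=16):
    """Hypothetical job-sizing heuristic: bigger model x more data -> more nodes."""
    work = params_billions * tokens_billions   # crude proxy for total compute
    if work < 1:                               # small model on a small dataset
        return {"nodes": 1, "gpus_per_node": 1}
    nodes = min(max_nodes, max(1, round(work ** 0.5)))
    return {"nodes": nodes, "gpus_per_node": gpus_per_node}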

Hugging Face/PyTorch

The Lamini training pipeline integrates with the model loaders provided by Hugging Face so that we can support any Hugging Face model. We have replaced the Hugging Face Trainer code with our custom implementation of a distributed model training loop with optimized collectives.
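
As a sketch of what that integration looks like, loading any Hugging Face model and handing it to a custom data-parallel loop might look like this. The model name and hyperparameters are placeholders, not Lamini's configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"          # placeholder; any Hugging Face model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# The training loop itself is the data-parallel step sketched earlier:
# forward and backward on this rank's shard of tokenized, labeled batches,
# allreduce_gradients(model), then optimizer.step().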

Lamini Collectives

Lamini compresses the gradients with low-precision and entropy-based compression before they are passed off to the AllReduce collectives. It also selects the appropriate collective topology for the deployed machine.
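
The details of Lamini's compression are beyond the scope of this post, but as a flavor of the low-precision piece, here is a sketch that allreduces gradients in bfloat16 and restores them to full precision afterwards. The entropy-coding stage and topology selection are omitted.

import torch
import torch.distributed as dist

def allreduce_gradients_lowprec(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        g16 = param.grad.to(torch.bfloat16)       # halve the bytes sent over the wire
        dist.all_reduce(g16, op=dist.ReduceOp.SUM)
        param.grad.copy_(g16.to(param.grad.dtype) / world_size)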

And of course, there's more!

How we deploy multi-node training on AMD GPUs

An example deployment of the stack: 2 nodes, 4 GPUs each

Lamini packages the entire distributed training stack (ROCm, rCCL, MPI, UCX, SLURM, MUNGE, PMI, PyTorch, the Lamini Trainer, etc.) into a Docker container, shown as a finetuning worker in the above figure. The finetuning worker is scaled horizontally across GPU servers. The finetuning workers run the slurmd process to join the cluster. A control plane runs the Lamini Scheduler, which integrates with slurmctld to manage the training job queue and launch jobs on the finetuning workers.

Lamini includes software to automatically configure the components including the network transport layer and network interfaces.
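
For a sense of what that configuration involves, these are the kinds of settings that must be pinned down per cluster. The environment variables below are real knobs for rCCL/NCCL, UCX, and OpenMP, but the values are only examples; the right ones depend on the fabric and drivers.

import os

os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens1")   # example interface for rCCL bootstrap traffic
os.environ.setdefault("UCX_NET_DEVICES", "mlx5_0:1")  # example device for UCX/MPI transport
os.environ.setdefault("OMP_NUM_THREADS", "8")         # CPU threads per rank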

If it’s not tested, it doesn’t work. Lamini also runs an extensive test suite on the complete build.  Test coverage includes:

  • Enumerating the GPUs
  • Enumerating the finetuning workers
  • Verifying SLURM GRES
  • Verifying the network interfaces
  • Verifying the network topology
  • Simple MPI job launch
  • Communication between simple MPI ranks
  • Latency and bandwidth tests on MPI links
  • Collective functional tests
  • Collective performance tests
  • Scheduling and prioritization
  • QoS enforcement
  • Checkpoint and restart of training jobs
  • Model distribution among ranks
  • Model training on a single GPU
  • Model training on multiple GPUs in the same node
  • Model training on multiple GPUs on multiple nodes
  • Data partitioning among multiple nodes
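
As a flavor of the simpler checks, a pytest-style sketch of the GPU enumeration test might look like this; the expected count is a placeholder for the cluster's configuration.

import torch

EXPECTED_GPUS_PER_NODE = 8   # placeholder for this cluster's configuration

def test_gpu_enumeration():
    # Every finetuning worker should see the full complement of GPUs.
    assert torch.cuda.is_available()
    assert torch.cuda.device_count() == EXPECTED_GPUS_PER_NODE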

Lamini is developed with continuous integration and continuous deployment, with weekly updates that automatically deploy to our AMD cloud.

We have found that the distributed training stack is so complex to build, configure, and deploy that we need a relatively large continuous testing cluster to ensure that any configuration or build change, e.g., bumping a driver version, ROCm library version, or Hugging Face loader version, passes the test suite. Some issues only show up at scale, especially connectivity, synchronization, checkpointing, and performance issues between multiple GPUs across different nodes and the network.

Train an LLM 10,000x faster

Now that we have this capability on AMD GPUs, you can train 1000x or even 10,000x faster. Happy training! Have questions? Please contact us at info@lamini.ai.

Also, we're hiring. Join us to invent and build the world’s largest LLM training system!

--

Ayushi Sharma
Founding Engineer, Lamini

Greg Diamos
Co-Founder & CTO, Lamini

March 14, 2024

Lamini