Lamini LLM Finetuning on AMD ROCm™: A Technical Recipe

ROCm™, AMD’s open software AI stack, is production-ready. We believe it has enormous potential to accelerate AI advancement to a similar or even greater degree than CUDA for large language model (LLM) finetuning and beyond.

In the announcement of our collaboration with AMD, we launched LLM Superstation—an optimized finetuning supercomputer, integrating 128 AMD Instinct™ GPUs running Lamini on top of the AMD ROCm open software ecosystem. Lamini and ROCm have reached sufficient maturity, enabling efficient finetuning of the largest LLMs, such as Meta AI’s Llama 2. If you’d like to learn more about finetuning, you can read our previous blog posts or take our finetuning course.

In this blog post, we’d like to show you how we built an optimized finetuning system using Lamini on AMD Instinct GPUs. There were several technical challenges:

  • Building LLM Superstation on AMD Instinct GPUs
  • Building an AMD LLM Software Stack
  • Optimizing LLM Performance

Building LLM Superstation on AMD Instinct GPUs

Lamini makes AMD Instinct GPUs available through the LLM Superstation in both desktop and rack-mount server configurations. We also provide a hosted cloud of MI servers with Lamini pre-installed.

ROCm has historically been supported only on AMD's Instinct GPUs, not on consumer Radeon GPUs, which are easier to obtain. This differs from CUDA's ubiquity across NVIDIA's product stack. AMD recently announced a "ROCm on Radeon" initiative to address this challenge, extending support to the AMD Radeon RX 7900 XTX and Radeon PRO W7900 GPUs.

Having worked with CUDA and experienced the difficulty of getting an entire deep learning framework running, I wanted to take a deeper look at ROCm as an alternative. This proved challenging initially. After much effort, and after obtaining AMD Instinct MI100 engineering samples from a professor in the UK, we finally had a system up that could run reliably, unlocking our LLM development.

We started with a single MI100 system in December 2022. AMD leadership generously donated two MI210 servers. As a pre-seed startup, we set up a cost-effective server room by adapting a janitor's closet with ventilation and sound insulation.

As demand grew, we expanded into a data center at CoreSite. Our typical configuration includes 4 x MI250 GPU servers, providing 512GB of HBM, which can fit a ~200B parameter LLM in bfloat16. We use shared NFS servers with about 400TB of storage and a high-performance network between the GPU servers. Looking ahead, the upcoming AMD Instinct MI300X GPUs with 192GB of HBM will allow us to scale even further.

Building an AMD LLM Software Stack

Today, we are rolling out phase two of our finetuning system—integrating 128 MI200 GPUs to power the Lamini platform.

Training LLMs is notoriously computationally intensive, as demonstrated in the MLPerf LLM benchmark. The largest foundation model training systems use over 10,000 GPUs. While still demanding, finetuning LLMs requires less compute and can be done at a much smaller scale, even on a single GPU. To enable training on clusters with hundreds of AMD Instinct GPUs, we leverage several layers of specialized software inside ROCm and have built additional layers of our own.

The first layer is the GPU itself. AMD and NVIDIA GPUs were both developed to run graphics applications like OpenGL or DirectX shaders. The CDNA architecture is a massively parallel array of compute units with highly threaded SIMD processors that share register files, scratchpad local data storage, and L1 caches.

The SIMD processors connect to a memory hierarchy of L2 caches joined by an on-chip interconnect, with HBM memory controllers on one side of the interconnect and the compute units on the other. Programs use a bulk synchronous programming model: a massive array of threads is launched across the GPU, scheduled onto the SIMD processors, performs local computation, and communicates through the memory hierarchy using synchronization primitives such as barriers.

To accelerate neural network operations, compute units include matrix cores, which perform tiled matrix operations, e.g., multiplication of bfloat16 inputs with accumulation into float32 outputs directly in the datapath. You can program the GPU directly at this layer using the CDNA instruction set, which, unlike NVIDIA's, is openly documented.

This style of architecture differs from systolic accelerators like TPUs and from dataflow architectures. In particular, GPU memory systems support ordinary loads, stores, and memcpy; they don't rely on magic compilers or ninja programmers to move data to the right place at the right time. Programmers can write complete, short kernels in GPU assembly that fully utilize the GPU for common operations like GEMM, which makes up the bulk of the computation in deep neural networks.

The next layer is the Linux kernel driver. The amdgpu driver exposes the GPUs to Linux, performing functions such as memory management, job scheduling, data transfer, virtualization, and resource allocation. It builds on the Radeon GPU driver. It is underappreciated how much software engineering work is required to build a fully functional, high-performance, and validated driver. It requires handling low-level details like setting up page tables and programming hardware memory management units such as the IOMMU.

In addition to the GPU driver, compilers are needed to generate efficient GPU code. AMD's compilers use the AMD GPU LLVM backend, which optimizes code from the perspective of a SIMD compute-unit thread. LLVM operates on an intermediate representation of RISC-like instructions such as add, multiply, call, and branch, which the framework optimizes and compiles down to the target instruction set.

The HIP runtime provides a drop-in replacement for the CUDA runtime API. It handles functionality like loading a GPU binary, copying memory between CPU DRAM and GPU HBM, launching threads, and synchronizing the GPU. HIP mirrors the CUDA interface, e.g. cudaMalloc becomes hipMalloc, which makes it possible to port frameworks like PyTorch to AMD by translating their CUDA calls to HIP. Thankfully, AMD has automated much of this via tools like HIPIFY.
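Here is a minimal sketch of how this layer shows up from Python, assuming a ROCm build of PyTorch on a machine with an AMD Instinct GPU: the familiar torch.cuda API is backed by HIP, and torch.version.hip identifies the ROCm build.

```python
import torch

# On a ROCm build of PyTorch, the CUDA-style API is backed by HIP under the hood.
print(torch.version.hip)          # set on ROCm builds (None on CUDA builds)
print(torch.cuda.is_available())  # True when an AMD Instinct GPU is visible

# The "cuda" device string maps to the HIP device, so existing CUDA-oriented
# PyTorch code runs unmodified on AMD GPUs.
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
```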

The third layer is optimized libraries, which are essential for achieving good performance. Matrix multiplication is a core operation in transformer architectures. There is a long history of optimizing matrix multiply in HPC, culminating in the BLAS (Basic Linear Algebra Subprograms) standard. AMD implements and optimizes GEMM (General Matrix Multiply) in the rocBLAS library, achieving 89% of theoretical peak performance.
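To get a feel for GEMM throughput on your own hardware, a rough timing sketch like the one below can be compared against the GPU's theoretical peak. It assumes a ROCm build of PyTorch, where the matrix multiply is routed to AMD's optimized BLAS libraries such as rocBLAS; the fraction of peak you measure will depend on matrix shapes, dtypes, and library versions.

```python
import time
import torch

# Time a large bfloat16 GEMM; on ROCm, PyTorch dispatches this to AMD's
# optimized BLAS libraries such as rocBLAS.
M = N = K = 8192
a = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
b = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")

for _ in range(3):        # warm up kernels and caches
    a @ b
torch.cuda.synchronize()

start = time.time()
c = a @ b
torch.cuda.synchronize()
elapsed = time.time() - start

tflops = 2 * M * N * K / elapsed / 1e12   # a GEMM performs ~2*M*N*K flops
print(f"{tflops:.1f} TFLOP/s")
```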

Another key operation in deep learning is data parallelism, where multiple GPUs cooperate to train a model. A common data-parallel primitive is all-reduce, which originated in the MPI standard. It was optimized for deep learning with a ring algorithm by my Baidu SVAIL team, especially Shubho Sengupta, improving bandwidth utilization over implementations tuned for other HPC applications. AMD implements an optimized all-reduce in the RCCL library.
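Below is a minimal data-parallel sketch using torch.distributed. It assumes a ROCm build of PyTorch, where the "nccl" backend name is implemented by RCCL, so the same script runs unchanged on AMD Instinct GPUs; launch it with torchrun, one process per GPU.

```python
import torch
import torch.distributed as dist

def main():
    # On ROCm builds of PyTorch, the "nccl" backend is implemented by RCCL.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Stand-in for a local gradient shard; the ring all-reduce sums it across GPUs.
    grad = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: sum = {grad[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=4 allreduce_example.py
```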

Now that we have an optimized foundation, we need a deep learning framework to run an entire model. AMD has integrated ROCm support into PyTorch, a mature framework with automatic differentiation, backpropagation, tensor operations, and models like Transformers. PyTorch compiles models into computational graphs whose operations dispatch to rocBLAS for GEMM, RCCL for all-reduce, and HIP for memory and kernel management.
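A minimal training step illustrates how these pieces fit together: the forward pass, automatic differentiation, and optimizer update below are ordinary PyTorch code, with the matrix multiplies and memory management dispatched through the ROCm layers described above. The toy model here is purely illustrative, not a Lamini model.

```python
import torch

# Toy model and optimizer; on a ROCm build, "cuda" targets the AMD GPU via HIP.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()        # backpropagation via autograd
optimizer.step()       # weight update
optimizer.zero_grad()
print(loss.item())
```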

For advanced operations, AMD uses OpenAI Triton, an open-source, high-performance kernel language integrated into PyTorch. Triton provides a Python-based kernel programming model, similar to CUDA C or OpenCL, with just-in-time compilation. This enables declaring and calling custom Triton kernels directly in PyTorch models. AMD uses Triton for optimizations like kernel fusion in the PyTorch backend. We have used Triton for custom kernels, such as float8 and flash attention.
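For a flavor of what a Triton kernel looks like, here is the classic vector-add example, adapted from the Triton tutorials rather than one of Lamini's production kernels: the kernel is written in Python, JIT-compiled, and launched directly from PyTorch.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)    # JIT-compile and launch
    return out

a = torch.randn(1_000_000, device="cuda")
b = torch.randn(1_000_000, device="cuda")
assert torch.allclose(add(a, b), a + b)
```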

PyTorch 2.0 and later can break your model down into the Triton language through the torch.compile API. This feature is driven by PyTorch's TorchInductor compiler, which translates PyTorch operators into various lower-level backend targets, including Triton. Since its enablement in PyTorch 2.0, TorchInductor has supported AMD Instinct and Radeon GPUs.
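Here is a small sketch of torch.compile in action; the MLP-style block is illustrative, and on a ROCm build TorchInductor can fuse the elementwise work and emit Triton kernels for it.

```python
import torch

def mlp_block(x, w1, w2):
    # Two matmuls with a GELU in between: a typical fusion target for TorchInductor.
    return torch.nn.functional.gelu(x @ w1) @ w2

compiled_block = torch.compile(mlp_block)   # lowered through TorchInductor to Triton

x  = torch.randn(512, 4096, device="cuda", dtype=torch.bfloat16)
w1 = torch.randn(4096, 11008, device="cuda", dtype=torch.bfloat16)
w2 = torch.randn(11008, 4096, device="cuda", dtype=torch.bfloat16)
out = compiled_block(x, w1, w2)              # first call triggers compilation
```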

Lamini builds on top of AMD's ROCm platform with additional software layers specialized for enterprise LLM applications. On top of these layers sits the Lamini SDK, which includes common LLM use cases like chat, classification, autocomplete, and more.

The Docs to QA (Chat) SDK demonstrates generating question-answer pairs, finetuning Llama 2, and connecting the model to a chat interface. The RAG SDK augments the finetuned model with data retrieved from an index coupled to the language model. The LLM Operator SDK extends chat with more advanced workflows like planning. Lamini uses finetuning to teach the model to call tools correctly, and integrates JsonFormer to force its output to conform to API specifications when calling external tools, such as saving a user response or querying a recommendation system. Try more SDKs yourself here.
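To show the idea of schema-constrained tool calls, here is a small sketch using the open-source jsonformer package with a Hugging Face model; the model name and schema are illustrative, and the snippet assumes jsonformer's documented interface of passing a model, tokenizer, JSON schema, and prompt.

```python
from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small model; the same idea applies to finetuned Llama 2 models.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical schema for a "save user response" tool call.
json_schema = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "user_id": {"type": "number"},
        "response": {"type": "string"},
    },
}

prompt = "Save the user's answer to the onboarding question."
generated = Jsonformer(model, tokenizer, json_schema, prompt)()
print(generated)   # output is constrained to match the schema
```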

The SDKs use Python and TypeScript API clients for loading, querying, and finetuning LLMs with just a few lines of code. The clients communicate with a REST API server that handles standard LLM APIs like text completion. The server can be deployed as a scalable containerized service. We also have a web interface where you can manage training jobs, see eval results, and test your model.
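The flow from client to server looks roughly like the sketch below; the endpoint path, payload fields, and authentication header are illustrative assumptions rather than Lamini's published API.

```python
import requests

# Hypothetical REST completion call; endpoint, payload schema, and auth header
# are placeholders, not Lamini's documented interface.
API_URL = "https://<your-lamini-host>/v1/completions"

response = requests.post(
    API_URL,
    headers={"Authorization": "Bearer <api-key>"},
    json={
        "model": "meta-llama/Llama-2-13b-chat-hf",
        "prompt": "Summarize the ROCm software stack in one sentence.",
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json())
```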

Lamini uses and extends SLURM to enable distributed, multi-GPU finetuning. SLURM Queen packages the SLURM workload manager into containers, providing efficient scaling and resource management for computationally demanding model training. Finetuning jobs from the REST API are executed by dynamically spinning up SLURM Queen containers on GPU clusters. This takes advantage of SLURM's strengths: locality-aware scheduling and scaling a single application, like model training, across many devices. In contrast to Kubernetes, which focuses on CPU microservices, SLURM excels at scaling GPU workloads like distributed finetuning of giant LLMs. SLURM Queen integrates SLURM with OCI container infrastructure to provide optimized scaling and resource management for Lamini's finetuning workloads.

Lamini's containerized design is infrastructure agnostic, enabling deployment on any platform supporting OCI containers. Containerization also enhances security by isolating components. Lamini can operate without external internet access and run in Secure Enclaves with strong data leak prevention guarantees. We’ve even deployed to highly secure hospital networks.

Overall, Lamini can horizontally scale out finetuning and inference workloads across thousands of MI GPUs. By relying on cloud-neutral MI GPUs with no supply constraints, Lamini can acquire the compute resources needed to scale massively.

Optimizing LLM Performance

Scaling laws prescribe a simple recipe that turns computation into intelligence. As shown in previous work by my team at Baidu's Silicon Valley AI Lab, and by OpenAI and DeepMind, scaling laws govern the performance of large language models: computation and data can be traded for intelligence, reinforcing the importance of accelerated computing.
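For reference, these studies fit test loss to simple power laws in model size, dataset size, and compute. A sketch of the canonical form (after Kaplan et al., 2020) is below; the constants and exponents are fit per model family and are not quoted from this post.

```latex
% Test loss L as a power law in parameters N, data D, and compute C.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```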

Lamini has incorporated several new optimizations that accelerate LLMs and take advantage of unique capabilities of AMD’s MI platform. These optimizations enable hosting 200 billion parameter models on a single server, 10,000 finetuned language models on one server, handling 12,800 simultaneous requests to a single server, and processing over 3.5 million queries per day on one node.

As users finetune models on new and different data, the number of distinct finetuned models grows rapidly. For example, in three weeks, Lamini free-tier users trained 5,758 different models. Lamini optimizes for this use case with a PEFT adaptor cache. During finetuning, most of the LLM's weights are frozen, and updates are backpropagated only into a subset reserved based on the amount of data being trained on. For example, if a 13B parameter Llama 2 model is trained using a 30MB adaptor, this results in 433x fewer weights to page in. The Lamini optimized inference cache uses the extra HBM capacity of the MI250 to store these adaptors. When a new request comes in, a few pointers in HBM are updated in approximately 2.93us, making model switching nearly instant. This enables one server to support more than 10,000 different models.
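The sketch below illustrates the idea of an adaptor cache; it is an illustration of the technique, not Lamini's internal implementation. The frozen base weights are loaded once, each finetuned model's small adaptor stays resident in spare HBM, and switching models only re-points at an already-resident adaptor instead of paging in billions of base weights.

```python
import torch

class AdaptorCache:
    """Keep many small PEFT adaptors resident in GPU HBM next to one frozen base model."""

    def __init__(self, base_model: torch.nn.Module):
        self.base_model = base_model.requires_grad_(False)   # frozen base weights, loaded once
        self.adaptors: dict[str, dict[str, torch.Tensor]] = {}
        self.active: str | None = None

    def register(self, model_id: str, adaptor_state: dict[str, torch.Tensor]) -> None:
        # A ~30MB adaptor easily fits in the MI250's spare HBM capacity.
        self.adaptors[model_id] = {k: v.to("cuda") for k, v in adaptor_state.items()}

    def switch(self, model_id: str) -> dict[str, torch.Tensor]:
        # "Switching models" just updates references to tensors already resident in HBM.
        self.active = model_id
        return self.adaptors[model_id]
```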

ROCm provides a foundation for running PyTorch applications in containers. However, high-performance inference requires handling many simultaneous requests with low latency. Lamini's inference server supports up to 12,800 concurrent requests and 3.5 million requests per day. It uses a FastAPI web server running on uvicorn to handle high concurrency. Lamini batches requests at the token level before submitting them to the GPU to reduce latency. Using Orca-inspired per-token batching, results stream back one token at a time without blocking new requests. This enables 18ms per-token latency for a 13B Llama 2 LLM on Lamini.
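A stripped-down sketch of this serving pattern is shown below: a FastAPI app served by uvicorn that streams results back token by token. The generate_tokens coroutine is a placeholder for a batched ROCm inference engine, not Lamini's actual server code.

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder for a per-token batched inference engine running on the GPU.
    for token in ["Fine", "tuning", " on", " ROCm", "!"]:
        await asyncio.sleep(0.018)   # ~18ms per token, as quoted above
        yield token

@app.post("/v1/completions")
async def complete(payload: dict):
    # Stream tokens back as they are produced, without blocking other requests.
    return StreamingResponse(generate_tokens(payload["prompt"]), media_type="text/plain")

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
```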

Lamini has already discovered several performance optimizations for finetuning, building on foundation model optimizations like mixed precision training, ring allreduce, and model parallelism. Further finetuning optimizations will enable more powerful finetuned models following scaling laws.

ROCm provides a mature foundation to implement these optimizations, given the hardware architecture and software support. This can facilitate LLM development with performance innovations like Lamini's, changing the landscape of large language model training.

--

Greg Diamos

Co-founder & CTO at Lamini

October 25, 2023

Lamini