Lamini & AMD: Paving the Road to GPU-Rich Enterprise LLMs

TL;DR

  • We’re unveiling a big secret: Lamini has been running LLMs on AMD Instinct™ GPUs over the past year—in production. Enterprise customers appreciate the top-notch performance.
  • Lamini is the only platform that lets enterprises easily run production-ready LLMs on AMD Instinct GPUs, with just 3 lines of code today.
  • Join Fortune 500 enterprises and run and finetune LLMs in your VPC or on-premise with Lamini.

LLMs are the new IP

Demand for enterprise LLMs is exploding. Over 5,000 companies have joined Lamini’s waitlist since we launched several months ago. In a recent AMD survey[1] of technology decision-makers, over 75% reported increasing AI investment, and 90% are already seeing significant returns.

iFit is one such enterprise: a market leader in the fitness space with over 6.4 million users on its apps and the owner of NordicTrack. LLMs are the new IP for them as they scale out their AI fitness coach to “generate personalized workout plans for [each user’s] specific fitness goals,” said Chase Brammer, CTO at iFit. “Using a public LLM wasn’t enough: we needed something that we could easily and quickly personalize to our customers’ data and constantly improve on new data, while keeping all of our data private,” he continued.

The significant benefit of owning your own LLM is the ability to control and personalize it. And you get to uniquely reap the profits from it, too. As Chase puts it: “The best way for us to build these powerful LLM capabilities was through Lamini and the Enterprise LLM that Lamini makes possible for our engineering team to build.”

Another Lamini customer is AMD, which delivers a broad portfolio of high-performance GPUs, CPUs, and adaptive computing solutions, in an open, proven, and ready way for AI. Internally, AMD is building and deploying LLMs for numerous use cases with prompt engineering and retrieval. However, there was demand for more performance and capabilities, so AMD turned to Lamini for its expertise in LLM finetuning. As Vamsi Boppana, SVP of AI at AMD, says,

"We're excited to work with Lamini to customize and personalize models to AMD users and high value use cases. We’ve deployed Lamini in our internal Kubernetes cluster with AMD Instinct GPUs, and are using finetuning to create models that are trained on AMD code base across multiple components for specific developer tasks."—Vamsi Boppana, SVP of AI at AMD

Lamini: the enterprise LLM platform for finetuning

Like iFit, many enterprises make building differentiated AI offerings their top priority. The goal? To create LLM products that capture as much commercial success as GitHub Copilot or ChatGPT, with over $1B in revenue and a competitive data moat to protect them.

However, achieving that goal is hard when the two options in the market seem to be: (1) convince 200 top (and unhireable) AI researchers and engineers to join next week, and convince your AWS rep to hand over 100 NVIDIA H100s, or (2) build undifferentiated hobbyist projects in a weekend hackathon.

It turns out that #1 is possible today without the whole team joining next week. Lamini makes finetuning LLMs easy for any engineer. Finetuning is the superpower that took a research project called GPT-3 in 2020 and turned it into ChatGPT, used by millions of people.

Lamini is built by a team that has been finetuning LLMs for the past two decades: we pioneered core LLM research such as LLM scaling laws, shipped LLMs in production to over 1 billion users, taught nearly a quarter of a million students online (Finetuning LLMs), and mentored the tech leads who went on to build the major foundation models: OpenAI’s GPT-3 and GPT-4, Anthropic’s Claude, Meta’s Llama 2, Google’s PaLM, and NVIDIA’s Megatron.

Lamini is optimized for finetuning enterprise LLMs, which work over large volumes of specialized data, tasks, and software interfaces. We build on top of foundation models: models like Llama 2, GPT-4, and Claude are optimized for general skills such as English, autocomplete, reasoning, and programming by training on general-purpose datasets like CommonCrawl, the Pile, or textbooks.

Lamini includes advanced optimizations for enterprise LLMs, built on and extending PEFT (LoRA), RLHF, and Toolformer, to provide data isolation across 4,266 models on the same server, speed up model switching by 1.09 billion times, compress models by 32x, and easily integrate LLMs with enterprise APIs without hyperparameter search. We have found that existing open-source libraries are not optimized for these enterprise use cases at all, leaving huge opportunities on the table, such as that 1.09-billion-times-faster model switching.
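
To make the PEFT building block concrete, here is a minimal, generic LoRA sketch using the open-source transformers and peft libraries; the model name and hyperparameters are illustrative assumptions, not Lamini’s internal configuration.

```python
# Minimal, generic LoRA sketch using the open-source transformers + peft
# libraries. The model name and hyperparameters are illustrative only; they
# are not Lamini's internal configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attach adapters to the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()         # adapter weights are a tiny fraction of the base model
```

Because only the low-rank adapter weights are trained and stored, many adapters can share a single resident copy of the base weights, which is what makes per-tenant data isolation, fast model switching, and heavy compression tractable in the first place.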

LLM Superstation: Production-ready LLMs on AMD Instinct™ GPUs

What’s more, with Lamini, you can stop worrying about the 52-week lead time for NVIDIA H100s. Only with Lamini can you build your own enterprise LLMs and ship them into production on AMD Instinct GPUs. And shhhh… Lamini has been secretly running on over one hundred AMD GPUs in production all year, since before ChatGPT launched. So, if you’ve tried Lamini, then you’ve tried AMD.

Now, we’re excited to open up LLM-ready GPUs to more folks. Our LLM Superstation is available both in the cloud and on-premise. It combines Lamini's easy-to-use enterprise LLM infrastructure with AMD Instinct™ MI210 and MI250 accelerators. It is optimized for private enterprise LLMs, built to be heavily differentiated with proprietary data.

Lamini is the only LLM platform that exclusively runs on AMD Instinct GPUs — in production. Ship your own proprietary LLMs! Just place an LLM Superstation order to run your own Llama 2-70B out of the box—available now and with an attractive price tag (10x less than AWS).

Lamini Data Center with AMD Instinct GPUs

Many of Lamini’s customers are finetuning and running Llama 2 on LLM Superstations—and owning those LLMs as their IP. Llama 2 is a state-of-the-art open-source LLM built by Meta AI. Joe Spisak, Product Director and Head of Generative AI Open Source at Meta AI, echoes the excitement around Llama 2:

“Generative AI is quickly becoming a transformational technology for startups and enterprises. We are excited that this innovation is being built on open technology and that Llama 2 is becoming the foundation of some of the most innovative companies.”
Lamini Co-founder & CTO Greg Diamos at our data center

Benchmarking LLM performance

Running and finetuning the largest LLMs rapidly requires high-performance infrastructure. One of the big questions is: how does AMD compare with NVIDIA? Greg Diamos, a former early CUDA architect at NVIDIA, cofounder of MLPerf, and CTO at Lamini, says:

"Using Lamini software, ROCm has achieved software parity with CUDA for LLMs. We chose the Instinct MI250 as the foundation for Lamini because it runs the biggest models that our customers demand and integrates finetuning optimizations. We use the large HBM capacity (128GB) on MI250 to run bigger models with lower software complexity than clusters of A100s."—Greg Diamos, CTO at Lamini

Below, we show the TFLOP/s numbers for small and large matrix GEMMs and hipMemcpy running on MI210 using rocBLAS 5.6.0. We test bfloat16 inputs accumulated into float32 outputs to demonstrate a common case that ensures training stability. Our benchmark shows that ROCm achieves up to 166 TFLOP/s (89% of theoretical peak) and 1.18 TB/s (70% of peak memory bandwidth).

This shows AMD's libraries effectively tap into the raw throughput of MI accelerators for key primitives. With basic building blocks operating efficiently, ROCm provides a solid foundation for high-performance applications like finetuning LLMs.
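
As a rough way to reproduce this kind of measurement, the sketch below times bfloat16 square GEMMs with PyTorch’s ROCm build, where torch.cuda maps to HIP and rocBLAS is invoked under the hood; the matrix size and iteration count are arbitrary choices, not the exact configuration behind the numbers above.

```python
# Rough GEMM throughput check with PyTorch on a ROCm build (torch.cuda maps to
# HIP on AMD GPUs, and rocBLAS handles the matmul). Matrix size and iteration
# count are arbitrary; they are not the exact benchmark configuration above.
import time
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

# Warm up so one-time kernel and library initialization is excluded from timing.
for _ in range(3):
    torch.matmul(a, b)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    torch.matmul(a, b)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# A square GEMM performs roughly 2 * n^3 floating-point operations.
tflops = 2 * n**3 * iters / elapsed / 1e12
print(f"{tflops:.1f} TFLOP/s")
```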

To enable finetuning on clusters with hundreds of AMD Instinct GPUs, Lamini leverages several layers of specialized software. First, Lamini includes a high-performance inference server that runs large models with low latency and high throughput by leveraging model caching and dynamic batching. Linked with this server is PEFT support for finetuning and serving tens of thousands of LLMs efficiently.
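
One way to picture serving that many finetuned models is that a LoRA adapter is small enough to keep many adapters resident and swap between them without reloading the shared base weights. The snippet below is a generic illustration of that idea with the open-source peft library; the adapter names and paths are hypothetical, and Lamini’s production switching layer is its own, more heavily optimized implementation.

```python
# Generic illustration of adapter switching on a shared base model with the
# open-source peft library. Adapter names and paths are hypothetical; Lamini's
# production switching layer is a separate implementation of this idea.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load two finetuned adapters onto the same resident copy of the base weights.
model = PeftModel.from_pretrained(base, "adapters/customer_a", adapter_name="customer_a")
model.load_adapter("adapters/customer_b", adapter_name="customer_b")

# Switching "models" only swaps megabytes of adapter weights, not the
# multi-gigabyte base model.
model.set_adapter("customer_b")
inputs = tokenizer("Summarize this support ticket:", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```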

One common LLM pattern is retrieval augmented generation, which Lamini optimizes by pushing an embedding cache directly into GPU HBM, colocated with the LLM. Finally, Lamini can horizontally scale LLMs across large clusters of thousands of Instinct GPUs using an inference load balancer and containerized, auto-scaling SLURM. Lamini LLM Superstations have zero lead time and no hardware shortage, and they can be networked together to create powerful finetuning and inference systems.
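
To illustrate the embedding-cache idea, the sketch below keeps normalized document embeddings resident on the GPU so retrieval becomes a single on-device matrix multiply next to the LLM, rather than a round trip to an external vector store; the embedding model, corpus, and sizes are placeholders, not Lamini’s actual components.

```python
# Simplified sketch of a GPU-resident embedding cache for retrieval augmented
# generation. The embedding model, corpus, and sizes are placeholders; a
# production system would shard and update this cache alongside the LLM.
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
corpus = ["Refund policy: ...", "Warranty terms: ...", "Setup guide: ..."]

# Embed the corpus once and keep the normalized vectors in GPU HBM,
# colocated with the model that will consume the retrieved passages.
cache = embedder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], convert_to_tensor=True, normalize_embeddings=True)
    scores = q @ cache.T                     # cosine similarity as one on-device matmul
    top = torch.topk(scores[0], k).indices
    return [corpus[i] for i in top.tolist()]

print(retrieve("How do I get my money back?"))
```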

Lamini’s software, coupled with AMD’s hardware, makes for a powerful experience. As Chase Brammer at iFit puts it: “It was simple to iterate and deploy with a few lines of code and amazingly fast with the AMD Instinct™ hardware.”

Lamini & AMD partnership

Lamini and AMD have partnered to build high-performance LLMs on AMD GPUs, making generative AI significantly more usable and accessible.

“Building LLMs should be easy: every enterprise should be able to own LLM IP, just like they do for all their other software. We’re excited to partner with AMD because their GPUs unlock a huge opportunity for enterprises to get started with little to no lead time.”—Sharon Zhou, CEO at Lamini

--

September 26, 2023

Lamini