So you want to build an app with multiple LLMs, each finetuned for a specific task. For example, one LLM generates SQL to access data from a relational database, another summarizes the results, and finally, a third answers users’ questions given the results. Or, you want to personalize an LLM to each of your million users.
With fast model switching using PEFT, you can run all of these LLMs on the same server! For 1,000 users, that saves you about $30M per year on cloud compute (at $20K/month per Amazon EC2 P4 instance with 8 NVIDIA A100 GPUs).
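The savings figure can be reproduced with a quick back-of-the-envelope calculation. One assumption here is ours, not stated above: without fast switching, each resident model pins one GPU.

```python
# Back-of-the-envelope cost of serving 1,000 per-user models without
# fast model switching, assuming (our assumption) one model per GPU.
users = 1_000
gpus_per_server = 8               # Amazon EC2 P4: 8x NVIDIA A100
cost_per_server_month = 20_000    # USD, approximate P4 monthly cost

servers_needed = users // gpus_per_server            # 125 servers
annual_cost = servers_needed * cost_per_server_month * 12

print(f"{servers_needed} servers -> ${annual_cost:,} per year")
# 125 servers -> $30,000,000 per year
```

With fast switching, those 1,000 models instead share a single server's GPUs.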
At Lamini, we are removing barriers to building your own customized LLMs. Today, we are tearing down another obstacle: model switching is now 1.109 billion times faster, thanks to our new PEFT adapter cache. This cache stores over 10,000 adapters in GPU high-bandwidth memory (HBM), which delivers over 1.5 TB/s of bandwidth.
Lamini already simplifies fine-tuning, allowing you to train models like Llama v2 in a few lines of code. This enables rapid iteration of your language model by correcting mistakes, editing guardrails, and incorporating user feedback. However, fast iteration produces many models.
Lamini customers have already trained 5,758 models in just three weeks, and they want to use any of them on demand.
Each model is huge: a 13B-parameter Llama v2, for example, requires 26 GB of storage in float16 format. Storing thousands of these models, e.g., one personalized to each customer, is problematic. We expect over 10,000 models soon, totaling over 250 TB. Caching 250 TB on any single server is infeasible, so we have to use distributed storage such as Azure Blob Storage, S3, or NFS, which creates a big problem during inference.
When fetching weights for a new inference request, the naive solution is to load the model over the network, copying 26 GB over a network connection every time a new model is accessed. On a 100 Mbps connection, switching models takes the better part of an hour. As users train more models, this switching time balloons.
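The transfer time is easy to bound from below. The sketch below ignores protocol overhead, contention, and deserialization, all of which push real-world switch times even higher.

```python
# Rough lower bound on naive model-switch time: copying the full
# float16 weights of a 13B-parameter model over the network.
model_bytes = 26e9      # 13B parameters x 2 bytes (float16)
link_bps = 100e6        # 100 Mbps network connection

seconds = model_bytes * 8 / link_bps
print(f"{seconds:.0f} s (~{seconds / 60:.0f} minutes) per model switch")
# 2080 s (~35 minutes) per model switch
```

And that is the best case: with storage-side throttling and overhead, a single switch approaches an hour.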
For example, one user submitted 10 new model training requests and was shocked by the loading time: 10 hours!
Our solution is Parameter-Efficient Finetuning (PEFT), which freezes most weights during finetuning and uses backpropagation to update only small subsets of weights, called adapters. The adapters need only 10-30 MB of storage instead of tens of GBs for the entire model. This accelerates loading from disk by 1000x and enables caching a large number of adapters in high-bandwidth GPU memory. For example, with tens of GBs of GPU HBM, we can cache 10,000+ adapters, slashing switch time from 3,250 seconds to 2.93 μs.
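To make this concrete, here is a minimal sketch of both ideas: why a LoRA-style adapter is so much smaller than the layer it modifies, and how an adapter cache keyed by model ID could work. The class name, the LRU eviction policy, and all identifiers below are illustrative assumptions, not Lamini's actual implementation.

```python
from collections import OrderedDict

# Why LoRA adapters are tiny: a rank-r adapter replaces a d x d weight
# update with two low-rank factors A (r x d) and B (d x r).
d, r = 4096, 16
full_params = d * d           # 16,777,216 parameters per layer
adapter_params = 2 * d * r    # 131,072 parameters (~0.8% of the layer)

# Illustrative adapter cache (a sketch, not Lamini's implementation):
# the frozen base model stays resident in GPU HBM; only small adapters
# are swapped in and out.
class AdapterCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._cache = OrderedDict()  # model_id -> adapter weights in HBM

    def get(self, model_id, load_fn):
        if model_id in self._cache:           # hit: a dict lookup
            self._cache.move_to_end(model_id)
            return self._cache[model_id]
        adapter = load_fn(model_id)           # miss: ~10-30 MB from storage
        self._cache[model_id] = adapter
        if len(self._cache) > self.capacity:  # evict least recently used
            self._cache.popitem(last=False)
        return adapter
```

Switching to a cached model then costs a dictionary lookup plus attaching a small adapter to the shared base model, rather than copying 26 GB of weights.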
If you think a 10x speedup is good, this is a whopping 1,109,000,000x faster!
Now that model switching is over a billion times faster on Lamini, you can finetune as many LLMs as you want.
Get started finetuning Llama v2:
Read more about Lamini finetuning:
Stay tuned for more optimizations and resources that help you build your own LLMs: easier to create, faster, and higher-performing than general LLMs.
- LoRA, a popular PEFT method: https://arxiv.org/abs/2106.09685
- from the Lamini team on August 16, 2023