Why is writing a prompt so easy, but training an LLM from a base model still so hard? Iteration cycles for fine-tuning on modest datasets are measured in months because it takes significant time to figure out why fine-tuned models fail. Conversely, prompt-tuning iterations are on the order of seconds, but performance plateaus in a matter of hours. Only a limited amount of data can be crammed into the prompt, not the terabytes of data in a warehouse.
It took OpenAI months, with an incredible ML team, to fine-tune and run RLHF on GPT-3, a base model that had been available for years, to create what became ChatGPT. This training process is only accessible to large ML teams, often staffed with PhDs in AI.
Technical leaders at Fortune 500 companies have told us the same story.
That’s why we’re building Lamini: to give every developer the superpowers that took the world from GPT-3 to ChatGPT.
Lamini is an LLM engine that allows any developer, not just machine learning experts, to train high-performing LLMs, as good as ChatGPT, on large datasets with just a few lines of code from the Lamini library (check out an example here!).
The optimizations in this library reach far beyond what’s available to developers today, from harder ones like RLHF to simpler ones like reducing hallucinations.
Lamini makes it easy to run multiple base model comparisons in just a single line of code, from OpenAI’s models to open-source ones on HuggingFace.
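To make "a few lines of code" concrete, here is a minimal sketch in the spirit of the Lamini Python library. The import path, class names, and the model_name argument are assumptions for illustration, not the exact API; check the library docs for the real interface.

```python
# A minimal sketch in the spirit of the Lamini library.
# The names and signatures below are illustrative assumptions.
from llama import LLMEngine, Type, Context  # assumed import path

class Question(Type):
    question: str = Context("a question from a user")

class Answer(Type):
    answer: str = Context("the answer to the question")

# Instantiate an engine; a model_name-style argument is where you would swap
# base models, from OpenAI's models to open-source ones on HuggingFace.
llm = LLMEngine(id="my-support-bot", model_name="EleutherAI/pythia-410m")

# Run the model on typed input and output, like calling a function.
result = llm(Question(question="How do I reset my password?"), output_type=Answer)
print(result.answer)
```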
Now that you know a bit about where we’re going: today, we’re excited to release our first major community resource!
It covers several important steps to training your own LLM on your data.
Base models have a good understanding of English for consumer use cases. But when you need them to learn your vertical-specific language and guidelines, prompt-tuning is often not enough and you will need to build your own LLM.
The rest of this post walks through the steps to get an LLM that follows instructions for your use case, the way ChatGPT does.
Lamini delivers the ease of prompt-tuning, with the performance of RLHF and fine-tuning. It will soon handle this entire process (sign up for early access!).
ChatGPT took the world by storm because it could follow instructions from the user, while the base model it was trained from (GPT-3) couldn’t do that consistently. For example, if you asked the base model a question, it might generate another question instead of answering it. For your application, you might want similar "instruction-following" data, but you could also want something completely different, like responding only in JSON.
You'll need a dataset of ~50k instruction-following examples to start. Don't panic. You can now use Lamini’s hosted data generator to turn just 100 examples into over 50k in just a few lines of code.
You don’t need to spin up any GPUs, because Lamini hosts it for you. All the data that is used is commercial-use-friendly, meaning you own all the data that comes out of it.
You can customize the initial 100+ instructions so that the LLM follows instructions in your own vertical. Once you have those, submit them to the Lamini data generator, and voilà: you get a large instruction-following dataset on your use case as a result!
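For concreteness, the seed set is just a small collection of instruction-response pairs written in your vertical’s language. Here is a minimal sketch of preparing one; the JSONL format and field names are illustrative assumptions, and the hosted generator’s exact input format may differ.

```python
import json

# ~100 hand-written seed examples in your vertical's language.
# The field names ("instruction", "response") are an illustrative convention.
seed_examples = [
    {
        "instruction": "Summarize the customer's support ticket in one sentence.",
        "response": "The customer cannot log in after resetting their password.",
    },
    {
        "instruction": "List the compliance checks required before shipping a new feature.",
        "response": "Security review, privacy review, and accessibility audit.",
    },
    # ... roughly 100 of these in total
]

# Write them as JSON Lines, a common format for instruction datasets.
with open("seed_instructions.jsonl", "w") as f:
    for example in seed_examples:
        f.write(json.dumps(example) + "\n")
```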
Inspired by Stanford Alpaca, the Lamini data generator is a pipeline of LLMs that takes your original small set of 100+ instructions, paired with expected responses, and generates 50k+ new pairs.
This generation pipeline uses the Lamini library to define and call LLMs to generate different, yet similar, pairs of instructions and responses. Trained on this data, your LLM will improve to follow these instructions.
We provide good defaults for the generation pipeline, built on open-source LLMs that we call Lamini Open and Lamini Instruct. With new LLMs being released every day, we update these defaults to the best-performing models.
As of this release, we are using EleutherAI’s Pythia for Lamini Open and Databricks’ Dolly for Lamini Instruct. Lamini Open generates more instructions, and Lamini Instruct generates paired responses to those instructions.
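To make the two-stage idea concrete, here is a rough sketch of an Alpaca-style pipeline written directly against HuggingFace transformers: a base model proposes new instructions from a few seed instructions, and an instruction-tuned model writes responses to them. This is an illustration of the approach, not Lamini’s actual implementation; the checkpoint sizes and prompt formats are assumptions.

```python
import random
from transformers import pipeline

# Stage 1: an open-source base model (the "Lamini Open" role) proposes new
# instructions, prompted with a few of your seed instructions as examples.
instruction_generator = pipeline("text-generation", model="EleutherAI/pythia-410m")

# Stage 2: an instruction-tuned model (the "Lamini Instruct" role) writes a
# response to each newly generated instruction.
response_generator = pipeline("text-generation", model="databricks/dolly-v2-3b")

seed_instructions = [
    "Summarize the customer's support ticket in one sentence.",
    "List the compliance checks required before shipping a new feature.",
    "Draft a polite reply declining a refund request.",
]

def generate_pair():
    # Few-shot prompt: show existing instructions, ask the base model for one more.
    examples = random.sample(seed_instructions, k=3)
    prompt = "Here are some tasks:\n" + "\n".join(f"- {e}" for e in examples) + "\n- "
    out = instruction_generator(prompt, max_new_tokens=40, do_sample=True)
    new_instruction = out[0]["generated_text"][len(prompt):].split("\n")[0].strip()

    # Ask the instruction-tuned model to answer the new instruction.
    response_prompt = f"Instruction: {new_instruction}\nResponse:"
    out = response_generator(response_prompt, max_new_tokens=128, do_sample=True)
    new_response = out[0]["generated_text"][len(response_prompt):].strip()
    return {"instruction": new_instruction, "response": new_response}

# Repeat (varying the sampled examples) until you have 50k+ pairs.
pairs = [generate_pair() for _ in range(5)]
```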
The final generated dataset is available for your free commercial use (CC-BY license).
The Lamini library allows you to swap our defaults for other open-source or OpenAI models in just one line of code. Note that while we find OpenAI models to perform better on average, their license restricts commercial use of generated data for training models similar to ChatGPT.
If you’re interested in more details on how our data generator works, read more or run it here.
Some of the generated data is good, and some of it is not. Before fine-tuning, the next step is to filter the generated dataset down to the high-quality examples (just run this simple script in the same repo). Lamini then creates a custom LLM by training a base model on this filtered, generated dataset.
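The released repo has its own filtering script; as a rough sketch of the kind of checks involved (the thresholds, heuristics, and file names below are assumptions), filtering can look like this:

```python
import json

def keep(example, seen):
    """Simple quality heuristics for a generated instruction-response pair."""
    instruction = example["instruction"].strip()
    response = example["response"].strip()
    if not instruction or not response:
        return False                      # drop empty generations
    if len(response.split()) < 3:
        return False                      # drop trivially short responses
    if instruction.lower() == response.lower():
        return False                      # drop degenerate echoes
    key = instruction.lower()
    if key in seen:
        return False                      # drop duplicate instructions
    seen.add(key)
    return True

seen = set()
with open("generated_pairs.jsonl") as f_in, open("filtered_pairs.jsonl", "w") as f_out:
    for line in f_in:
        example = json.loads(line)
        if keep(example, seen):
            f_out.write(json.dumps(example) + "\n")
```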
We have released an open-source instruction-following LLM (CC-BY license) using Lamini to train the Pythia base model with 37k generated instructions, filtered from 70k. Play with this custom LLM in the playground now.
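For a sense of what that final training step involves under the hood, here is a compressed sketch of supervised fine-tuning of a Pythia base model on the filtered instruction data, using the HuggingFace Trainer. It is a minimal illustration of standard fine-tuning, not Lamini’s training recipe; the checkpoint size, hyperparameters, and file name are assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "EleutherAI/pythia-410m"   # illustrative size; choose one that fits your budget
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Filtered instruction-response pairs from the previous step.
dataset = load_dataset("json", data_files="filtered_pairs.jsonl", split="train")

def to_features(example):
    # Concatenate instruction and response into one training sequence.
    text = f"Instruction: {example['instruction']}\nResponse: {example['response']}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="custom-llm", num_train_epochs=2,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```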
We’re excited to dramatically improve the performance of training LLMs and make it easy for engineering teams to train them. These two frontiers are intertwined: with faster, more effective iteration cycles, more people will be able to build these models, beyond just fiddling with prompts. We exist to help any company unlock the power of generative AI by making it easy to put their own data to work.
Team++: We are growing our team with people who are passionate about making it possible to build LLMs 10x faster and making them widely accessible to empower new, extraordinary use cases. If that’s you, please apply via https://jobs.lever.co/laminiai 🤝