How to specialize general LLMs to private data

Lamini

TL;DR

Rapidly train an LLM to perform better on your proprietary data than any LLM out there!
Play with an example: a better LLM for chatting over internal engineering documentation, trained by full-stack software engineers
Follow the steps: write your own code, connect your own data, host on your own infrastructure
Sign up for LLM training on your own infrastructure, e.g. a VPC or on premises.

Easy, fast LLM training ⚡

The AI revolution is here. You’ve been prompt-engineering LLMs and asking yourself two questions:

Is prompt engineering really all I can do to make these things better?
Realistically, what’s my job as a software engineer going to be like in the future?

Cue: easy, fast LLM training.

The future of software engineering will be architecting a new layer of LLM infrastructure above foundation models. It will be about steering LLMs towards better performance with powerful programs and robust data pipelines.

From the perspective of AI researchers, the LLMs that you are playing with today will be the worst ones you will have used in the next decade.

That is to say, this is just the beginning of improving LLMs. You’ll get to heavily personalize LLMs with your own data. And, LLMs will dramatically transform user experience, lowering the barriers to entry on every product and feature you’ve built or seen before.

But steering LLMs like that feels impossible now, for many reasons:

Finetuning APIs don’t work: they seem to make models worse.
Data breaches are a real threat to your business’s core IP, and a real violation of your customers’ trust in your service.
Off-the-shelf AI services change all the time, rendering many of them unreliable.
Building out a team of 50+ AI researchers feels like a time-consuming and cost-prohibitive task.

This can’t possibly be the only way to build in this AI revolution. You’re right.

That’s why we’re excited to show a Lamini demo for any software engineer to specialize the most powerful LLMs to their use case, on proprietary data and infrastructure.

Top technology leaders have told us:

"We couldn't have gotten to this level of LLM use and accuracy without Lamini."
Lamini was the "best" and clearly the "closest" to their use case, in a blind test comparing the model to ChatGPT with retrieval.

Case study: Train an LLM to hallucinate less & understand more information 📚

*Table: Play with this LLM live now! Just use your Google account to sign into Lamini and start asking questions. Please note that the results are always improving on our live version, so expect some differences.*

Internal engineering documentation (and code) can be difficult to navigate: it is hard to find the relevant information, to understand the code structure, and to identify dependencies. It would be helpful to ask someone knowledgeable about that part of the codebase to get the right answer immediately. But those people are often hard to reach.

Now, an LLM that has read all of your code and documentation could help both you and your customers navigate it. In many cases, this would need to run locally to keep your source code private. We had the same idea, so we set out to prompt-engineer a model with retrieval to do this.

However, in addition to data privacy concerns, off-the-shelf solutions were not able to achieve good performance for this use case. They:

Make things up (hallucinate)
Provide inconsistent responses
Are sometimes too verbose, other times not descriptive enough
Respond too slowly

So, we used the Lamini library to specialize a general LLM to this specific use case, by training it on all of Lamini’s internal engineering documentation.

In the Table above, you can compare the two approaches: a Lamini-optimized LLM (i.e. with training) and a prompt-engineered LLM (i.e. with retrieval). The Lamini-optimized LLM does not hallucinate false information (row 1), is able to find the relevant information (row 2), and tries to steer the conversation back on track when the user tries to ask other things (row 3).
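For readers unfamiliar with the retrieval baseline in this comparison, here is a minimal sketch of how a retrieval-based prompt is typically assembled. It is not Lamini's implementation: the keyword-overlap scoring, the function names (`retrieve`, `build_prompt`), and the documentation snippets in `DOCS` are all assumptions made up for illustration. Real systems usually rank with embeddings rather than token overlap.

```python
from __future__ import annotations

import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def score(query: str, doc: str) -> int:
    """Count query tokens (with multiplicity) that also occur in the document."""
    doc_counts = Counter(tokenize(doc))
    query_counts = Counter(tokenize(query))
    return sum(min(n, doc_counts[tok]) for tok, n in query_counts.items())


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score, best first."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Paste the retrieved snippets into the prompt sent to a general LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Answer using only the documentation below.\n"
        f"Documentation:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical documentation snippets, invented for this example.
DOCS = [
    "add_data loads typed examples into the model before training.",
    "The filter step removes low-quality generated examples.",
    "Authentication uses a token stored in a local config file.",
]

if __name__ == "__main__":
    print(build_prompt("How do I load data before training?", DOCS, k=1))
```

In this setup the model only sees whatever the retriever surfaces for each question and its weights never change, which is the key difference from the trained approach.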

How to prepare data and train LLMs with Lamini 🦙

You can train LLMs using Lamini by writing code to connect your data from your data warehouse or data lake.

Define the LLM interface using Lamini Types. You want it to be a chatbot? The interface is question in, answer out. You want it to be a code copilot? The interface is programs in, more programs out. Run a general LLM (a.k.a. base model or foundation model) using your types.
Find relevant data and create Lamini Types. What data would be useful to an expert human performing the task? Get that data and create (additional) Lamini Types that match its schema. It can be supporting documentation for your code chatbot, such as function descriptions in your docs, or sample questions that users would ask your bot.
Load data into your Types and load Types into your LLM using Lamini. This casts your data into the Types format so that an LLM can best learn from it.
Get data that matches your LLM interface. Don’t have any? No problem. That’s what data generation is for, with a pipeline of LLMs. First, run data generation with the Lamini LLM Engine to get more data of any of your Lamini Types. Then, filter the data using Lamini filters or your own scripts to get high-quality data.
Specialize a general LLM with optimized training. Using the Lamini library, train your LLM on all of your data.

To do this on your own infrastructure, you just need to install Lamini locally. Sign up for our waitlist or start training on our infrastructure now!
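To make the data-preparation steps above more concrete, here is a rough sketch of their shape in plain Python. It deliberately does not use the Lamini library: the dataclasses stand in for Lamini Types, `template_generator` stands in for LLM-based data generation, and `filter_examples` stands in for a quality filter. Every name and record below (`FunctionDoc`, `add_data`, `ping`, and so on) is hypothetical. The final training step is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


# Step 1: the interface for a chatbot is question in, answer out.
@dataclass(frozen=True)
class Question:
    text: str


@dataclass(frozen=True)
class Answer:
    text: str


@dataclass(frozen=True)
class Example:
    question: Question
    answer: Answer


# Step 2: an additional type that matches the schema of supporting data.
@dataclass(frozen=True)
class FunctionDoc:
    name: str
    description: str


# Step 3: cast raw records (e.g. rows from a data warehouse) into types.
def load_function_docs(rows: Iterable[dict]) -> list[FunctionDoc]:
    return [FunctionDoc(name=r["name"], description=r["description"]) for r in rows]


# Step 4a: generate examples that match the interface.
# `generate` is a placeholder for a call to a generation model.
def generate_examples(
    docs: Iterable[FunctionDoc], generate: Callable[[FunctionDoc], Example]
) -> list[Example]:
    return [generate(doc) for doc in docs]


# Step 4b: keep only examples that pass simple quality checks.
def filter_examples(examples: Iterable[Example], min_answer_chars: int = 20) -> list[Example]:
    seen: set[str] = set()
    kept: list[Example] = []
    for ex in examples:
        key = ex.question.text.strip().lower()
        too_short = len(ex.answer.text.strip()) < min_answer_chars
        if key in seen or too_short:
            continue  # drop duplicate questions and very short answers
        seen.add(key)
        kept.append(ex)
    return kept


def template_generator(doc: FunctionDoc) -> Example:
    """Toy stand-in for an LLM: builds one pair from a fixed template."""
    return Example(Question(f"What does {doc.name} do?"), Answer(doc.description))


# Hypothetical rows, invented for this example.
rows = [
    {"name": "add_data", "description": "Loads typed examples into the model before training."},
    {"name": "add_data", "description": "Loads typed examples into the model before training."},
    {"name": "ping", "description": "Checks health."},
]
training_set = filter_examples(generate_examples(load_function_docs(rows), template_generator))
```

With these toy rows, the duplicate question and the very short answer are filtered out, leaving one example. In practice the filter criteria would depend on your data and use case.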

Team++: We are growing our team with people who are passionate about making LLMs widely accessible to empower new, extraordinary use cases. If that’s you, please apply via https://jobs.lever.co/laminiai 🤝

Published on June 15, 2023