Fine-tune your own Llama 2 to replace GPT-3.5/4

submitted by kcorbi+(OP) on 2023-09-12 16:53:51 | 955 points 181 comments

There has been a lot of interest on HN in fine-tuning open-source LLMs recently (e.g. Anyscale's post at https://news.ycombinator.com/item?id=37090632). I've been playing around with fine-tuning models for a couple of years, and wanted to share some insights and practical code. I’ve condensed what I’ve learned into a small set of notebooks at https://github.com/OpenPipe/OpenPipe/tree/main/examples/clas..., covering labeling data, fine-tuning, running efficient inference, and evaluating costs/performance. The 7B model we train here matches GPT-4’s labels 95% of the time on the test set, and for the 5% of cases where they disagree it’s often because the correct answer is genuinely ambiguous.
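
The 95% figure is a label-agreement rate between two models on the same test set. A minimal sketch of how such a rate could be computed, assuming both sets of labels are available as equal-length Python lists; the function name and the sample labels are hypothetical and are not taken from the linked notebooks:

```python
def agreement_rate(reference_labels, candidate_labels):
    """Fraction of examples where the candidate label equals the reference label."""
    if len(reference_labels) != len(candidate_labels):
        raise ValueError("label lists must have the same length")
    if not reference_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(r == c for r, c in zip(reference_labels, candidate_labels))
    return matches / len(reference_labels)


# Hypothetical labels for illustration only.
gpt4_labels = ["dessert", "main", "main", "side"]
llama_labels = ["dessert", "main", "side", "side"]
print(agreement_rate(gpt4_labels, llama_labels))  # 3 of 4 labels match
```

Disagreements found this way are worth reading by hand, since some of them may be cases where the reference label itself is debatable.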

What is fine-tuning? You can think of it as a more powerful form of prompting, where instead of writing your instructions in text you actually encode them in the weights of the model itself. You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn. Fine-tuning can work with as few as 50 examples, but I usually try to get 1000+ if possible.
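
As a rough illustration of what "example input/output pairs" can look like on disk, the sketch below serializes a few pairs as JSON Lines (one JSON object per line). The field names `input` and `output` and the recipe examples are assumptions for illustration; the actual format depends on the training tool you use.

```python
import json

# Hypothetical examples; the "input"/"output" field names are an assumption.
examples = [
    {"input": "Recipe: Chocolate chip cookies", "output": "dessert"},
    {"input": "Recipe: Grilled chicken salad", "output": "main"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(examples)
print(jsonl_text)
```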

Prompting still has some big advantages over fine-tuning. It's way easier/faster to iterate on your instructions than to label data and re-train a model. And operationally it's easier to deploy one big model and just adjust its behavior as necessary vs. deploying many small fine-tuned models that will likely each get lower utilization.

Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!

For example, classifying the 2M recipes at https://huggingface.co/datasets/corbt/all-recipes with GPT-4 would cost $23k. Even with GPT-3.5 it would cost over $1k. The model we fine-tuned performs similarly to GPT-4 and costs just $19 to run over the entire dataset.
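
The arithmetic behind these figures can be checked directly. The sketch below uses only the totals quoted above (the GPT-3.5 figure is a lower bound, since the post says "over $1k") and derives the per-recipe cost and the ratio to the fine-tuned model's cost; the dictionary keys are just labels chosen for this example.

```python
NUM_RECIPES = 2_000_000  # dataset size quoted in the post

# Total costs quoted in the post, in US dollars.
# The GPT-3.5 figure is a lower bound ("over $1k").
total_cost_usd = {
    "gpt-4": 23_000,
    "gpt-3.5 (lower bound)": 1_000,
    "fine-tuned llama 7b": 19,
}

baseline = total_cost_usd["fine-tuned llama 7b"]
for model, cost in total_cost_usd.items():
    per_recipe = cost / NUM_RECIPES
    ratio = cost / baseline
    print(f"{model}: ${per_recipe:.7f} per recipe, {ratio:.0f}x the fine-tuned cost")
```

On these numbers GPT-4 comes out at roughly 1,200 times the fine-tuned model's cost, and GPT-3.5 at a little over 50 times, which is in line with the 50x per-token figure.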

Disclaimer: My brother David and I are working on an open-source product called OpenPipe (https://github.com/openpipe/openpipe) to help engineers adopt fine-tuning as simply as possible. But none of the information above depends on our startup. The current post is just about sharing information that we’ve learned about fine-tuning. I hope it’s useful!


>>minima+08
We're finding that when running Llama-2-7B with vLLM (https://github.com/vllm-project/vllm) on an A40 GPU we're getting consistently lower time-to-first-token and lower average token generation time than GPT-3.5, even when processing multiple requests in parallel. A40s are pretty easy to get your hands on these days (much easier than A100s anyway).

The 50x cheaper (that's 2% of the cost, not 50% of the cost) number does assume 100% GPU utilization, which may or may not be realistic for your use case. If you're doing batch processing as part of a data pipeline, which is not an unusual use case, you can run your GPU at 100% utilization and turn it off when the batch finishes.

If you've got a highly variable workload then you're right, you'll have much lower utilization numbers. But if you work with an aggregator that can quickly hot swap LoRA fine-tunes (as a disclaimer, my company OpenPipe works in this space) you can get back a lot of that lost efficiency since we can increase/decrease GPU capacity only when our aggregate usage changes, which smooths things out.
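
To make the utilization point concrete, here is a minimal sketch under one simplifying assumption: the GPU is billed for the whole time it is on, so the cost per generated token scales with 1 / utilization. The function name is hypothetical and the model ignores batching effects and startup time.

```python
def effective_cost_advantage(advantage_at_full_utilization, utilization):
    """Cost advantage over a pay-per-token API when the GPU is only partly busy.

    Assumes the GPU is billed for the whole time it is on, so the cost per
    generated token scales with 1 / utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return advantage_at_full_utilization * utilization


for utilization in (1.0, 0.5, 0.1, 0.02):
    advantage = effective_cost_advantage(50, utilization)
    print(f"{utilization:.0%} utilization -> {advantage:.0f}x cheaper")
```

Under this assumption the 50x advantage shrinks linearly with utilization, and the break-even point against GPT-3.5 is around 2% utilization.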

>>kcorbi+(OP)
Very nice, thanks!

Check out what Matt Shumer put together as well: https://github.com/mshumer/gpt-llm-trainer.

I have used his trainer for auto-distillation of GPT-4 into GPT-3.5 fine-tunes, but plan to do the same for Llama as well.

Cheers!

>>ronyfa+wk
For use cases well within the capabilities of an LLM from last year, fine-tuned LLaMa 2 13B should/will blow ChatGPT out of the water: think "rate the sentiment of this text from 0-10".

I believe this because LLaMa-2 13B is more than good enough to handle what I call "quick search", i.e.

```
User: "What's the weather in Milwaukee?"

System: Here's some docs, answer concisely in one sentence.

AI: It's 73 degrees Fahrenheit.
```

YMMV on cost still, depends on cloud vendor, and my intuition agrees with yours: GPT-3.5 is priced low enough that there isn't a case where it makes sense to use another model. It strikes me now that there's a good reason for that intuition: OpenAI's $/GPU hour is likely <= any other vendor's and inference time of LLaMa 2 ~= GPT.

I do think this will change with local LLMs. They've been way over-hyped for months, but after LLaMa 2, the challenges remaining are more sociological than technical.

For months now it's been one-off $LATEST_BUZZY_MODEL.c stunts that run on desktop.

The vast majority of the _actual_ usage and progress is coming from porn-y stuff, and the investment occurs in one-off stunts.

That split of effort, and lack of engineering rigor, is stunting progress overall.

Microsoft has LLaMa-2 ONNX available on GitHub[1]. There are budding but very small projects in different languages to wrap ONNX. Once there's a genuine cross-platform[2] ONNX wrapper that makes running LLaMa-2 easy, there will be a step change. It'll be "free"[3] to run your fine-tuned model that does as well as GPT-4.

It's not clear to me exactly when this will occur. It's "difficult" now, but only because the _actual usage_ in the local LLM community doesn't have a reason to invest in ONNX, and it's extremely intimidating to figure out how exactly to get LLaMa-2 running in ONNX. Microsoft kinda threw it up on GitHub and moved on; the sample code even still needs a PyTorch model. I see at least one very small company on HuggingFace that _may_ have figured out full ONNX.

Funnily enough, ONNX is getting a spike in mindshare over the last month in the _Stable Diffusion_ community. There's decent cross-pollination between local art and local LLMs, e.g. LoRAs were first a thing for Stable Diffusion. So I'm hoping we see this sooner rather than later.

[1] https://github.com/microsoft/Llama-2-Onnx

[2] Definition of cross-platform matters a ton here, what I mean is "I can import $ONNX_WRAPPER_LIB on iOS / Android / Mac / Windows and call Llama2.reply(String prompt, ...)"

[3] Runs on somebody else's computer, where "somebody else" is the user, instead of a cloud vendor.

>>3abito+4w
There are already many hundreds of finetunes on huggingface, and many excellent UIs to run them in, like KoboldCPP and Text-gen-ui: https://huggingface.co/models?sort=modified&search=13B

There is even a crowdsourced version of the UI like artbot: https://lite.koboldai.net/#

And there are some excellent extant finetuning frameworks, like Axolotl, that run on consumer GPUs: https://github.com/OpenAccess-AI-Collective/axolotl

IIRC Text-gen-ui had a QLoRA finetuning UI too.

What I am saying is that it's already like Stable Diffusion, but the community is just somewhat under the radar, and finetuning will never be quite as turnkey as dreambooth/sd 1.5 LoRA due to the nature of the training data.

>>kcorbi+(OP)
I found this tutorial helpful for getting started with fine-tuning https://www.youtube.com/watch?v=74NSDMvYZ9Y

This guy used gradient.ai and he has a Google Colab to try it.

>>divbze+HE
The Huggingface Leaderboard is mostly dominated by Llama 2 variants: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

It depends a lot on what you're trying to do. If you have a focused use case for the type of fine-tuning you want, you can probably get away with one of the smaller models.

Another thing to look out for is Retrieval Augmented Generation (RAG). I don't see it in wide use yet, but it may turn out to be more useful than fine-tuning for a lot of situations.

>>divbze+HE
It's one of the most widely fine-tuned models for now. Take a look at this Colab for fine-tuning on your own dataset: https://github.com/mlabonne/llm-course/blob/main/Fine_tune_L...

>>refulg+ms
You already have TVM for the cross-platform stuff.

see https://tvm.apache.org/docs/how_to/deploy/android.html

or https://octoml.ai/blog/using-swift-and-apache-tvm-to-develop...

or https://github.com/mlc-ai/mlc-llm

>>yumraj+Y61
Mixture of Experts Model - https://en.wikipedia.org/wiki/Mixture_of_experts

>>thewat+YS
I stumbled upon OpenRouter[0] a few days ago. Easiest I’ve seen by far (if you want SaaS, not hosting it yourself).

[0] https://openrouter.ai

>>all2+hh1
Sure! I'm building a personalized AI language learning tutor using OpenAI's API and ElevenLabs (for Text to Speech).

Right now it's basically a chat bot that you can use to practice conversing with. It provides corrections for the things you type. Eventually I'd like to try adding Whisper as well to allow users to speak out loud.

When you hover over a word, you get a translation. Initially I thought using OpenAI for every word translation would be too much, but I've been able to get it down to ~36-40 tokens/request (3-4 cents/1000 requests). I also began parsing and uploading some of this Wiktionary data (https://kaikki.org/dictionary/rawdata.html) and am working on a feature that integrates the GPT-3.5 translation with this Wiktionary data.
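
As a sanity check on those numbers, the sketch below backs out the per-token price they imply. It uses only the figures quoted above; how the ends of the two ranges pair up is not stated, so all four combinations are shown, and the function name is hypothetical.

```python
def implied_usd_per_million_tokens(tokens_per_request, cents_per_1000_requests):
    """Back out the per-token price implied by a per-request cost."""
    tokens_per_1000_requests = tokens_per_request * 1000
    usd_per_1000_requests = cents_per_1000_requests / 100
    return usd_per_1000_requests / tokens_per_1000_requests * 1_000_000


for tokens in (36, 40):
    for cents in (3, 4):
        price = implied_usd_per_million_tokens(tokens, cents)
        print(f"{tokens} tokens/request at {cents} cents/1000 requests -> ${price:.2f} per million tokens")
```

The implied price lands at around $1 per million tokens across the quoted ranges.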

A lot of these features are still in the works but you can feel free to try it if you like (https://trytutor.app).

>>apstls+uG1
Lots. Llama 2 was trained on 4K context windows but can run on arbitrary lengths; the results just become garbage as you go longer.

I refer you to https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-c... for an "easy" to digest summary

>>NavinF+Vy
That seems mostly right, particularly for internal models, but I wonder about adding some ringers to prove that copying happened:

https://en.m.wikipedia.org/wiki/Trap_street

Also, it seems sort of like how cryptocurrency folks assumed their transactions were anonymous? It's an API, so they could log the calls. (Maybe not the contents.)

>>unoti+FG
Azure GPT 4 is already available in: Australia East, Canada East, East US, East US 2, France Central, Japan East, Sweden Central, Switzerland North, UK South (https://learn.microsoft.com/en-us/azure/ai-services/openai/c...)

>>behnam+S02
Yep! The linked notebook includes an example of exactly that (fine-tuning a 7b model to match the syntax of GPT-4 function call responses): https://github.com/OpenPipe/OpenPipe/blob/main/examples/clas...

>>bfirsh+Z71
We're working on LLM Engine (https://llm-engine.scale.com) at Scale, which is our open-source, self-hostable framework for open-source LLM inference and fine-tuning. We have similar findings to Replicate: Llama 2 70B can be comparable to GPT-3.5 in price, etc. Would be great to discuss this further!

>>atleas+WX1
If we assume this is true: https://iv.nboeck.de/watch?v=K5iDUZPx60E&t=2989

Then there isn't anything in particular which makes their model(s) stand out. On the contrary, they seem rather inefficient, which is probably reflected in the inference cost this gargantuan conglomerate takes to run.

>>Anonym+Lt1
I don't think this is a promise Microsoft can make. The US CLOUD Act states that Microsoft falls under US jurisdiction and it's legally bound to share foreign data if asked by US law enforcement.

"The CLOUD Act asserts that U.S. data and communication companies must provide stored data for a customer or subscriber on any server they own and operate when requested by warrant, but provides mechanisms for the companies or the courts to reject or challenge these if they believe the request violates the privacy rights of the foreign country the data is stored in."

https://en.wikipedia.org/wiki/CLOUD_Act

>>carom+bG1
They do pretty well, except for the Room_641A in the building, which is allowed to do anything they want with the production branch without it being visible to ordinary workers.

https://en.m.wikipedia.org/wiki/Room_641A

>>ronyfa+wk
I’m actually a Replicate user. I have experimented with Llama 2 on Replicate and I have had a similar experience.

But you are totally correct about the pricing part: it can get expensive.

I’m running this photo service https://msdosimagetools.ngrok.dev/

It's doing 200+ photos every day and I’m using open-source models behind the scenes on Replicate. My costs are increasing day by day.

Plus this is hosted locally.

>>ronyfa+wk
It shouldn't be 100x. We've built an LLM API at Anyscale, and the price comparison works out as follows (per million tokens):

- Llama-2-70B: $1 (on Anyscale Endpoints [1])
- GPT-3.5-turbo: $1.50 - $2 (OpenAI [2])

[1] https://app.endpoints.anyscale.com/

[2] https://openai.com/pricing
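
Working the ratio out from the per-million-token prices quoted above (a small sketch; the constant names are just labels for this example):

```python
# Prices per million tokens, in US dollars, as quoted in the comment above.
LLAMA_2_70B_USD = 1.00
GPT_35_TURBO_USD_RANGE = (1.50, 2.00)

low_ratio, high_ratio = (price / LLAMA_2_70B_USD for price in GPT_35_TURBO_USD_RANGE)
print(f"GPT-3.5-turbo costs {low_ratio:.1f}x to {high_ratio:.1f}x as much as Llama-2-70B")
```

So on these list prices the gap is a factor of 1.5 to 2, not 100.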

>>haxton+rQ
These sites say 154B:

https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-tu...

https://blog.wordbot.io/ai-artificial-intelligence/gpt-3-5-t...

>>kcorbi+aF
Axolotl takes a lot of formats, and not all of them are in the form of input/output.

"Completion" format only takes a single text value per dataset record. Some other formats are in the form of multiple choice answers, etc.

Take a look below (there are more formats in "see other formats"): https://github.com/OpenAccess-AI-Collective/axolotl#dataset
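
To illustrate the difference between an input/output record and a record with a single text value, here is a rough sketch that flattens one into the other. The field names (`input`, `output`, `text`) and the prompt template are assumptions for illustration, not Axolotl's documented schema; check the linked README for the exact formats it accepts.

```python
def to_completion_record(pair, template="{input}\n### Response:\n{output}"):
    """Flatten an input/output pair into a record with a single text value."""
    return {"text": template.format(input=pair["input"], output=pair["output"])}


# Hypothetical example record.
pair = {"input": "Classify this recipe: Chocolate chip cookies", "output": "dessert"}
record = to_completion_record(pair)
print(record["text"])
```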

>>ttt3ts+oz1
From people hosting image generation models on Stable Horde I've heard that you can pretty severely underclock/undervolt your GPUs and keep them stable, massively reducing heat output and energy cost without losing nearly as much performance. I'm not sure if this transfers to text generation or not, since this was from image generation workers that have a few seconds of downtime between requests; however, it might be worth a bit of research if you happen to be running consumer GPUs.

-----

From TheUnamusedFox, in August:

> 3090 down to ~260-270 watts (from 400) with minimal gen speed impact. Same with a 3080ti. It seems to be more stable with image generation than gaming, at least on my two cards. If I try to game or benchmark with this undervolt it is an instant crash.

From another user:

> this undervolting stuff is pretty sweet.
> undervolted_limits.png [1]
> max_power_limits.png [2]
> this is my before and after.
> a solid 200 watt drop for only 9.2% loss of performance
> not to mention the 30 degree drop in temps

[1]: https://cdn.discordapp.com/attachments/1143237412663869570/1...

[2]: https://cdn.discordapp.com/attachments/1143237412663869570/1...
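
A quick way to put the quoted numbers in perspective is to compute the power reduction and the change in throughput per watt. In the sketch below, the 400 W to 260-270 W figures come from the first quote; the second quote gives only a 200 W drop and a 9.2% performance loss, so the 400 W baseline used for it is an assumption for illustration. Function names are hypothetical.

```python
def power_reduction(baseline_watts, undervolted_watts):
    """Fractional drop in power draw."""
    return (baseline_watts - undervolted_watts) / baseline_watts


def perf_per_watt_gain(baseline_watts, undervolted_watts, perf_loss):
    """Relative change in throughput per watt after undervolting."""
    relative_perf = 1 - perf_loss
    relative_power = undervolted_watts / baseline_watts
    return relative_perf / relative_power - 1


# First quote: 400 W down to 260-270 W.
for watts in (270, 260):
    print(f"400 W -> {watts} W: {power_reduction(400, watts):.1%} less power")

# Second quote: 200 W drop for a 9.2% performance loss. The 400 W baseline
# is an assumption for illustration; the quote does not state it.
gain = perf_per_watt_gain(400, 200, 0.092)
print(f"throughput per watt: {gain:+.1%}")
```

With a different baseline wattage the efficiency gain changes, so plug in your own card's measured numbers.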
