There has been a lot of interest on HN in fine-tuning open-source LLMs recently (eg. Anyscale's post at https://news.ycombinator.com/item?id=37090632). I've been playing around with fine-tuning models for a couple of years, and wanted to share some insights and practical code. I’ve condensed what I’ve learned into a small set of notebooks at https://github.com/OpenPipe/OpenPipe/tree/main/examples/clas..., covering labeling data, fine-tuning, running efficient inference, and evaluating costs/performance. The 7B model we train here matches GPT-4’s labels 95% of the time on the test set, and for the 5% of cases where they disagree it’s often because the correct answer is genuinely ambiguous.
What is fine-tuning? You can think of it as a more-powerful form of prompting, where instead of writing your instructions in text you actually encode them in the weights of the model itself. You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn. Fine-tuning can work with as few as 50 examples but I usually try to get 1000+ if possible.
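To make "example input/output pairs" concrete, here's a minimal sketch of what a training file could look like, using the recipe-classification task from the notebooks as the example. The field names (instruction/input/output) are just one common convention and the records are made up; check what your fine-tuning framework actually expects.

```python
# A minimal sketch of input/output pair training data written as JSONL
# (one JSON object per line), which most fine-tuning tools accept.
# Field names follow the common instruction/input/output convention;
# the records themselves are illustrative.
import json

examples = [
    {
        "instruction": "Classify the recipe as 'vegetarian' or 'non-vegetarian'.",
        "input": "Grilled chicken with lemon and rosemary...",
        "output": "non-vegetarian",
    },
    {
        "instruction": "Classify the recipe as 'vegetarian' or 'non-vegetarian'.",
        "input": "Chickpea and spinach curry with basmati rice...",
        "output": "vegetarian",
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```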
Prompting still has some big advantages over fine-tuning. It's way easier and faster to iterate on your instructions than to label data and re-train a model. And operationally it's easier to deploy one big model and just adjust its behavior as necessary vs. deploying many small fine-tuned models that will likely each get lower utilization.
Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!
For example, classifying the 2M recipes at https://huggingface.co/datasets/corbt/all-recipes with GPT-4 would cost $23k. Even with GPT-3.5 it would cost over $1k. The model we fine-tuned performs similarly to GPT-4 and costs just $19 to run over the entire dataset.
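For anyone who wants to sanity-check those numbers, here's the back-of-envelope math. The per-recipe token counts are my own rough assumptions (not measured from the dataset), and the prices are the published per-1K-token API prices at the time of writing.

```python
# Back-of-envelope cost check under assumed token counts.
recipes = 2_000_000
input_tokens_per_recipe = 350    # assumption: prompt + recipe text
output_tokens_per_recipe = 15    # assumption: a short classification label

prices_per_1k = {                # (input, output) $ per 1K tokens, mid-2023 list prices
    "gpt-4": (0.03, 0.06),
    "gpt-3.5-turbo": (0.0015, 0.002),
}

for model, (p_in, p_out) in prices_per_1k.items():
    cost = recipes * (input_tokens_per_recipe / 1000 * p_in
                      + output_tokens_per_recipe / 1000 * p_out)
    print(f"{model}: ${cost:,.0f}")
# Roughly $23k for GPT-4 and ~$1.1k for GPT-3.5, in line with the figures above.
```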
Disclaimer: My brother David and I are working on an open-source product called OpenPipe (https://github.com/openpipe/openpipe) to help engineers adopt fine-tuning as simply as possible. But none of the information above depends on our startup. The current post is just about sharing information that we’ve learned about fine-tuning. I hope it’s useful!
My other thought on extending this: you could make it seamless. To start, it would simply pipe the user's requests to OpenAI or their existing model, so it'd be a drop-in replacement. Then, every so often, it would offer the user: "hey, we think there's enough data at this point that a fine-tune might save you approx $x/month based on your current calls; click the button to start the fine-tune and we'll email you once we have the results". The user then gets the email: "here are the results; based on them we recommend switching, click here to switch to calling your fine-tuned model". Helicone and the other monitoring platforms could also offer something similar.

(Side note: I'm working on an "AI infra handbook" aimed at technical people in software orgs who are looking to deploy unspecified "AI" features and trying to figure out what to do and what resources they'll need. It's a 20+ page Google Doc; if anyone can help me review what I have so far, please let me know and I'll add you.)
If it's latency/error/speed competitive, and cheaper, and equivalently accurate, then for anyone doing production-scale LLM API usage it'd make sense to use something like this: either the fine-tune is worse, so you keep using the regular API, or the fine-tune has parity plus a cost and/or speed advantage, so you switch. (It wouldn't make sense at prototyping scale, because the additional complexity of the switch wouldn't be worth it unless it could save you four or five figures or more a year in API costs, I'd think.)
I didn't allow them to use my output to train theirs either, so fuck 'em.
These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a fine-tuned 7B-parameter model running at acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, you have scalability issues, since A100s are scarce. Also, the "50% cheaper" assumes 100% utilization of a GPU, which will never happen in production use cases.
Quality-wise, a fine-tuned Llama 2 is not necessarily better than ChatGPT. Fine-tuning requires a high-quality dataset, which is not easy to construct. And in my own experience fine-tuning Llama 2, it was qualitatively more frustrating to get outputs on par with just using ChatGPT.
The value of the ChatGPT API is more dependable scaling and not having to pay for your own infra.
Not to mention OpenAI has shit latency and terrible reliability - you should be using Azure models if you care about that - but pricing is also higher.
I would say fixed costs and development time favor OpenAI, but I've seen people post great practical comparisons of latency and cost using hosted fine-tuned small models.
That said, you should be able to fine-tune a 70B model on an A100 using QLoRA. However, depending on the specifics of your dataset it might actually be cheaper to run on an 8xA100 machine since that way you don't have to swap any weights out to the machine's non-GPU memory, and you might get enough time savings from that that the more expensive machine pays for itself.
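To make the QLoRA route concrete, here's a minimal setup sketch using HuggingFace transformers + peft + bitsandbytes. The hyperparameters and target modules are illustrative, not a recommendation, and you'd still need to add a dataset and training loop on top (or use a framework like Axolotl that wraps all of this).

```python
# Minimal QLoRA setup sketch: 4-bit base model + LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-70b-hf"          # gated model; requires access approval

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # the "Q" in QLoRA: 4-bit NF4 quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=bnb_config,
    device_map="auto",                      # spills layers to CPU RAM if they don't fit
)

model = prepare_model_for_kbit_training(model)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "v_proj"],     # which projections to adapt is a choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # only the small LoRA adapters get trained
```

The device_map="auto" spill-to-CPU behavior is one form of the weight swapping mentioned above that can make a single A100 slower than it looks.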
The 50x cheaper (that's 2% of the cost, not 50% of the cost) number does assume 100% GPU utilization, which may or may not be realistic for your use case. If you're doing batch processing as part of a data pipeline, which is not an unusual use case, you can run your GPU at 100% utilization and turn it off when the batch finishes.
If you've got a highly variable workload then you're right, you'll have much lower utilization numbers. But if you work with an aggregator that can quickly hot swap LoRA fine-tunes (as a disclaimer, my company OpenPipe works in this space) you can get back a lot of that lost efficiency since we can increase/decrease GPU capacity only when our aggregate usage changes, which smooths things out.
You just described our short-term roadmap. :) Currently an OpenPipe user has to explicitly kick off a fine-tuning job, but they're so cheap to run we're planning on letting users opt in to running them proactively once they have enough data so we can provide exactly that experience.
Take their example of running the LLM over the 2 million recipes and saving $23k versus GPT-4. That could easily be 2 million documents in some back-end system running in a batch. Many people would wait a few days or weeks for a job like that to finish if it offered significant savings.
It also demonstrates, though, why the economics are complicated and there's no one-size-fits-all.
For about 1000 input tokens (and resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than Llama 2.
Llama 7B wasn't up to the task fyi, producing very poor translations.
I believe that OpenAI priced GPT-3.5 aggressively cheap in order to make it a no-brainer to rely on them rather than on other vendors (even open-source models).
I'm curious to see if others have gotten different results?
Interested in helping out.
You're better off using models specialized in translation; general-purpose LLMs are more useful when fine-tuned for specific tasks (some form of extraction, summarization, generative tasks, etc.), or for general chatbot-like uses.
Also, it's worth noting that Replicate started out with a focus on image generation, and their current inference stack for LLMs is extremely inefficient. A significant fraction of the 100x cost difference you mentioned can be made up by using an optimized inference server like vLLM. Replicate knows about this and is working hard on improving their stack, it's just really early for all of us. :)
Check out what Matt Shumer put together as well: https://github.com/mshumer/gpt-llm-trainer.
I have used his trainer for auto-distillation of GPT-4 into GPT-3.5 fine-tunes, but I plan to do the same for Llama as well.
Cheers!
From what I've read and personally experimented with, none of the Llama 2 models are well-suited to translation in particular (they were mainly trained on English data). Still, there are a number of tasks that they're really good at if fine-tuned correctly, such as classification and data extraction.
> I believe that OpenAI priced GPT-3.5 aggressively cheap in order to make it a no-brainer to rely on them rather than on other vendors (even open-source models).
I think you're definitely right about that, and in most cases just using GPT-3.5 for one-off tasks makes the most sense. I think when you get into production workflows that scale, that's when using small fine-tuned models starts making more sense. You can drop the system prompt and get data back in the format you expect, and train on GPT-4's output to sometimes get better accuracy than 3.5 would give you right off the bat. And keep in mind, while you can do the same thing with a fine-tuned 3.5 model, it's going to cost 8x the base 3.5 price per token.
I was surprised by how fast it runs on an M2 MBP + llama.cpp; Way way faster than ChatGPT, and that's not even using the Apple neural engine.
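For anyone curious what that looks like in code, here's a rough local-inference sketch using llama-cpp-python (a Python binding for llama.cpp). The model path is a placeholder for whatever quantized GGUF/GGML file you've downloaded, and n_gpu_layers=-1 offloads everything to Metal on M-series Macs.

```python
# Rough sketch of local Llama inference on Apple Silicon via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=-1,   # offload all layers to the GPU (Metal on M-series Macs)
    n_ctx=2048,
)

out = llm("Q: Name three common uses for fine-tuned small models. A:", max_tokens=128)
print(out["choices"][0]["text"])
```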
You said 50-1000 examples.
Do I fine-tune when I have specific Q&A sets, like from real customers, and want to add the right answers to the model?
Do I fine-tune on facts, or should I use some kind of lookup?
Does it make sense to add code and API docs for a current version of something I want better support for? Like, ChatGPT knows Quarkus 2 but not Quarkus 3.
I believe this because LLaMa-2 13B is more than good enough to handle what I call "quick search", i.e.
```
User: "What's the weather in Milwaukee?"
System: Here's some docs, answer concisely in one sentence.
AI: It's 73 degrees Fahrenheit.
```
YMMV on cost still, depends on cloud vendor, and my intuition agrees with yours: GPT-3.5 is priced low enough that there isn't a case where it makes sense to use another model. It strikes me now that there's a good reason for that intuition: OpenAI's $/GPU-hour is likely <= any other vendor's, and inference time of LLaMa 2 ~= GPT.
I do think this will change with local LLMs. They've been way over-hyped for months, but after LLaMa 2, the challenges remaining are more sociological than technical.
For months now it's been one-off $LATEST_BUZZY_MODEL.c stunts that run on desktop.
The vast majority of the _actual_ usage and progress is coming from porn-y stuff, and the investment occurs in one-off stunts.
That split of effort, and lack of engineering rigor, is stunting progress overall.
Microsoft has LLaMa-2 ONNX available on GitHub[1]. There's budding but very small projects in different languages to wrap ONNX. Once there's a genuine cross-platform[2] ONNX wrapper that makes running LLaMa-2 easy, there will be a step change. It'll be "free"[3] to run your fine-tuned model that does as well as GPT-4.
It's not clear to me exactly when this will occur. It's "difficult" now, but only because the _actual usage_ in the local LLM community doesn't have a reason to invest in ONNX, and it's extremely intimidating to figure out how exactly to get LLaMa-2 running in ONNX. Microsoft kinda threw it up on GitHub and moved on, the sample code even still needs a PyTorch model. I see at least one very small company on HuggingFace that _may_ have figured out full ONNX.
Funnily enough, ONNX is getting a spike in mindshare over the last month in the _Stable Diffusion_ community. There's decent cross-pollination between local art and local LLMs, ex. LoRA's were first a thing for Stable Diffusion. So I'm hoping we see this sooner rather than later.
[1] https://github.com/microsoft/Llama-2-Onnx
[2] Definition of cross-platform matters a ton here, what I mean is "I can import $ONNX_WRAPPER_LIB on iOS / Android / Mac / Windows and call Llama2.reply(String prompt, ...)"
[3] Runs on somebody else's computer, where "somebody else" is the user, instead of a cloud vendor.
cat new_data.txt | finetune model.file > new_model.file

As you go up the hierarchy, what you want is higher-quality answers to more and more abstract and general questions.
AGI, God, CEOs, and figures like Paul Graham, Elon Musk etc.. all answer to various degrees the ultimate abstract question of "What is the meaning of gestures wildly at everything"
Cost efficiency and commoditization basically increases "how" capacity at the cost of "why" capacity
So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.
In general, fine-tuning helps a model figure out how to do the exact task that is being done in the examples it's given. So fine-tuning it on 1000 examples of an API being used in the wild is likely to teach it to use that API really effectively, but fine-tuning it on just the API docs probably won't.
That said, there are a lot of interesting ideas floating around on how to most effectively teach a model purely from instructions like API docs. Powerful models like GPT-4 can figure it out from in-context learning (ie. if you paste in a page of API docs and ask GPT-4 to write something with the API it can usually do a decent job). I suspect the community will figure out techniques either through new training objectives or synthetic training data to do it for smaller fine-tuned models as well.
Llama 2 can also pick up the function-call format, given sufficient training data that contains function-call responses, though you'll then have to parse the returned object out of the text-based response.
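Since the call comes back as plain text, the parsing step ends up being something like the sketch below. The `CALL name {json}` format is purely illustrative; you'd use whatever format your fine-tuning data taught the model to emit.

```python
# Sketch of pulling a function-call-style object out of a text completion.
# Assumes the model was trained to emit e.g.: CALL get_weather {"city": "Milwaukee"}
import json
import re

def parse_function_call(completion: str):
    match = re.search(r"CALL\s+(\w+)\s+(\{.*\})", completion, re.DOTALL)
    if not match:
        return None  # the model answered in plain prose instead
    name, raw_args = match.group(1), match.group(2)
    try:
        return name, json.loads(raw_args)
    except json.JSONDecodeError:
        return None  # malformed JSON happens; retry or fall back

print(parse_function_call('CALL get_weather {"city": "Milwaukee"}'))
# ('get_weather', {'city': 'Milwaukee'})
```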
ToS is unenforceable and irrelevant to anyone that's in this space
Are fine-tuning datasets required to be input/output pairs? Or instead, can the fine-tuning be autoregressive (predict the next token throughout this corpus of unlabeled documents)?
For foreign language corrections ("correct this German sentence and give a reason for the correction"), GPT-3.5 doesn't quite have the horsepower so I use GPT-4
There is even a crowdsourced version of the UI like artbot: https://lite.koboldai.net/#
And there are some excellent extant fine-tuning frameworks, like Axolotl, that run on consumer GPUs: https://github.com/OpenAccess-AI-Collective/axolotl
IIRC text-generation-webui had a QLoRA fine-tuning UI too.
What I am saying is that it's already like Stable Diffusion, but the community is just somewhat under the radar, and fine-tuning will never be quite as turnkey as DreamBooth/SD 1.5 LoRA due to the nature of the training data.
Longer-term, we'd love to expand the selection of base models to include specialized LLMs that are particularly good at a certain task, e.g. language translation, and let you train off of those as well. Providing a ton of specialized starting models will decrease the amount of training data you need, and increase the number of tasks at which fine-tuned models can excel.
You either need a backend with good batching support (vLLM), or if you don't need much throughput, an extremely low end GPU or no GPU at all for exLlama/llama.cpp.
OpenAI benefits from quantization/batching, optimized kernels and very high utilization on their end, so the huge price gap vs a default HF Transformers instance is understandable. But even then, you are probably right about their aggressive pricing.
As for quality, you need a Llama model fine-tuned on the target language (many already exist on Hugging Face) and possibly a custom grammar if your backend supports it.
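For reference, here's roughly what the vLLM route mentioned above looks like; its continuous batching is what keeps GPU utilization high. The model name and sampling settings are illustrative.

```python
# Minimal vLLM batch-inference sketch.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")   # or your fine-tuned checkpoint
params = SamplingParams(temperature=0.0, max_tokens=64)

prompts = [f"Translate to German: {s}" for s in ["Good morning.", "Where is the station?"]]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```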
You'll never get actual economics out of switching to open models without running your own hardware. That's the whole point. There's orders of magnitude difference in price, where a single V100/3090 instance can run llama2-70b inference for ~$0.50/hr.
Yes, and it doesn't even come close. Llama2-70b can run inference at 300+tokens/s on a single V100 instance at ~$0.50/hr. Anyone who can should be switching away from OpenAI right now.
As a practical matter though, most of the fine-tuning frameworks, including Axolotl (which this guide uses) and HuggingFace's SFTTrainer (the actual fine-tuning trainer most frameworks use under the hood), assume your data comes in input/output pairs, and automatically insert a separator token to let the model know that the input has finished and it should start generating the output. In general most tasks can be formulated this way, including autocomplete tasks, so I'd probably recommend going that way unless you have a very strong reason not to.
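To illustrate the pair-plus-separator idea, here's a rough sketch with trl's SFTTrainer. The exact argument names have shifted across trl versions, so treat them as approximate; the separator string itself is an arbitrary choice, as long as you use the same one at inference time.

```python
# Sketch: format input/output pairs into single training strings with a separator.
from datasets import Dataset
from trl import SFTTrainer

SEPARATOR = "\n### Response:\n"   # arbitrary, but be consistent at inference time

pairs = Dataset.from_list([
    {"input": "Classify: Chickpea and spinach curry", "output": "vegetarian"},
    {"input": "Classify: Grilled chicken with lemon", "output": "non-vegetarian"},
])

def format_pairs(batch):
    # One training string per row: input + separator + output.
    return [i + SEPARATOR + o for i, o in zip(batch["input"], batch["output"])]

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",   # or something smaller for a dry run
    train_dataset=pairs,
    formatting_func=format_pairs,
    max_seq_length=512,
)
trainer.train()
```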
But if you need text generation and are ok with a 7B+ parameter model, Llama 2 or one of its derivatives is what I'd strongly recommend. The community around it is much larger than any of the alternatives so the tooling is better, and it's either state of the art or close to it on all evals when compared to other similarly-sized open models.
If you're comfortable sharing more details of the task you're trying to do I might be able to give more specific advice.
(Disclaimer: I work in the cloud organization at Microsoft, and these are totally my own thoughts and opinions and don't reflect any kind of inside knowledge I have. I think I can say that provisioning LLM capacity and GPU's is something we basically all have a tremendous amount of passion about.)
So you'll have to figure out how to run/scale the model inference. Cloud GPU instances are generally very expensive, and once you start needing to horizontally scale it'll get messy fast.
At least at the moment it's expensive, especially if it's either very light usage or very intensive usage - you either need just a few seconds of compute occasionally, or lots of compute all the time requiring scaling.
The "lucky" ones in this scenario are small-medium businesses that can use one or a few cards on-site for their traffic. Even then when you take the cost of an A100 + maintaining it, etc. OpenAI's offering still looks attractive.
I know there are a few services that try to provide an API similar to what OpenAI has, and some software to self-orchestrate it. I'm curious how those compare...
This guy used gradient.ai and has a Google Colab to try it
It's cheaper than the ELECTRICITY cost of running a Llama 70B on your own M1 Max (a very energy-efficient chip), assuming free hardware.
I guess they are also getting a pretty good cache hit rate - there are only so many questions people ask at scale. But still, it's dumping.
I just don't see it.
It depends a lot on what you're trying to do. If you have a focused use case for the type of fine-tuning you want, you can probably get away with one of the smaller models.
Another thing to look into is Retrieval-Augmented Generation (RAG). I don't see it in wide use yet, but it may turn out to be more useful than fine-tuning for a lot of situations.
hacker news pantheon just dropped
Which is your go to?
They have lots of money now and the market lead. They want to keep the lead and some extra electricity and hardware costs are surely worth it for them, if it keeps the competition from getting traction.
That's an exercise left to the reader for now, and is where your value/moat lies.
It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.
see https://tvm.apache.org/docs/how_to/deploy/android.html
or https://octoml.ai/blog/using-swift-and-apache-tvm-to-develop...
Hopefully more on-demand services enter the space. Currently where I am we don't have the resources for any type of self orchestration and our use case is so low/sporadic that we can't simply have a dedicated instance.
Last I saw the current services were rather expensive but I should recheck.
For a couple dozen languages, GPT-4 is by far the best translator you can get your hands on, so basically no.
Are there any well directed courses available?
It gets expensive fast, but not messy, these things scale horizontally really well. All the state is encapsulated in the request, no replication, synchronisation, user data to worry about. I'd rather have the job of horizontally scaling llama2 than a relational database.
From what I've read, a 4090 should blow an A100 away if you can fit within ~22GB of VRAM, which a 7B model comfortably does.
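A rough way to sanity-check the "fits in 22GB" claim: at fp16 the weights alone are about 2 bytes per parameter, and quantization shrinks that further. The sketch below ignores KV cache and activation overhead, which grow with context length, so treat it as a lower bound.

```python
# Rough VRAM lower bound for model weights on a 24GB 4090 (~22GB usable).
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for precision, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"7B @ {precision}: ~{weights_gb(7, bytes_per_param):.1f} GB")
# 7B @ fp16: ~13.0 GB, 8-bit: ~6.5 GB, 4-bit: ~3.3 GB
```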
And the latency (along with variability and availability) on OpenAI API is terrible because of the load they are getting.
OpenAI aren't doing anything magic. We're optimizing Llama inference at the moment and it looks like we'll be able to roughly match GPT 3.5's price for Llama 2 70B.
Running a fine-tuned GPT-3.5 is surprisingly expensive. That's where using Llama makes a ton of sense. Once we’ve optimized inference, it’ll be much cheaper to run a fine-tuned Llama.
My thing is that doing all that dynamically is still a lot of work compared to just calling a single endpoint where all of that is handled for you.
But for sure this is a very decent horizontal use-case.
The pricing on OpenPipe says it's $0.0012 to $0.0016 per 1K tokens for Llama 7B. GPT-3.5 pricing is $0.0015 to $0.002, so not that different.
I'm assuming the 50x cost reductions are primarily from self-hosting?
Very inflated claim when it comes to GPT-4, since it is an MoE model with 8 separate models, each an expert in one area, and you can't replace all 8 models with one model trained for $19.
I call BS on this claim. Maybe it matches GPT-4 in the narrow domain you fine-tune it for, but if that can be done for $19, then for $19*8 you could take OpenAI out of business. That doesn't add up.
Maybe start this way from the ground up, so you can get modular units for health, finance, programming, education, writing assistance, philosophy, ethics, etc. If the modules can be swapped, then one might be able to reduce their size, e.g. pick 2 or 3, chain them, and you have an LLM for a specific area of interest (reducing running cost).
For autocomplete tasks, with a corpus of unlabeled documents, would you insert a separator token at an arbitrary space in each document, in order to form input/output pairs?
Right now it's basically a chat bot that you can use to practice conversing with. It provides corrections for the things you type. Eventually I'd like to try adding Whisper as well to allow users to speak out loud.
When you hover over a word, you get a translation. Initially I thought using Open AI for every word translation would be too much, but I've been able to get it down to ~36-40 tokens/request. (3-4 cents/1000 requests). I also began parsing and uploading some of this [Wiktionary data](https://kaikki.org/dictionary/rawdata.html) and am working on a feature that integrates the GPT-3.5 translation with this Wiktionary data.
A lot of these features are still in the works but you can feel free to try it if you like (https://trytutor.app).
I built two such systems after burning that much in a week on ChatGPT.
Core speed and memory bandwidth matter a lot. This is on a Ryzen 7950 with DDR5.
Do you believe Microsoft can actually make the same promises and keep them? You don't have to answer the last question, of course, but please think about it. It doesn't matter where the LLM is located but who controls it and who holds the resulting data.
* Chenbro Rackmount 4U Server Chassis RM42300-F (rack-mount case; remove the air filter on the 120mm fan and put two decent 80mm exhaust fans at the rear).
* Two used air-cooled 3090s, about $650 a piece on eBay. Check slot width and make sure everything will fit on your motherboard. Do a burn-in when you get them, because used GPUs can be hit or miss.
* 5950X CPU (overkill, just had it).
* 128GB DDR4.
* Motherboard with X570 chipset and dual PCIe x16 slots. These will bifurcate to x8 PCIe 4.0 lanes for each GPU, which is enough bandwidth to push the GPUs to max IME.
* 1200W+ ATX power supply.
* eBay "u.2 pcie 3.84TB" drive and an adapter for the m.2 NVMe slot (again, what I had, and it's cheap).
If you're really going to beat on the thing, I would power-limit the 3090s to 320W (from 350W). The perf change is not really noticeable and it keeps temps better.
I refer you to https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-c... for an "easy" to digest summary
https://en.m.wikipedia.org/wiki/Trap_street
Also, it seems sort of like how cryptocurrency folks assumed their transactions were anonymous? It's an API, so they could log the calls. (Maybe not the contents.)
What are you doing!?
If you have any experiences to share, successes or failures, please do.
Then there isn't anything in particular that makes their model(s) stand out. On the contrary, they seem rather inefficient, which is probably reflected in the inference cost it takes this gargantuan conglomerate to run them.
Can I train it further using the project source to let the model "understand" the project context more?
"The CLOUD Act asserts that U.S. data and communication companies must provide stored data for a customer or subscriber on any server they own and operate when requested by warrant, but provides mechanisms for the companies or the courts to reject or challenge these if they believe the request violates the privacy rights of the foreign country the data is stored in."
Can you share your system specs? I was looking into something similar but my costs were closer to 6 to 8k for the whole system.
But you are totally correct about the pricing part: it can get expensive.
I’m running this photo service https://msdosimagetools.ngrok.dev/
It's doing 200+ photos every day, and I'm using open-source models behind the scenes on Replicate. My costs are increasing day by day.
Plus this is hosted locally
- Fine-tuning: Difficult, time-consuming, slow, takes time to add new information, costs a lot more.
- RAG: Can be free if you use free options like Chroma, Weaviate, or Postgres with a vector plugin. Really fast. Once you set it up, you just need to upload a document, and it's available for GPT to answer with.
I'm using RAG for a client right now, and it was a breeze. Really easy, especially if you use something like Langchain. Compared to fine-tuning, it's a lot easier, cheaper, and faster...
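For a sense of how little is involved, here's a bare-bones RAG sketch using Chroma directly (LangChain wraps steps like this). Chroma uses its default embedding model here, and the collection contents and question are made up.

```python
# Bare-bones RAG sketch: embed docs, retrieve the closest ones, stuff them in a prompt.
import chromadb

client = chromadb.Client()
docs = client.create_collection("docs")

docs.add(
    ids=["faq-1", "faq-2"],
    documents=[
        "Refunds are processed within 5 business days.",
        "Support is available Monday through Friday, 9am-5pm CT.",
    ],
)

question = "How long do refunds take?"
hits = docs.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# The retrieved context plus the question then goes into the LLM prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```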
- Llama-2-70B: $1 (on Anyscale Endpoints [1])
- GPT-3.5-turbo: $1.50 - $2 (OpenAI [2])
[1] https://app.endpoints.anyscale.com/
[2] https://openai.com/pricing
TBC, I probably could have optimized tokens but contract was profitable and time critical.
For further reference you can lookup "next-token prediction objective".
"Completion" format only takes a single text value per dataset record. Some other formats are in the form of multiple choice answers, etc.
Take a look below (there are more formats in "see other formats") https://github.com/OpenAccess-AI-Collective/axolotl#dataset
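To illustrate the difference, here's a sketch of a pair-style record next to a completion-style record. The field names follow Axolotl's dataset-format conventions as I understand them, so double-check the docs linked above before relying on them.

```python
# Sketch: instruction/input/output pair record vs. a single-text "completion" record.
import json

pair_record = {
    "instruction": "Correct this German sentence and give a reason for the correction.",
    "input": "Ich habe gestern ins Kino gegangen.",
    "output": "Ich bin gestern ins Kino gegangen. ('gehen' takes 'sein', not 'haben'.)",
}

completion_record = {
    # One raw text blob; the model just learns next-token prediction over it.
    "text": "Yesterday I went to the cinema, and afterwards we got dinner nearby...",
}

with open("mixed_examples.jsonl", "w") as f:
    f.write(json.dumps(pair_record) + "\n")
    f.write(json.dumps(completion_record) + "\n")
```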
-----
From TheUnamusedFox, in August:
> 3090 down to ~260-270 watts (from 400) with minimal gen speed impact. Same with a 3080ti. It seems to be more stable with image generation than gaming, at least on my two cards. If I try to game or benchmark with this undervolt it is an instant crash.
From another user:
> this undervolting stuff is pretty sweet.
> undervolted_limits.png [1]
> max_power_limits.png [2]
> this is my before and after.
> a solid 200 watt drop for only 9.2% loss of performance
> not to mention the 30 degree drop in temps
[1]: https://cdn.discordapp.com/attachments/1143237412663869570/1...
[2]: https://cdn.discordapp.com/attachments/1143237412663869570/1...