zlacker

[return to "Fine-tune your own Llama 2 to replace GPT-3.5/4"]
1. minima+08[view] [source] 2023-09-12 17:33:02
>>kcorbi+(OP)
> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!

These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a fine-tuned 7B-param model running at acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, you then have scalability issues, since A100s are scarce. Also, the "50x cheaper" figure assumes 100% utilization of the GPU, which will never happen in production use cases.
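To see why utilization dominates the comparison, here's a back-of-envelope sketch. All the numbers (GPU hourly rate, throughput) are illustrative assumptions, not measured figures:

```python
# Effective per-token cost of a self-hosted GPU at a given utilization.
# Assumed inputs: ~$2/hr for an A100, ~1500 tok/s aggregate throughput
# for a 7B model -- both placeholders, plug in your own measurements.

def self_hosted_cost_per_1k_tokens(gpu_hourly_usd, tokens_per_second, utilization):
    """Cost per 1K tokens when the GPU is only busy `utilization` of the time."""
    effective_tokens_per_hour = tokens_per_second * utilization * 3600
    return gpu_hourly_usd / effective_tokens_per_hour * 1000

for util in (1.0, 0.5, 0.1):
    cost = self_hosted_cost_per_1k_tokens(2.0, 1500, util)
    print(f"utilization {util:.0%}: ${cost:.5f} per 1K tokens")
```

The cost per token scales inversely with utilization: at 10% utilization your effective price is 10x the headline number, which is where cheap-GPU comparisons quietly fall apart.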

Quality-wise, a fine-tuned Llama 2 is not necessarily better than ChatGPT. Fine-tuning requires a high-quality dataset, which is not easy to construct. And in my own experience fine-tuning Llama 2, it was qualitatively more frustrating to get outputs on par with just using ChatGPT.

The value of the ChatGPT API is more dependable scaling and not having to pay for infra.

◧◩
2. moonch+Ka[view] [source] 2023-09-12 17:45:01
>>minima+08
We are talking about 7B models? Those can run on consumer GPUs with lower latency than A100s, AFAIK (because gaming GPUs are clocked higher).

Not to mention OpenAI has shit latency and terrible reliability - you should be using the Azure-hosted models if you care about that - but the pricing is also higher.

I would say fixed costs and development time favor OpenAI, but I've seen people post great practical comparisons of latency and cost using hosted fine-tuned small models.

◧◩◪
3. minima+vb[view] [source] 2023-09-12 17:47:35
>>moonch+Ka
"Running" and "acceptable inference speed and quality" are two different constraints, particularly at scale/production.
◧◩◪◨
4. moonch+K71[view] [source] 2023-09-12 21:13:20
>>minima+vb
I don't understand what you're trying to say?

From what I've read, a 4090 should blow an A100 away if you can fit within 22GB of VRAM, which a 7B model should comfortably.
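The weights arithmetic is easy to check. A rough sketch (it deliberately ignores KV cache, activations, and framework overhead, which add a few GB on top):

```python
# Approximate VRAM needed for model weights alone, by precision.
# bytes-per-param: fp16 = 2, int8 = 1, 4-bit quantization = 0.5.

def weight_vram_gb(n_params_billion, bytes_per_param):
    """GB of VRAM for the weights of an n-billion-parameter model."""
    return n_params_billion * bytes_per_param

for name, bpp in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"7B @ {name}: ~{weight_vram_gb(7, bpp):g} GB of weights")
```

So a 7B model is ~14 GB in fp16, leaving headroom within 22GB for the KV cache, and far less if you quantize.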

And the latency (along with variability and availability) on OpenAI API is terrible because of the load they are getting.
