zlacker

[parent] [thread] 10 comments
1. minima+(OP)[view] [source] 2023-09-12 17:33:02
> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!

These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a fine-tuned 7B-parameter model running at acceptable inference latency. And even if you use a GPU such as an A100 for maximum speed, you run into scalability issues, since A100s are scarce. Also, the "50% cheaper" figure assumes 100% utilization of a GPU, which will never happen in production use cases.

Quality-wise, a fine-tuned Llama 2 is not necessarily better than ChatGPT. Fine-tuning requires a high-quality dataset, which is not easy to construct. And in my own experience fine-tuning Llama 2, it was qualitatively more frustrating to get outputs on par with just using ChatGPT.

The value of the ChatGPT API is more dependable scaling and not having to pay for your own infra.

replies(3): >>moonch+K2 >>kcorbi+54 >>hereon+q7
2. moonch+K2[view] [source] 2023-09-12 17:45:01
>>minima+(OP)
We're talking about 7B models? Those can run on consumer GPUs with lower latency than A100s AFAIK (because gaming GPUs are clocked differently).

Not to mention OpenAI has shit latency and terrible reliability - you should be using the Azure-hosted models if you care about that - but pricing there is also higher.

I would say fixed costs and development time are on OpenAI's side, but I've seen people post great practical comparisons for latency and cost using hosted fine-tuned small models.

replies(2): >>minima+v3 >>7spete+ja
3. minima+v3[view] [source] [discussion] 2023-09-12 17:47:35
>>moonch+K2
"Running" and "acceptable inference speed and quality" are two different constraints, particularly at scale/production.
replies(1): >>moonch+KZ
4. kcorbi+54[view] [source] 2023-09-12 17:50:44
>>minima+(OP)
We're finding that when running Llama-2-7B with vLLM (https://github.com/vllm-project/vllm) on an A40 GPU we're getting consistently lower time-to-first-token and lower average token generation time than GPT-3.5, even when processing multiple requests in parallel. A40s are pretty easy to get your hands on these days (much easier than A100s, anyway).
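
For reference, a minimal sketch of that kind of setup using vLLM's offline inference API - the model name and sampling settings below are placeholders, not a tuned production config:

    from vllm import LLM, SamplingParams

    # Placeholder checkpoint; swap in your own fine-tuned Llama-2-7B weights.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    params = SamplingParams(temperature=0.7, max_tokens=256)

    # vLLM batches these prompts internally (continuous batching),
    # which is where most of the throughput advantage comes from.
    prompts = ["Summarize: ...", "Classify: ..."]
    for out in llm.generate(prompts, params):
        print(out.outputs[0].text)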

The 50x cheaper (that's 2% of the cost, not 50% of the cost) number does assume 100% GPU utilization, which may or may not be realistic for your use case. If you're doing batch processing as part of a data pipeline, which is not an unusual use case, you can run your GPU at 100% utilization and turn it off when the batch finishes.

If you've got a highly variable workload then you're right, you'll have much lower utilization numbers. But if you work with an aggregator that can quickly hot swap LoRA fine-tunes (as a disclaimer, my company OpenPipe works in this space) you can get back a lot of that lost efficiency since we can increase/decrease GPU capacity only when our aggregate usage changes, which smooths things out.
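
To make the utilization point concrete, a back-of-envelope sketch - every number in it is a hypothetical placeholder, not a measured figure:

    # Effective cost per 1K tokens as a function of GPU utilization.
    gpu_cost_per_hour = 1.10    # hypothetical hourly price for a rented A40
    tokens_per_second = 1500    # hypothetical aggregate throughput with batching

    for utilization in (1.0, 0.5, 0.1):
        tokens_per_hour = tokens_per_second * 3600 * utilization
        cost_per_1k = gpu_cost_per_hour / tokens_per_hour * 1000
        print(f"{utilization:.0%} utilization -> ${cost_per_1k:.4f} per 1K tokens")

The per-token price scales inversely with utilization, which is why the headline multiple only holds if you can keep the GPU busy.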

5. hereon+q7[view] [source] 2023-09-12 18:08:01
>>minima+(OP)
Doesn't this depend a lot on your application though? Not every workload needs low latency and massive horizontal scalability.

Take their example of running the LLM over the 2 million recipes and saving $23k over GPT-4. That could easily be 2 million documents in some back-end system, processed as a batch. Many people would wait a few days or weeks for a job like that to finish if it offered significant savings.
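
As a rough sanity check on that timescale (all of these numbers are hypothetical placeholders, not figures from their example):

    documents = 2_000_000
    tokens_per_document = 300   # hypothetical average output length per recipe
    throughput = 2_000          # hypothetical aggregate tokens/sec with batching on one box

    seconds = documents * tokens_per_document / throughput
    print(f"~{seconds / 86_400:.1f} days of wall-clock time")   # ~3.5 days

So even a single well-utilized GPU puts a job like that in the "days, not months" range.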

replies(1): >>minima+i8
6. minima+i8[view] [source] [discussion] 2023-09-12 18:12:01
>>hereon+q7
That's a fairer use case.

Though it also demonstrates why the economics are complicated and there's no one-size-fits-all answer.

7. 7spete+ja[view] [source] [discussion] 2023-09-12 18:21:41
>>moonch+K2
When you say it can run on consumer gpus, do you mean pretty much just the 4090/3090 or can it run on lesser cards?
replies(2): >>halfli+7j >>gsuuon+px
8. halfli+7j[view] [source] [discussion] 2023-09-12 18:46:59
>>7spete+ja
I was able to run the 4-bit quantized Llama 2 7B on a 2070 Super, though latency was so-so.

I was surprised by how fast it runs on an M2 MBP + llama.cpp; way, way faster than ChatGPT, and that's not even using the Apple Neural Engine.
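
For anyone who wants to reproduce this, a minimal sketch using the llama-cpp-python bindings rather than the raw llama.cpp CLI - the GGUF filename and settings are placeholders:

    from llama_cpp import Llama

    # Placeholder path to a 4-bit GGUF quantization of Llama 2 7B.
    llm = Llama(
        model_path="./llama-2-7b.Q4_K_M.gguf",
        n_gpu_layers=32,   # offload all 32 transformer layers to the GPU / Metal
        n_ctx=2048,
    )
    result = llm("Q: Name three uses for a small local model. A:", max_tokens=64)
    print(result["choices"][0]["text"])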

replies(1): >>hereon+3P
9. gsuuon+px[view] [source] [discussion] 2023-09-12 19:32:52
>>7spete+ja
Quantized 7Bs can run comfortably with 8 GB of VRAM.
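
Rough arithmetic on why that works (treating 4-bit weights as ~0.5 bytes per parameter; KV cache and activations are extra, but modest at short context lengths):

    params = 7e9
    weight_gb = params * 0.5 / 1e9    # ~3.5 GB for the 4-bit weights alone
    print(f"weights ≈ {weight_gb:.1f} GB")   # leaves headroom within an 8 GB card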
10. hereon+3P[view] [source] [discussion] 2023-09-12 20:32:22
>>halfli+7j
It runs fantastically well on an M2 Mac + llama.cpp; a variety of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.

It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.

11. moonch+KZ[view] [source] [discussion] 2023-09-12 21:13:20
>>minima+v3
I don't understand what you're trying to say?

From what I've read, a 4090 should blow an A100 away if you can fit within 22GB of VRAM, which a 7B model comfortably should.

And the latency (along with variability and availability) on the OpenAI API is terrible because of the load they are getting.
