Fine-tune your own Llama 2 to replace GPT-3.5/4

>>kcorbi+(OP)
> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!

These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a finetuned 7B param model running at an acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, then you have scalability issues since A100s are scarce. Also, the "50% cheaper" assumes 100% utilization of a GPU which will never happen in production use cases.

Quality-wise, a finetuned Llama 2 is not necessairly better than ChatGPT. Finetuning requires a high-quality dataset which is not easy to construct. And in my own experience with finetuning Llama 2, qualitivately it caused more frustration to get outputs on par with just using ChatGPT.

The value of the ChatGPT API is more dependable scaling and not having to pay for an infra.

>>minima+08
Doesn't this depend a lot on your application though? Not every workload needs low latency and massive horizontal scalability.

Take their example of running the llm over the 2 million recipes and saving $23k over GPT 4. That could easily be 2 million documents in some back end system running in a batch. Many people would wait a few days or weeks for a job like that to finish if it offered significant savings.

>>hereon+qf
That's more of a fair use case.

It though also demonstrates why the economics are complicated and there's no one-size-fits-all.

zlacker