Fine-tune your own Llama 2 to replace GPT-3.5/4

>>kcorbi+(OP)
> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!

These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a finetuned 7B param model running at an acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, then you have scalability issues since A100s are scarce. Also, the "50% cheaper" assumes 100% utilization of a GPU which will never happen in production use cases.

Quality-wise, a finetuned Llama 2 is not necessairly better than ChatGPT. Finetuning requires a high-quality dataset which is not easy to construct. And in my own experience with finetuning Llama 2, qualitivately it caused more frustration to get outputs on par with just using ChatGPT.

The value of the ChatGPT API is more dependable scaling and not having to pay for an infra.

>>minima+08
We're finding that when running Llama-2-7B with vLLM (https://github.com/vllm-project/vllm) on an A40 GPU we're getting consistently lower time-to-first-token and lower average token generation time than GPT-3.5, even when processing multiple requests in parallel. A40s are pretty easy to get your hands on these days (much easer than A100s anyway).

The 50x cheaper (that's 2% of the cost, not 50% of the cost) number does assume 100% GPU utilization, which may or may not be realistic for your use case. If you're doing batch processing as part of a data pipeline, which is not an unusual use case, you can run your GPU at 100% utilization and turn it off when the batch finishes.

If you've got a highly variable workload then you're right, you'll have much lower utilization numbers. But if you work with an aggregator that can quickly hot swap LoRA fine-tunes (as a disclaimer, my company OpenPipe works in this space) you can get back a lot of that lost efficiency since we can increase/decrease GPU capacity only when our aggregate usage changes, which smooths things out.

zlacker