zlacker

[return to "Fine-tune your own Llama 2 to replace GPT-3.5/4"]
1. ronyfa+wk[view] [source] 2023-09-12 18:29:55
>>kcorbi+(OP)
For translation jobs, I've experimented with Llama 2 70B (running on Replicate) vs. GPT-3.5.

For about 1000 input tokens (and resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than Llama 2.
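Back-of-envelope, a gap like that is plausible just from the billing models (per-token API pricing vs. per-second GPU rental). Every rate below is a placeholder assumption I'm plugging in for illustration, not an actual quote:

```python
# Illustrative only: per-token API billing vs. per-second GPU billing.
# All rates are placeholder assumptions, not any vendor's actual prices.
in_tokens, out_tokens = 1000, 1000

gpt35_cost = (in_tokens / 1000) * 0.0015 + (out_tokens / 1000) * 0.002  # ~$0.0035

gpu_per_second = 0.0023    # assumed per-second rate for a hosted A100
seconds_per_call = 150     # assumed time for a 70B model to emit ~1000 tokens
llama_70b_cost = gpu_per_second * seconds_per_call                     # ~$0.35

print(round(llama_70b_cost / gpt35_cost))  # roughly 100x on these made-up numbers
```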

Llama 2 7B wasn't up to the task, FYI; it produced very poor translations.

I believe OpenAI priced GPT-3.5 aggressively cheap in order to make it a no-brainer to rely on them rather than on other vendors (or even open-source models).

I'm curious to see if others have gotten different results.

2. refulg+ms[view] [source] 2023-09-12 18:50:21
>>ronyfa+wk
For use cases well within the capabilities of an LLM from last year, fine-tuned LLaMa 2 13B should/will blow ChatGPT out of the water: think "rate the sentiment of this text from 0-10".

I believe this because LLaMa-2 13B is more than good enough to handle what I call "quick search", i.e.

```
User: "What's the weather in Milwaukee?"

System: Here's some docs, answer concisely in one sentence.

AI: It's 73 degrees Fahrenheit.
```
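For a concrete sense of what that looks like locally, here's a minimal sketch of the same pattern using the llama-cpp-python bindings; the model file, the retrieved weather snippet, and the prompt template are all placeholders, not anything from the thread:

```python
# Minimal "quick search" sketch: stuff a retrieved doc into the prompt and
# ask a local Llama 2 13B chat model for a one-sentence answer.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-13b-chat.Q4_K_M.gguf", n_ctx=2048)  # placeholder path

docs = "NWS Milwaukee, 3pm: 73F, partly cloudy, wind 8 mph."  # placeholder retrieved doc
prompt = (
    "[INST] Using only the context below, answer in one sentence.\n"
    f"Context: {docs}\n"
    "Question: What's the weather in Milwaukee? [/INST]"
)

out = llm(prompt, max_tokens=64, temperature=0)
print(out["choices"][0]["text"].strip())  # e.g. "It's 73 degrees Fahrenheit."
```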

YMMV on cost still, since it depends on the cloud vendor, but my intuition agrees with yours: GPT-3.5 is priced low enough that there isn't a case where it makes sense to use another model. It strikes me now that there's a good reason for that intuition: OpenAI's $/GPU-hour is likely <= any other vendor's, and LLaMa 2's inference time is ~= GPT's.

I do think this will change with local LLMs. They've been way over-hyped for months, but after LLaMa 2, the challenges remaining are more sociological than technical.

For months now it's been one-off $LATEST_BUZZY_MODEL.c stunts that run on desktop.

The vast majority of the _actual_ usage and progress is coming from porn-y stuff, and the investment occurs in one-off stunts.

That split of effort, and lack of engineering rigor, is stunting progress overall.

Microsoft has LLaMa-2 ONNX available on GitHub[1]. There are budding but very small projects in different languages to wrap ONNX. Once there's a genuine cross-platform[2] ONNX wrapper that makes running LLaMa-2 easy, there will be a step change. It'll be "free"[3] to run your fine-tuned model that does as well as GPT-4.

It's not clear to me exactly when this will occur. It's "difficult" now, but only because the _actual usage_ in the local LLM community doesn't have a reason to invest in ONNX, and it's extremely intimidating to figure out how exactly to get LLaMa-2 running in ONNX. Microsoft kinda threw it up on GitHub and moved on; the sample code even still needs a PyTorch model. I see at least one very small company on HuggingFace that _may_ have figured out full ONNX.
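To make the goal concrete, here's a rough sketch of the kind of wrapper footnote [2] describes, built on onnxruntime with plain greedy decoding and no KV cache. The file names and the tensor names ("input_ids", "logits") are assumptions on my part; Microsoft's export in [1] has its own graph inputs, so check the repo's sample code before taking any of this literally:

```python
# Hedged sketch: greedy decoding against an assumed LLaMa-2 ONNX export.
# Tensor names, file names, and the single-input graph shape are assumptions.
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # placeholder path
session = ort.InferenceSession("llama2.onnx", providers=["CPUExecutionProvider"])

def reply(prompt: str, max_new_tokens: int = 64) -> str:
    prompt_ids = [sp.bos_id()] + sp.encode(prompt)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Re-run the whole sequence each step (no KV cache, so it's slow).
        logits = session.run(
            ["logits"], {"input_ids": np.array([ids], dtype=np.int64)}
        )[0]
        next_id = int(np.argmax(logits[0, -1]))  # greedy choice of next token
        if next_id == sp.eos_id():
            break
        ids.append(next_id)
    return sp.decode(ids[len(prompt_ids):])

print(reply("What's the weather in Milwaukee?"))
```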

Funnily enough, ONNX has gotten a spike in mindshare over the last month in the _Stable Diffusion_ community. There's decent cross-pollination between local art and local LLMs (e.g. LoRAs were first a thing for Stable Diffusion), so I'm hoping we see this sooner rather than later.

[1] https://github.com/microsoft/Llama-2-Onnx

[2] Definition of cross-platform matters a ton here: what I mean is "I can import $ONNX_WRAPPER_LIB on iOS / Android / Mac / Windows and call Llama2.reply(String prompt, ...)"

[3] Runs on somebody else's computer, where "somebody else" is the user, instead of a cloud vendor.

3. homarp+WX[view] [source] 2023-09-12 20:35:50
>>refulg+ms
you already have TVM for the cross platform stuff

see https://tvm.apache.org/docs/how_to/deploy/android.html

or https://octoml.ai/blog/using-swift-and-apache-tvm-to-develop...

or https://github.com/mlc-ai/mlc-llm
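For illustration, mlc-llm's Python bindings expose roughly this shape; the class name, argument names, and the compiled-artifact string below are from memory and may well have changed, so treat it as a sketch and check the repo's docs. The same compiled model artifact is what their iOS/Android demos load:

```python
# Sketch assuming mlc-llm's Python ChatModule interface; names are assumptions.
from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-13b-chat-hf-q4f16_1")  # placeholder artifact name
print(cm.generate(prompt="Rate the sentiment of this text from 0-10: 'I love it.'"))
```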

4. refulg+PB1[view] [source] 2023-09-12 23:53:30
>>homarp+WX
My deepest thanks, I owe you one. I overlooked this completely, and spent dozens of hours learning way too much, only to still fall short of understanding how to make it work in ONNX.