I believe this because LLaMa-2 13B is more than good enough to handle what I call "quick search", e.g.:

```
User: "What's the weather in Milwaukee?"
System: Here's some docs, answer concisely in one sentence.
AI: It's 73 degrees Fahrenheit.
```
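To make that concrete, here's a minimal sketch of that quick-search flow using llama-cpp-python with a local 13B model. The model path and the retrieved "docs" string are placeholders; the retrieval step itself is out of scope.

```python
# Minimal "quick search" sketch with llama-cpp-python and a local 13B model.
# The model path and the doc snippet are placeholder assumptions.
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-13b-chat.q4_0.bin")  # any quantized local build

docs = "NWS Milwaukee, 3pm: 73F, clear, wind 5mph."  # pretend retrieval output
prompt = (
    f"Here's some docs: {docs}\n"
    "Answer concisely in one sentence.\n"
    "User: What's the weather in Milwaukee?\n"
    "AI:"
)
out = llm(prompt, max_tokens=32, stop=["\n"])
print(out["choices"][0]["text"].strip())  # e.g. "It's 73 degrees Fahrenheit."
```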
YMMV on cost still; it depends on the cloud vendor. My intuition agrees with yours: GPT-3.5 is priced low enough that there isn't a case where it makes sense to use another model. It strikes me now that there's a good reason for that intuition: OpenAI's $/GPU-hour is likely <= any other vendor's, and LLaMa 2's inference time is ~= GPT-3.5's.
I do think this will change with local LLMs. They've been way over-hyped for months, but after LLaMa 2, the challenges remaining are more sociological than technical.
For months now it's been one-off $LATEST_BUZZY_MODEL.c stunts that run on desktop.
The vast majority of the _actual_ usage and progress is coming from porn-y stuff, while the investment goes into one-off stunts.
That split of effort, and lack of engineering rigor, is stunting progress overall.
Microsoft has LLaMa-2 ONNX available on GitHub[1]. There are budding but very small projects in different languages to wrap ONNX. Once there's a genuine cross-platform[2] ONNX wrapper that makes running LLaMa-2 easy, there will be a step change. It'll be "free"[3] to run your fine-tuned model that does as well as GPT-4.
It's not clear to me exactly when this will occur. It's "difficult" now, but only because the _actual usage_ in the local LLM community doesn't have a reason to invest in ONNX, and it's extremely intimidating to figure out how exactly to get LLaMa-2 running in ONNX. Microsoft kinda threw it up on GitHub and moved on; the sample code still requires a PyTorch model. I see at least one very small company on HuggingFace that _may_ have figured out full ONNX.
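For a sense of what "running LLaMa-2 in ONNX" looks like at its simplest, here's a hedged sketch: greedy decoding with onnxruntime plus a tokenizer. The input name ("input_ids") and single-file model path are assumptions for illustration; the actual Microsoft export is split across files and uses past-key-value caching, so real code has to match its I/O exactly.

```python
# A hedged sketch of greedy decoding with onnxruntime -- NOT the exact I/O
# contract of microsoft/Llama-2-Onnx, whose export is split across files and
# expects past-key-value inputs. "input_ids" and the single .onnx file are
# assumptions for illustration.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer  # tokenizer only -- no PyTorch model

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # gated repo
sess = ort.InferenceSession("llama2-13b.onnx", providers=["CPUExecutionProvider"])

ids = tok("What's the weather in Milwaukee?", return_tensors="np")["input_ids"]
for _ in range(64):
    # No KV cache: re-runs the whole prefix each step. Simple, but O(n^2).
    logits = sess.run(None, {"input_ids": ids.astype(np.int64)})[0]
    next_id = int(logits[0, -1].argmax())  # greedy: highest-scoring token
    if next_id == tok.eos_token_id:
        break
    ids = np.concatenate([ids, [[next_id]]], axis=1)
print(tok.decode(ids[0], skip_special_tokens=True))
```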
Funnily enough, ONNX has gotten a spike in mindshare over the last month in the _Stable Diffusion_ community. There's decent cross-pollination between local art and local LLMs, ex. LoRAs were first a thing for Stable Diffusion. So I'm hoping we see this sooner rather than later.
[1] https://github.com/microsoft/Llama-2-Onnx
[2] Definition of cross-platform matters a ton here. What I mean is "I can import $ONNX_WRAPPER_LIB on iOS / Android / Mac / Windows and call Llama2.reply(String prompt, ...)" -- see the interface sketch at the end of this comment.
[3] Runs on somebody else's computer, where "somebody else" is the user, instead of a cloud vendor.
see https://tvm.apache.org/docs/how_to/deploy/android.html
or https://octoml.ai/blog/using-swift-and-apache-tvm-to-develop...
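And the interface sketch promised in [2] -- the names (LlamaWrapper, reply) are purely hypothetical, just pinning down the wrapper surface I mean:

```python
from typing import Protocol

class LlamaWrapper(Protocol):
    """Hypothetical cross-platform surface per [2]: the same import and the
    same one-call API on iOS / Android / Mac / Windows, model details hidden."""
    def reply(self, prompt: str, max_tokens: int = 256) -> str: ...
```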