This isn’t a race to write the most lines of code or the most lines of text. It’s a race to write the most correct lines of code.
I’ll wait half an hour for a response if I know I’m getting at least staff-engineer-level code for every question.
Sufficiently accurate responses can be fed into other systems downstream and cleaned up. Even code responses can benefit from this by restricting output tokens using the grammar of the target language, or iterating until the code compiles successfully.
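To make the "iterate until it compiles" idea concrete, here is a minimal sketch, assuming a hypothetical generate() wrapper around whatever LLM you use; a real pipeline would invoke the actual compiler or test suite rather than just a syntax check:

```python
# Minimal sketch of the "iterate until it compiles" loop.
# generate() is a hypothetical placeholder, not a real API.
import ast

def generate(prompt: str) -> str:
    """Hypothetical placeholder: call your LLM and return a code string."""
    raise NotImplementedError

def code_until_valid(task: str, max_attempts: int = 3) -> str:
    prompt = f"Write Python code that does the following:\n{task}"
    for _ in range(max_attempts):
        code = generate(prompt)
        try:
            ast.parse(code)  # cheap syntax check; raises SyntaxError on bad code
            return code
        except SyntaxError as err:
            # Feed the error back so the next attempt can self-correct.
            prompt = (f"This code failed to parse:\n{code}\n"
                      f"Error: {err}\nReturn corrected code only.")
    raise RuntimeError(f"no valid code after {max_attempts} attempts")
```

Grammar-constrained decoding (e.g., llama.cpp's GBNF grammars) attacks the same problem at sampling time instead of after the fact.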
And for a decent number of LLM-enabled use cases, the functionality unlocked by these models is novel. When you're going from 0 to 1, people will just be amazed that the product exists.
But so far nobody is even in the same ballpark. And not just freely distributed models, but proprietary ones backed by big money, as well.
It really makes one wonder what kind of secret sauce OpenAI has. Surely it can't just be all that compute Microsoft bought for them, since Google could easily match that, and yet...
Also, all the evidence is in this thread: clearly people are unhappy about wasting time on LLMs, when the time wasted was the result of obviously bad output.
For lots of applications the speed/quality/price trade offs make a lot of sense.
For example, if you are doing vanilla question answering over lots of documents, then GPT-3.5 or Mixtral is better than GPT-4 because speed is important.
I love using the smaller models; Starling LM 7B and Mistral 7B have been enough for many tasks like the ones you mentioned.
For some advanced reasoning you're 100% right, but much of the time you're doing document conversion, summarization, or RAG, and in all of those cases GPT-3.5 performs as well as, if not better than, GPT-4 (we can't ignore cost and speed); it's very hard to distinguish between the two.
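For illustration, a summarization call of this kind might look like the sketch below, using the OpenAI Python SDK; the model name is the only knob to turn when comparing quality against cost and speed:

```python
# A sketch of the kind of call where GPT-3.5 and GPT-4 are hard to
# tell apart: plain summarization over a document.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(document: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,  # swap in "gpt-4" and diff the outputs yourself
        messages=[
            {"role": "system",
             "content": "Summarize the document in three bullet points."},
            {"role": "user", "content": document},
        ],
        temperature=0,  # keep outputs stable for side-by-side comparison
    )
    return response.choices[0].message.content
```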
I see how most people would prefer a better but slower model when price is equal, but I'm sure many prefer a worse $2/mo model over a better $20/mo model.
- price per use case matters (a lot)
- making sure that under no circumstances the information involved leaks (including being trained on) matters a lot in many use cases. OpenAI does by now have some support for this, but the degree to which you can enforce it is not enough for some use cases. In some cases this is a hard constraint due to legal regulations.
- geopolitics matters, sometimes. Being dependent on a US service is sometimes a no-go (using self-hosted US software is usually fine, though), even if you only operate in the EU.
- it's much easier to domain-adapt if the model's source/weights are accessible to a reasonable degree. GPT-4 has a fine-tuning API, but it's much, much less powerful, a direct consequence of GPT-4's highly proprietary nature (see the LoRA sketch after this list)
- a lot of companies are not happy at all about becoming highly reliant on a single service that can change at any time in how it behaves, in its pricing model, or in whether it's available in your country at all. So basing your product on a less powerful but replaceable or open-source AI can be a good idea, especially if you are based in a country not on the best terms with the US.
- do you trust Sam Altman at all? I do not, and it seems short-sighted to do so. In which case some of the points above become more relevant.
- 3.5-level quality, especially in combination with domain adaptation, can be "good enough" for some use cases
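On the domain-adaptation point: with weight-accessible models you can attach LoRA adapters directly, which GPT-4's fine-tuning API does not expose. A minimal sketch with Hugging Face peft; the model name and hyperparameters are illustrative, and the training loop itself is omitted:

```python
# Domain adaptation on an open-weights model via LoRA adapters (peft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # any open-weights causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in Mistral
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically under 1% of the base weights
```

Because only the small adapter is trained, this runs on modest hardware, and you can swap adapters per domain without touching the base model.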
Miqu is pretty good. Sure, it's a leak...but there's nothing special there. It's just a 70b llama2 finetune.
In LLMs it’s even worse. To make it concrete: for how I use LLMs, I will not only not pay for anything with less capability than GPT-4, I won’t even use it for free. It could be that other LLMs could perform well on narrow problems after fine-tuning, but even then I’d prefer the model with the highest metrics, not the lowest inference cost.
In reality you have to know the strengths and weaknesses of any tool, and small/fast LLM can do a tremendous amount within a fixed scope. The people at Mistral get this.
So the assertion that small models aren’t as good just isn’t correct. They are amazing at certain things, and they are dramatically faster and cheaper than larger models.
LLMs are not AGI; they are tools with specific uses that we are still discovering.
If you aren’t trying to optimize for accuracy to start with and are just saying “I’ll run the most expensive thing and assume it is better” with zero evaluation, you’re wasting money and time, and hurting the environment.
Also, I don’t even like running Mistral if I can avoid it; a lot of tasks can be done with a fine-tune of BERT or DistilBERT. It takes more work, but my custom BERT models way outperform GPT-4 on bounded tasks because I have highly curated training data.
Within specialized domains you just aren’t going to see GPT-4/5/6 performing on par with expert curated data.
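As a rough illustration of that workflow, a DistilBERT fine-tune on a bounded classification task is only a few lines with Hugging Face transformers; IMDB stands in here for the highly curated dataset described above:

```python
# Sketch: fine-tune a small encoder on a bounded classification task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")  # stand-in for your own curated data

def tokenize(batch):
    # Fixed-length padding keeps the default collator happy.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=dataset["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
```

The point isn’t the specific model: it’s that a curated dataset plus a small, cheap encoder can beat a general-purpose giant on a task this narrow.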