This isn’t a race to write the most lines of code or the most lines of text. It’s a race to write the most correct lines of code.
I’ll wait half an hour for a response if I know I’m getting at least staff-engineer-level code for every question.
Sufficiently accurate responses can be fed into other systems downstream and cleaned up. Even code responses can benefit from this by restricting output tokens using the grammar of the target language, or iterating until the code compiles successfully.
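To make the "iterate until it compiles" idea concrete, here is a minimal sketch, assuming a hypothetical generate() wrapper around whatever LLM you use; a real pipeline would invoke the actual compiler or test suite rather than just a syntax check:

```python
# Minimal sketch of the "iterate until it compiles" loop.
# generate() is a hypothetical placeholder, not a real API.
import ast

def generate(prompt: str) -> str:
    """Hypothetical placeholder: call your LLM and return a code string."""
    raise NotImplementedError

def code_until_valid(task: str, max_attempts: int = 3) -> str:
    prompt = f"Write Python code that does the following:\n{task}"
    for _ in range(max_attempts):
        code = generate(prompt)
        try:
            ast.parse(code)  # cheap syntax check; raises SyntaxError on bad code
            return code
        except SyntaxError as err:
            # Feed the error back so the next attempt can self-correct.
            prompt = (f"This code failed to parse:\n{code}\n"
                      f"Error: {err}\nReturn corrected code only.")
    raise RuntimeError(f"no valid code after {max_attempts} attempts")
```

Grammar-constrained decoding (e.g., llama.cpp's GBNF grammars) attacks the same problem at sampling time instead of after the fact.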
And for a decent number of LLM-enabled use cases, the functionality unlocked by these models is novel. When you're going from 0 to 1, people will just be amazed that the product exists.
But so far nobody is even in the same ballpark. And not just freely distributed models, but proprietary ones backed by big money, as well.
It really makes one wonder what kind of secret sauce OpenAI has. Surely it can't just be all that compute Microsoft bought for them, since Google could easily match that, and yet...
Also, all the evidence is in this thread: clearly people are unhappy about wasting time on LLMs, when the time wasted was the result of obviously bad output.
For lots of applications the speed/quality/price trade offs make a lot of sense.
For example, if you are doing vanilla question answering over lots of documents, then GPT-3.5 or Mixtral is better than GPT-4 because speed is important.
I love using the smaller models; Starling LM 7B and Mistral 7B have been enough for many tasks like the ones you mentioned.
For some advanced reasoning you're 100% right, but much of the time you're doing document conversion, summarization, or RAG, and in all of those cases GPT-3.5 performs as well as, if not better than, GPT-4 (we can't ignore cost and speed); it's very hard to distinguish between the two.
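For illustration, a summarization call of this kind might look like the sketch below, using the OpenAI Python SDK; the model name is the only knob to turn when comparing quality against cost and speed:

```python
# A sketch of the kind of call where GPT-3.5 and GPT-4 are hard to
# tell apart: plain summarization over a document.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(document: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,  # swap in "gpt-4" and diff the outputs yourself
        messages=[
            {"role": "system",
             "content": "Summarize the document in three bullet points."},
            {"role": "user", "content": document},
        ],
        temperature=0,  # keep outputs stable for side-by-side comparison
    )
    return response.choices[0].message.content
```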
I see how most people would prefer a better but slower model when price is equal, but I'm sure many prefer a worse $2/mo model over a better $20/mo model.
- price per use case matters (a lot)
- making sure that under no circumstances the information involved leaks (including being trained on) matters a lot in many use cases. OpenAI does by now have some support for this, but the degree to which you can enforce it is not enough for some use cases. In some cases this is a hard constraint due to legal regulations.
- geopolitics matters, sometimes. Being dependent on a US service is sometimes a no-go (using self-hosted US software is usually fine, though), even if you only operate in the EU.
- it's much easier to domain-adapt if the model's source/weights are accessible to a reasonable degree. GPT-4 has a fine-tuning API, but it's much, much less powerful, a direct consequence of GPT-4's highly proprietary nature (see the LoRA sketch after this list)
- a lot of companies are not happy at all about becoming highly reliant on a single service that can change at any time in how it behaves, in its pricing model, or in whether it's available in your country at all. So basing your product on a less powerful but replaceable or open-source AI can be a good idea, especially if you are based in a country not on the best terms with the US.
- do you trust Sam Altman at all? I do not, and it seems short-sighted to do so. In which case some of the points above become more relevant.
- 3.5-level quality, especially in combination with domain adaptation, can be "good enough" for some use cases
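On the domain-adaptation point: with weight-accessible models you can attach LoRA adapters directly, which GPT-4's fine-tuning API does not expose. A minimal sketch with Hugging Face peft; the model name and hyperparameters are illustrative, and the training loop itself is omitted:

```python
# Domain adaptation on an open-weights model via LoRA adapters (peft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # any open-weights causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in Mistral
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically under 1% of the base weights
```

Because only the small adapter is trained, this runs on modest hardware, and you can swap adapters per domain without touching the base model.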
Miqu is pretty good. Sure, it's a leak...but there's nothing special there. It's just a 70b llama2 finetune.
In LLMs it’s even worse. To make it concrete: for how I use LLMs, I will not only not pay for anything with less capability than GPT-4, I won’t even use it for free. It could be that other LLMs could perform well on narrow problems after fine-tuning, but even then I’d prefer the model with the highest metrics, not the lowest inference cost.
In reality you have to know the strengths and weaknesses of any tool, and small/fast LLM can do a tremendous amount within a fixed scope. The people at Mistral get this.
So the assertion that small models aren’t as good just isn’t correct. They are amazing at certain things, and they are dramatically faster and cheaper than larger models.
LLMs are not AGI; they are tools with specific uses that we are still discovering.
If you aren’t trying to optimize for accuracy to start with and are just saying “I’ll run the most expensive thing and assume it is better” with zero evaluation, you’re wasting money and time, and hurting the environment.
Also, I don’t even like running Mistral if I can avoid it; a lot of tasks can be done with a fine-tune of BERT or DistilBERT. It takes more work, but my custom BERT models way outperform GPT-4 on bounded tasks because I have highly curated training data.
Within specialized domains you just aren’t going to see GPT-4/5/6 performing on par with expert curated data.
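As a rough illustration of that workflow, a DistilBERT fine-tune on a bounded classification task is only a few lines with Hugging Face transformers; IMDB stands in here for the highly curated dataset described above:

```python
# Sketch: fine-tune a small encoder on a bounded classification task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")  # stand-in for your own curated data

def tokenize(batch):
    # Fixed-length padding keeps the default collator happy.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=dataset["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
```

The point isn’t the specific model: it’s that a curated dataset plus a small, cheap encoder can beat a general-purpose giant on a task this narrow.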