zlacker

[parent] [thread] 18 comments
1. sjwhev+(OP)[view] [source] 2024-02-14 06:10:42
What Mistral has, though, is speed, and with speed comes scale.
replies(2): >>spacem+B1 >>huyter+mk1
2. spacem+B1[view] [source] 2024-02-14 06:31:21
>>sjwhev+(OP)
Who cares about speed if you’re wrong?

This isn’t a race to write the most lines of code or the most lines of text. It’s a race to write the most correct lines of code.

I’ll wait half an hour for a response if I know I’m getting at least staff-engineer-level code for every question.

replies(4): >>popinm+i3 >>sjwhev+X9 >>ein0p+3c >>dathin+UB
◧◩
3. popinm+i3[view] [source] [discussion] 2024-02-14 06:52:08
>>spacem+B1
For the tasks my group is considering, even a 7B model is adequate.

Sufficiently accurate responses can be fed into other systems downstream and cleaned up. Even code responses can benefit from this by restricting output tokens using the grammar of the target language, or iterating until the code compiles successfully.

And for a decent number of LLM-enabled use cases the functionality unlocked by these models is novel. When you're going from 0 to 1 people will just be amazed that the product exists.
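
Rough sketch of what the "iterate until the code compiles" loop can look like (Python; "generate" is just a stand-in for whatever model call you use, not a specific API):

    # Minimal sketch of the "iterate until it compiles" idea above. `generate`
    # is a placeholder for whatever local model call you use, not a real API.
    from typing import Callable, Optional

    def code_until_it_compiles(generate: Callable[[str], str], task: str,
                               max_attempts: int = 5) -> Optional[str]:
        # Ask for code, syntax-check it with Python's own compiler, and feed the
        # error back into the prompt until it parses or we run out of attempts.
        prompt = f"Write a Python function for this task:\n{task}\nReturn only code."
        for _ in range(max_attempts):
            code = generate(prompt)
            try:
                compile(code, "<llm-output>", "exec")  # syntax check only, never executed
                return code
            except SyntaxError as err:
                prompt = (f"The previous attempt did not compile:\n{code}\n"
                          f"Error: {err}\nReturn corrected code only.")
        return None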

◧◩
4. sjwhev+X9[view] [source] [discussion] 2024-02-14 08:10:13
>>spacem+B1
Who says it’s wrong? I have very discrete tasks that involve resolving linguistic ambiguity, and small models can perform very well on them.
replies(1): >>mlnj+9j
◧◩
5. ein0p+3c[view] [source] [discussion] 2024-02-14 08:33:18
>>spacem+B1
That’s the correct answer. Years ago I worked on inference efficiency on edge hardware at a startup. Time after time I saw that users vastly prefer slower but more accurate and robust systems. Put succinctly: nobody cares how quick a model is if it doesn’t do a good job. Another thing I discovered is that it can be very difficult to convince software engineers of this obvious fact.
replies(3): >>spacec+Rg >>Al-Khw+Eo >>sjwhev+T62
◧◩◪
6. spacec+Rg[view] [source] [discussion] 2024-02-14 09:18:44
>>ein0p+3c
Having spent time on edge compute projects myself: this.

Also, the evidence is all over this thread: people are clearly unhappy about wasting time on LLMs, and the time they wasted was the result of obviously bad output.

replies(1): >>sjwhev+o62
◧◩◪
7. mlnj+9j[view] [source] [discussion] 2024-02-14 09:47:25
>>sjwhev+X9
Exactly. Not everything is throwing large chunks of text at a model to get complex questions answered.

I love using the smaller models; Starling LM 7B and Mistral 7B have been enough for many tasks like the ones you mentioned.

◧◩◪
8. Al-Khw+Eo[view] [source] [discussion] 2024-02-14 11:03:53
>>ein0p+3c
Less compute also means lower cost, though.

I see how most people would prefer a better but slower model when price is equal, but I'm sure many prefer a worse $2/mo model over a better $20/mo model.

replies(1): >>ein0p+bg1
◧◩
9. dathin+UB[view] [source] [discussion] 2024-02-14 13:08:44
>>spacem+B1
Who cares about getting better answers if you can't afford them, can't use them for legal reasons, or conclude that the risk of OpenAI now being a fully proprietary, US-based, service-only company is too high given the circumstances? (Depending on how things develop, US export restrictions on OpenAI, even GPT-4, are a very real possibility that companies can't ignore when making long-term product decisions.)
◧◩◪◨
10. ein0p+bg1[view] [source] [discussion] 2024-02-14 16:37:23
>>Al-Khw+Eo
That’s the thing I’m finding so hard to explain. Nobody would ever pay even $2 for a system that is worse at solving the problem. There is some baseline compute you need to deliver certain types of models. Going below that level for lower cost at the expense of accuracy and robustness is a fool’s errand.

In LLMs it’s even worse. To make it concrete: for how I use LLMs, I will not only not pay for anything with less capability than GPT-4, I won’t even use it for free. Other LLMs might perform well on narrow problems after fine-tuning, but even then I’d prefer the model with the highest metrics, not the lowest inference cost.

replies(1): >>sjwhev+A72
11. huyter+mk1[view] [source] 2024-02-14 16:58:47
>>sjwhev+(OP)
I’ll wait 5 seconds for the right code over 1 sec for bad code.
replies(1): >>sjwhev+U72
◧◩◪◨
12. sjwhev+o62[view] [source] [discussion] 2024-02-14 20:47:27
>>spacec+Rg
People think LLMs are all or nothing, like it’s either god-like AGI or it’s useless “hallucinating”.

In reality you have to know the strengths and weaknesses of any tool, and a small, fast LLM can do a tremendous amount within a fixed scope. The people at Mistral get this.

◧◩◪
13. sjwhev+T62[view] [source] [discussion] 2024-02-14 20:50:11
>>ein0p+3c
Yes, but for certain classes of problems small LLMs are highly performant - in many cases equal to GPT-4, which, sure, can do more things well, but adding 2+2 is gonna be 4 no matter what. You don’t need a tank to drive to the grocery store, just a small car with a trunk.

So the assertion that small models aren’t as good just isn’t correct. They are amazing at certain things, and they are vastly faster and cheaper than larger models.

◧◩◪◨⬒
14. sjwhev+A72[view] [source] [discussion] 2024-02-14 20:53:50
>>ein0p+bg1
So I think that’s a “your problem isn’t right for the tool” issue, not a “Mistral isn’t capable” issue.
replies(1): >>ein0p+rc2
◧◩
15. sjwhev+U72[view] [source] [discussion] 2024-02-14 20:55:26
>>huyter+mk1
Yes, but if a 7B LLM will give you the same “Hello World” as the 70B, and that’s literally all you need, using a bigger model is just burning energy for no reason at all.
replies(1): >>huyter+Hq2
◧◩◪◨⬒⬓
16. ein0p+rc2[view] [source] [discussion] 2024-02-14 21:14:30
>>sjwhev+A72
It isn’t capable unless you have a very specialized task and carefully fine-tune to solve just that task. GPT-4 covers a lot of ground out of the box. The best model I’ve seen so far on the FOSS side, Mixtral MoE, is less capable than even GPT-3.5. I often submit my requests to both Mixtral and GPT-4. If I’m problem solving (learning something, working with code, summarizing, working on my messaging), Mixtral is nearly always a waste of time in comparison.
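
For what it's worth, the side-by-side habit looks roughly like this: a sketch assuming the OpenAI Python client and a local OpenAI-compatible server (vLLM or llama.cpp server) for Mixtral; the base URL and model names are placeholders.

    # Send the same prompt to GPT-4 and a locally served Mixtral and compare.
    # The base_url and model names below are placeholders, not recommendations.
    from openai import OpenAI

    gpt4 = OpenAI()  # reads OPENAI_API_KEY from the environment
    mixtral = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    def ask_both(prompt: str) -> dict:
        # Returns each model's answer so they can be read side by side.
        answers = {}
        for name, client, model in [("gpt-4", gpt4, "gpt-4"),
                                    ("mixtral", mixtral, "mixtral-8x7b-instruct")]:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            answers[name] = resp.choices[0].message.content
        return answers

    print(ask_both("Summarize the trade-off between model size and latency."))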
replies(1): >>sjwhev+YF2
◧◩◪
17. huyter+Hq2[view] [source] [discussion] 2024-02-14 22:18:25
>>sjwhev+U72
The cost is fixed for me, at least at this point, so why would I choose the inferior version?
replies(1): >>sjwhev+NE2
◧◩◪◨
18. sjwhev+NE2[view] [source] [discussion] 2024-02-14 23:51:49
>>huyter+Hq2
It’s not fixed whatsoever. Mistral 7B runs on a MacBook Air, and it’s free. Zero cost LLM, no network latency.
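
For anyone curious what "runs on a MacBook Air" looks like in practice, a minimal sketch assuming llama-cpp-python and an already-downloaded quantized GGUF file (the path and settings are placeholders):

    # Run Mistral 7B locally: no API key, no network round trip.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,        # context window
        n_gpu_layers=-1,   # offload all layers to Metal/GPU when available
    )

    resp = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain tail recursion in two sentences."}],
        max_tokens=128,
    )
    print(resp["choices"][0]["message"]["content"])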
◧◩◪◨⬒⬓⬔
19. sjwhev+YF2[view] [source] [discussion] 2024-02-14 23:59:51
>>ein0p+rc2
Again, that’s precisely what I’m saying. A bounded task is best executed against the smallest possible model at the greatest possible speed. This is true for business factors ($$$) as well as environmental (smaller model -> less carbon).

LLMs are not AGI; they are tools with specific uses we are still discovering.

If you aren’t trying to optimize your accuracy to start with, and are just saying “I’ll run the most expensive thing and assume it is better” with zero evaluation, you’re wasting money and time, and hurting the environment.

Also, I don’t even like running Mistral if I can avoid it - a lot of tasks can be done with a fine-tune of BERT or DistilBERT. It takes more work, but my custom BERT models way outperform GPT-4 on bounded tasks because I have highly curated training data.

Within specialized domains you just aren’t going to see GPT-4/5/6 performing on par with expert-curated data.
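
To make the BERT point concrete, a minimal fine-tuning sketch with Hugging Face transformers/datasets; the CSV files, label count, and hyperparameters are placeholders standing in for your own curated data.

    # Fine-tune DistilBERT on a small curated classification set.
    # Expects train.csv / test.csv with "text" and "label" columns (placeholders).
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)  # placeholder label count

    ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                              max_length=128), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=ds["train"],
        eval_dataset=ds["test"],
    )
    trainer.train()
    print(trainer.evaluate())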

[go to top]