zlacker

[parent] [thread] 6 comments
1. moonch+(OP)[view] [source] 2023-09-12 17:45:01
We are talking about 7B models? Those can run on consumer GPUs with lower latency than A100s, AFAIK (because gaming GPUs are clocked differently).

Not to mention OpenAI has shit latency and terrible reliability - you should be using the Azure-hosted models if you care about that - though pricing is also higher.

I would say fixed costs and development time are on OpenAI's side, but I've seen people post great practical comparisons for latency and cost using hosted fine-tuned small models.
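
If you want to check the latency side yourself, here's a rough sketch of the kind of harness I mean (Python, assuming the openai client; the model name and endpoint are placeholders - point it at OpenAI or at whatever is serving your fine-tuned 7B, since many hosted endpoints speak the same API):

    import time, statistics
    import openai  # assumes the openai Python package; reads OPENAI_API_KEY from the env

    # Placeholders: swap in your own hosted fine-tune / OpenAI-compatible endpoint.
    # openai.api_base = "https://your-inference-host.example/v1"
    MODEL = "gpt-3.5-turbo"

    latencies = []
    for _ in range(20):
        t0 = time.perf_counter()
        openai.ChatCompletion.create(
            model=MODEL,
            messages=[{"role": "user", "content": "Say hello in five words."}],
            max_tokens=32,
        )
        latencies.append(time.perf_counter() - t0)

    print(f"mean {statistics.mean(latencies):.2f}s")
    print(f"p95  {statistics.quantiles(latencies, n=20)[-1]:.2f}s")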

replies(2): >>minima+L >>7spete+z7
2. minima+L[view] [source] 2023-09-12 17:47:35
>>moonch+(OP)
"Running" and "acceptable inference speed and quality" are two different constraints, particularly at scale/production.
replies(1): >>moonch+0X
3. 7spete+z7[view] [source] 2023-09-12 18:21:41
>>moonch+(OP)
When you say it can run on consumer GPUs, do you mean pretty much just the 4090/3090, or can it run on lesser cards?
replies(2): >>halfli+ng >>gsuuon+Fu
◧◩
4. halfli+ng[view] [source] [discussion] 2023-09-12 18:46:59
>>7spete+z7
I was able to run the 4-bit quantized LLaMA 2 7B on a 2070 Super, though latency was so-so.

I was surprised by how fast it runs on an M2 MBP + llama.cpp; way, way faster than ChatGPT, and that's not even using the Apple Neural Engine.
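
For reference, this is roughly all it takes via the llama-cpp-python bindings (the model filename is a placeholder for whatever 4-bit quantized 7B file you downloaded; n_gpu_layers controls how much gets offloaded to the 2070's VRAM or the M2's GPU):

    from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA or Metal)

    llm = Llama(
        model_path="llama-2-7b-chat.Q4_K_M.gguf",  # placeholder: any 4-bit quantized 7B
        n_gpu_layers=-1,  # offload every layer to the GPU; lower this if VRAM is tight
        n_ctx=2048,
    )

    out = llm("Q: Name three uses of a paperclip. A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])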

replies(1): >>hereon+jM
◧◩
5. gsuuon+Fu[view] [source] [discussion] 2023-09-12 19:32:52
>>7spete+z7
Quantized 7Bs can run comfortably in 8 GB of VRAM.
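
Rough arithmetic for why that works out (LLaMA-7B-ish dimensions, everything approximate):

    # Back-of-the-envelope VRAM for a 4-bit 7B (approximate, ignores quantization overhead)
    weights_gb = 7e9 * 0.5 / 1e9  # 4 bits = 0.5 bytes per param -> ~3.5 GB
    # fp16 KV cache, roughly: 2 (K and V) * 32 layers * 4096 hidden * 2048 ctx * 2 bytes
    kv_cache_gb = 2 * 32 * 4096 * 2048 * 2 / 1e9  # ~1.1 GB
    print(f"~{weights_gb + kv_cache_gb:.1f} GB total")  # ~4.6 GB, well under 8 GB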
◧◩◪
6. hereon+jM[view] [source] [discussion] 2023-09-12 20:32:22
>>halfli+ng
It runs fantastically well on an M2 Mac + llama.cpp; a whole range of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.

It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.

◧◩
7. moonch+0X[view] [source] [discussion] 2023-09-12 21:13:20
>>minima+L
I don't understand what you're trying to say?

From what I've read, a 4090 should blow an A100 away if you can fit within ~22 GB of VRAM, which a 7B model should do comfortably (even unquantized in fp16 it's only ~14 GB of weights).

And the latency (along with its variability and availability) on the OpenAI API is terrible because of the load they're getting.
