zlacker

1. 7spete+(OP)[view] [source] 2023-09-12 18:21:41
When you say it can run on consumer GPUs, do you mean pretty much just the 4090/3090, or can it run on lesser cards?
replies(2): >>halfli+O8 >>gsuuon+6n
2. halfli+O8[view] [source] 2023-09-12 18:46:59
>>7spete+(OP)
I was able to run the 4-bit quantized LLaMA 2 7B on a 2070 Super, though latency was so-so.
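
For reference, a minimal sketch of one common way to do this with transformers + bitsandbytes 4-bit loading (not necessarily the exact setup I used; the model ID and prompt are illustrative, and the gated meta-llama repo needs access approval):

    # Sketch: 4-bit quantized Llama 2 7B on a single consumer GPU.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # requires approved access on Hugging Face
    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant, device_map="auto"
    )

    inputs = tok("Consumer GPUs can run LLMs because", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))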

I was surprised by how fast it runs on an M2 MBP + llama.cpp; way, way faster than ChatGPT, and that's not even using the Apple Neural Engine.
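
For anyone on Apple Silicon who wants to try the same, a rough sketch using the llama-cpp-python bindings with Metal offload; the GGUF path is a placeholder for whatever quantized model file you have:

    # Sketch: quantized model through llama.cpp's Python bindings on Apple Silicon.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-7b.Q4_K_M.gguf",  # placeholder: any quantized GGUF file
        n_gpu_layers=-1,                        # offload all layers to the GPU (Metal)
        n_ctx=2048,
    )

    result = llm("Q: Why is llama.cpp fast on Apple Silicon? A:", max_tokens=64)
    print(result["choices"][0]["text"])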

replies(1): >>hereon+KE
3. gsuuon+6n[view] [source] 2023-09-12 19:32:52
>>7spete+(OP)
Quantized 7Bs can comfortably run in 8 GB of VRAM.
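
Rough arithmetic behind that, with the overhead figures as ballpark assumptions rather than measurements:

    # Back-of-envelope VRAM estimate for a 4-bit quantized 7B model.
    params = 7e9                  # ~7 billion weights
    bits_per_weight = 4.5         # ~4 bits plus quantization scales/zero-points (assumption)
    weights_gb = params * bits_per_weight / 8 / 1e9   # ~3.9 GB
    kv_and_activations_gb = 1.0   # KV cache + activations at modest context (rough guess)
    print(f"total ~ {weights_gb + kv_and_activations_gb:.1f} GB")  # ~4.9 GB, well under 8 GB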
4. hereon+KE[view] [source] [discussion] 2023-09-12 20:32:22
>>halfli+O8
It runs fantastically well on an M2 Mac + llama.cpp; a variety of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.

It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.
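
For a rough sense of what "fast enough" means in numbers, a quick tokens-per-second check with the llama-cpp-python bindings (model path, prompt, and settings are placeholders, and results will vary by machine):

    # Sketch: rough tokens/sec measurement with llama-cpp-python.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=2048)

    start = time.time()
    out = llm("Write a short note about unified memory:", max_tokens=128)
    elapsed = time.time() - start

    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")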
