zlacker

1. YetAno+(OP) 2023-09-12 19:46:33
No, they can't run it. Llama 70B with 4-bit quantization takes ~50 GB of VRAM for a decent context size. You need an A100, 2-3 V100s, or 4 3090s, which all cost roughly $3-5/h.
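The arithmetic, as a quick Python sketch (raw weight storage only; the ~50 GB figure adds KV cache and activation buffers on top):

    n_params = 70e9  # 70B parameters

    def weights_gb(bits_per_weight):
        # Raw weight storage only, ignoring quantization block overhead.
        return n_params * bits_per_weight / 8 / 1e9

    print(weights_gb(4))  # ~35 GB at 4-bit
    print(weights_gb(8))  # ~70 GB at 8-bit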
replies(1): >>ramesh+r1
2. ramesh+r1 2023-09-12 19:50:28
>>YetAno+(OP)
Wrong. I am running 8-bit GGML with 24 GB of VRAM on a single 4090 with 2048 context right now.
replies(1): >>YetAno+O1
3. YetAno+O1 2023-09-12 19:51:47
>>ramesh+r1
Which model? I am talking about 70B, as mentioned clearly. 70B at 8-bit is 70 GB just for the model weights. How many tokens/second are you getting with a single 4090?
replies(1): >>ramesh+X2
4. ramesh+X2 2023-09-12 19:55:38
>>YetAno+O1
Offloading 40% of the layers to CPU, about 50 t/s with 16 threads.
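A minimal way to reproduce this kind of split, sketched with the llama-cpp-python bindings (the filename is hypothetical; 48 GPU layers is one reading of "60% on GPU", since Llama 2 70B has 80 transformer layers):

    from llama_cpp import Llama  # pip install llama-cpp-python, built with CUDA

    llm = Llama(
        model_path="./llama-2-70b.Q8_0.gguf",  # hypothetical filename
        n_gpu_layers=48,  # ~60% of the 80 layers stay on the GPU
        n_ctx=2048,       # context size from upthread
        n_threads=16,     # CPU threads for the offloaded layers
    )

    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    print(out["choices"][0]["text"])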
replies(2): >>pocket+Eb >>jpdus+251
5. pocket+Eb 2023-09-12 20:24:37
>>ramesh+X2
That is more than an order of magnitude better than my experience; I get around 2 t/s with similar hardware. I had also seen others reporting similar figures to mine, so I assumed it was normal. Is there a secret to what you're doing?
replies(1): >>ramesh+JG
6. ramesh+JG 2023-09-12 22:42:09
>>pocket+Eb
>Is there a secret to what you're doing?

Core speed and memory bandwidth matter a lot. This is on a Ryzen 7950 with DDR5.
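Rough intuition for why bandwidth dominates, as a sketch (both numbers are assumptions, not measurements): each generated token streams the CPU-resident weights through RAM once, so the offloaded share caps throughput near bandwidth divided by offloaded bytes.

    offloaded_gb = 0.4 * 70  # ~40% of an 8-bit 70B model resident in RAM
    ddr5_bw_gb_s = 60        # order-of-magnitude dual-channel DDR5 bandwidth

    # Each token reads the offloaded weights once, bounding tokens/second.
    print(ddr5_bw_gb_s / offloaded_gb)  # ~2 t/s ceiling from RAM alone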

7. jpdus+251 2023-09-13 01:36:16
>>ramesh+X2
Care to share your detailed stack and the command you use to reach 50 t/s? I also have a 7950 with DDR5, and I don't even get 50 t/s on my two RTX 4090s...