zlacker

[parent] [thread] 5 comments
1. ramesh+(OP)[view] [source] 2023-09-12 19:50:28
Wrong. I am running an 8-bit GGML model in 24GB VRAM on a single 4090 with 2048 context right now
replies(1): >>YetAno+n
2. YetAno+n[view] [source] 2023-09-12 19:51:47
>>ramesh+(OP)
Which model? I am talking about 70B, as mentioned clearly. 70B at 8-bit is 70GB for the model weights alone. How many tokens/second are you getting with a single 4090?
replies(1): >>ramesh+w1
3. ramesh+w1[view] [source] [discussion] 2023-09-12 19:55:38
>>YetAno+n
Offloading 40% of the layers to CPU, about 50 t/s with 16 threads.
replies(2): >>pocket+da >>jpdus+B31
4. pocket+da[view] [source] [discussion] 2023-09-12 20:24:37
>>ramesh+w1
That is more than an order of magnitude better than my experience; I get around 2 t/s with similar hardware. I had also seen others reporting similar figures to mine so I assumed it was normal. Is there a secret to what you're doing?
replies(1): >>ramesh+iF
5. ramesh+iF[view] [source] [discussion] 2023-09-12 22:42:09
>>pocket+da
>Is there a secret to what you're doing?

Core speed and memory bandwidth matter a lot. This is on a Ryzen 7950 with DDR5.

6. jpdus+B31[view] [source] [discussion] 2023-09-13 01:36:16
>>ramesh+w1
Care to share your detailed stack and the command you use to reach 50 t/s? I also have a 7950 with DDR5 and I don't even get 50 t/s on my two RTX 4090s...
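
The discrepancy in this thread never gets resolved, but the bandwidth-bound arithmetic behind it is easy to sketch. A minimal estimate (my addition; the 4090 VRAM and dual-channel DDR5 bandwidth figures are rough assumptions, not from the thread):

```python
# Rough upper bound on tokens/second for a partially offloaded model,
# assuming generation is memory-bandwidth-bound: every weight byte is
# read once per token, so time/token = bytes_on_device / device_bandwidth.

def est_tokens_per_sec(model_gb, gpu_frac, gpu_bw_gbs, cpu_bw_gbs):
    """Estimate t/s when gpu_frac of the weights sit in VRAM."""
    gpu_time = model_gb * gpu_frac / gpu_bw_gbs        # seconds per token, GPU share
    cpu_time = model_gb * (1 - gpu_frac) / cpu_bw_gbs  # seconds per token, CPU share
    return 1.0 / (gpu_time + cpu_time)

# Assumed figures: 70B at 8-bit ~ 70 GB of weights, ~60% in a 4090's VRAM
# (~1000 GB/s), the remaining ~40% in dual-channel DDR5 (~80 GB/s).
tps = est_tokens_per_sec(model_gb=70, gpu_frac=0.6, gpu_bw_gbs=1000, cpu_bw_gbs=80)
print(f"{tps:.1f} t/s")
```

Under these assumptions the CPU share dominates, and the estimate lands near the ~2 t/s reported in comment 4; reaching 50 t/s would require the full model to fit in fast memory, which suggests the 50 t/s figure refers to a smaller model than 70B.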