zlacker

[parent] [thread] 8 comments
1. ramesh+(OP)[view] [source] 2023-09-12 19:27:18
>For about 1000 input tokens (and resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than Llama 2.

You'll never make the economics work when switching to open models unless you run your own hardware. That's the whole point. There's an orders-of-magnitude difference in price: a single V100/3090 instance can run llama2-70b inference for ~$0.50/hr.
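Back-of-the-envelope, with placeholder numbers (the throughput and prices below are assumptions for illustration, not benchmarks):

    # Rough $/1K-token comparison: self-hosted GPU vs. a metered API.
    # All numbers here are illustrative assumptions.
    gpu_price_per_hr = 0.50   # assumed single-GPU instance price, $/hr
    tokens_per_sec = 20.0     # assumed sustained generation throughput

    self_hosted_per_1k = gpu_price_per_hr / (tokens_per_sec * 3600 / 1000)
    print(f"self-hosted: ${self_hosted_per_1k:.4f} / 1K tokens")

    # Break-even throughput against an assumed API price:
    api_per_1k = 0.002        # assumed API price, $/1K tokens
    print(f"break-even: {gpu_price_per_hr / api_per_1k * 1000 / 3600:.0f} t/s")

Above the break-even throughput the self-hosted box is cheaper per token; below it the API wins. Batching pushes effective throughput up a lot.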

replies(1): >>YetAno+J5
2. YetAno+J5[view] [source] 2023-09-12 19:46:33
>>ramesh+(OP)
No, they can't run it. Llama 2 70B with 4-bit quantization takes ~50 GB of VRAM for a decent context size. You need an A100, or 2-3 V100s, or 4 3090s, which all cost roughly $3-5/hr.
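If you want to sanity-check the memory math, here's a rough estimator (the 70B shape constants are from the public Llama 2 model card; this ignores runtime scratch buffers, which add more on top):

    def vram_gb(params_b, bits, n_layers, n_kv_heads, head_dim, ctx):
        # Quantized weights: params (in billions) * bits / 8 -> gigabytes.
        weights = params_b * bits / 8
        # fp16 KV cache: K and V per layer, per KV head, per position.
        kv = 2 * n_layers * n_kv_heads * head_dim * ctx * 2 / 1e9
        return weights + kv

    # Llama 2 70B: 80 layers, 8 KV heads (GQA), head_dim 128.
    print(vram_gb(70, 4, 80, 8, 128, 4096))   # ~36 GB before overhead
    print(vram_gb(70, 8, 80, 8, 128, 4096))   # ~71 GB before overhead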
replies(1): >>ramesh+a7
3. ramesh+a7[view] [source] [discussion] 2023-09-12 19:50:28
>>YetAno+J5
Wrong. I am running 8-bit GGML with 24 GB VRAM on a single 4090 with 2048 context right now.
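For anyone who wants to try the same kind of setup, a minimal sketch with the llama-cpp-python bindings (the model path and layer count are placeholders, not a real config, and the wheel has to be built with GPU support for offload to do anything):

    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/70b.q8_0.bin",  # placeholder path to a GGML file
        n_ctx=2048,                          # context size as above
        n_gpu_layers=40,                     # layers kept in VRAM; tune to fit 24 GB
    )
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    print(out["choices"][0]["text"])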
replies(1): >>YetAno+x7
4. YetAno+x7[view] [source] [discussion] 2023-09-12 19:51:47
>>ramesh+a7
Which model? I am talking about 70B, as clearly stated. 70B at 8-bit is 70 GB just for the model weights. How many tokens/second are you getting with a single 4090?
replies(1): >>ramesh+G8
5. ramesh+G8[view] [source] [discussion] 2023-09-12 19:55:38
>>YetAno+x7
Offloading 40% of layers to CPU, about 50 t/s with 16 threads.
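In llama-cpp-python terms that split looks something like this (assuming an 80-layer model; the exact numbers are whatever fits your VRAM):

    from llama_cpp import Llama

    n_layers = 80                    # Llama 2 70B has 80 transformer layers
    n_gpu = int(n_layers * 0.6)      # 60% on GPU = 40% offloaded to CPU

    llm = Llama(
        model_path="./models/70b.q8_0.bin",  # placeholder path
        n_ctx=2048,
        n_gpu_layers=n_gpu,                  # 48 layers in VRAM
        n_threads=16,                        # CPU threads for offloaded layers
    )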
replies(2): >>pocket+nh >>jpdus+La1
6. pocket+nh[view] [source] [discussion] 2023-09-12 20:24:37
>>ramesh+G8
That is more than an order of magnitude better than my experience; I get around 2 t/s with similar hardware. I had also seen others report figures similar to mine, so I assumed that was normal. Is there a secret to what you're doing?
replies(1): >>ramesh+sM
7. ramesh+sM[view] [source] [discussion] 2023-09-12 22:42:09
>>pocket+nh
>Is there a secret to what you're doing?

Core speed and memory bandwidth matter a lot. This is on a Ryzen 7950 with DDR5.
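To make that concrete: single-stream generation is roughly memory-bound, because every weight sitting in RAM gets read once per token. A crude ceiling (the numbers are illustrative, not measurements):

    def cpu_tps_ceiling(cpu_resident_gb, mem_bw_gbs):
        # Tokens/sec upper bound if each CPU-resident byte is read per token.
        return mem_bw_gbs / cpu_resident_gb

    # e.g. ~3.5 GB of 4-bit weights on CPU, ~60 GB/s effective DDR5 bandwidth:
    print(cpu_tps_ceiling(3.5, 60.0))   # ~17 t/s ceiling

Faster cores help on the compute side, but bandwidth usually sets the limit.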

8. jpdus+La1[view] [source] [discussion] 2023-09-13 01:36:16
>>ramesh+G8
Care to share your detailed stack and command to reach 50 t/s? I also have a 7950 with DDR5, and I don't even get 50 t/s on my two RTX 4090s...