Fine-tune your own Llama 2 to replace GPT-3.5/4

>>kcorbi+(OP)
For translation jobs, I've experimented with Llama 2 70B (running on Replicate) v/s GPT-3.5;

For about 1000 input tokens (and resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than Llama 2.

Llama 7B wasn't up to the task fyi, producing very poor translations.

I believe that OpenAI priced GPT-3.5 aggressively cheap in order to make it a non-brainer to rely on them rather than relying on other vendors (even open source models).

I'm curious to see if others have gotten different results?

>>ronyfa+wk
>For about 1000 input tokens (and resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than Llama 2.

You'll never get actual economics out of switching to open models without running your own hardware. That's the whole point. There's orders of magnitude difference in price, where a single V100/3090 instance can run llama2-70b inference for ~0.50$/hr.

>>ramesh+ND
No, they can't run it. llama 70 with 4 bit quantization takes ~50 GB VRAM for decent enough context size. You need A100, or 2-3 V100 or 4 3090 which all costs roughly roughly $3-5/h

>>YetAno+wJ
Wrong. I am running 8bit GGML with 24GB VRAM on a single 4090 with 2048 context right now

>>ramesh+XK
Which model? I am talking about 70b as mentioned clearly. 70b 8b is 70GB just for the model itself. How much token/second are you getting with single 4090?

>>YetAno+kL
Offloading 40% of layers to CPU, about 50t/s with 16 threads.

>>ramesh+tM
That is more than an order of magnitude better than my experience; I get around 2 t/s with similar hardware. I had also seen others reporting similar figures to mine so I assumed it was normal. Is there a secret to what you're doing?

>>pocket+aV
>Is there a secret to what you're doing?

Core speed and memory bandwidth matter a lot. This is on a Ryzen 7950 with DDR5.

zlacker