zlacker

Fine-tune your own Llama 2 to replace GPT-3.5/4
1. ronyfa+wk | 2023-09-12 18:29:55
>>kcorbi+(OP)
For translation jobs, I've experimented with Llama 2 70B (running on Replicate) vs. GPT-3.5.

For about 1000 input tokens (and the resulting ~1000 output tokens), to my surprise, GPT-3.5 Turbo was 100x cheaper than Llama 2.
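
Back-of-envelope on the pricing (every number below is an assumption from my notes; check current rates before trusting it):

    # Rough cost sketch for one job: 1K input + 1K output tokens (all prices assumed)
    gpt35 = 1 * 0.0015 + 1 * 0.002        # assumed $0.0015/1K in, $0.002/1K out -> $0.0035
    # Replicate bills per second of GPU time; assume ~$0.0032/s for an A100 80GB
    # and ~10 tokens/s generation, so 1K output tokens take ~100 s
    llama = (1000 / 10) * 0.0032          # -> ~$0.32
    print(f"hosted Llama 2 / GPT-3.5 Turbo: ~{llama / gpt35:.0f}x")  # ~91x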

FYI, Llama 2 7B wasn't up to the task; it produced very poor translations.

I believe OpenAI priced GPT-3.5 aggressively cheap to make it a no-brainer to rely on them rather than on other vendors (or even open-source models).

I'm curious whether others have gotten different results.

2. ramesh+ND | 2023-09-12 19:27:18
>>ronyfa+wk
>For about 1000 input tokens (and the resulting ~1000 output tokens), to my surprise, GPT-3.5 Turbo was 100x cheaper than Llama 2.

You'll never get the real economics out of switching to open models without running your own hardware; that's the whole point. There's an orders-of-magnitude difference in price: a single V100/3090 instance can run llama2-70b inference for ~$0.50/hr.
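
To put numbers on that (the throughput figure is an assumption and swings a lot with batch size and quantization):

    # Cost per token when self-hosting at an assumed $0.50/hr GPU rate
    gpu_cost_per_hr = 0.50
    tokens_per_sec = 15                              # assumed llama2-70b throughput
    cost_per_1k = gpu_cost_per_hr / (tokens_per_sec * 3600) * 1000
    print(f"${cost_per_1k:.4f} per 1K tokens")       # ~$0.0093, far below hosted-API Llama rates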

3. YetAno+wJ | 2023-09-12 19:46:33
>>ramesh+ND
No, it can't. Llama 2 70B with 4-bit quantization takes ~50 GB of VRAM for a decent context size. You need an A100, 2-3 V100s, or 4 3090s, all of which cost roughly $3-5/hr.
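
The arithmetic, roughly (the overhead term is a guess):

    # VRAM estimate for 4-bit llama-2-70b
    weights_gb = 70e9 * 4.5 / 8 / 1e9            # ~39 GB at ~4.5 bits/weight incl. quant scales
    # KV cache: 80 layers, 8 KV heads (GQA), head dim 128, fp16, 4096-token context
    kv_gb = 2 * 80 * 8 * 128 * 2 * 4096 / 1e9    # ~1.3 GB
    overhead_gb = 5                              # assumed activations + runtime buffers
    print(weights_gb + kv_gb + overhead_gb)      # ~46 GB, in line with the ~50 GB figure
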
4. ramesh+XK | 2023-09-12 19:50:28
>>YetAno+wJ
Wrong. I'm running an 8-bit GGML quant with 24 GB of VRAM on a single 4090 with a 2048-token context right now.
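
For anyone wanting to reproduce this, a minimal sketch with llama-cpp-python (model path and layer count are placeholders; a q8_0 70B is ~70 GB, so it can't live entirely in 24 GB of VRAM and GGML runs the non-offloaded layers on CPU):

    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-2-70b.ggmlv3.q8_0.bin",  # hypothetical local file
        n_ctx=2048,          # the context size from this comment
        n_gpu_layers=35,     # assumption: raise/lower until it fits in 24 GB VRAM
    )
    out = llm("Translate to German: The weather is nice today.", max_tokens=64)
    print(out["choices"][0]["text"])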