For about 1,000 input tokens (and the resulting ~1,000 output tokens), to my surprise, GPT-3.5 Turbo was roughly 100x cheaper than Llama 2.
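For concreteness, here is the back-of-the-envelope arithmetic (the GPT-3.5 Turbo numbers are the mid-2023 list prices; the Llama 2 rate is a placeholder assumption, so plug in whatever your hosted or amortized self-hosting rate works out to):

```python
# Rough cost per request: ~1,000 input + ~1,000 output tokens.
# Prices are USD per 1K tokens. GPT-3.5 Turbo values are the mid-2023
# list prices; the Llama 2 value is a placeholder assumption.
PRICES = {
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
    "llama-2-hosted": {"input": 0.20, "output": 0.20},  # placeholder
}

def cost_per_request(model: str, input_tokens: int = 1000, output_tokens: int = 1000) -> float:
    p = PRICES[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

for model in PRICES:
    print(f"{model}: ${cost_per_request(model):.4f} per request")

ratio = cost_per_request("llama-2-hosted") / cost_per_request("gpt-3.5-turbo")
print(f"ratio: {ratio:.0f}x")  # roughly 100x with these placeholder numbers
```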
Llama 7B wasn't up to the task, FYI; it produced very poor translations.
I believe that OpenAI priced GPT-3.5 aggressively cheap in order to make it a no-brainer to rely on them rather than on other vendors (or even open-source models).
I'm curious whether others have gotten different results.
I built two such systems after burning that much in a week on ChatGPT.
* Chenbro RM42300-F 4U rackmount server chassis (remove the air filter on the 120mm fan; put two decent 80mm exhaust fans at the rear).
* Two used air-cooled 3090s, about $650 a piece on eBay. Check the slot width and make sure everything will fit on your motherboard. Do a burn-in when you get them, because used GPUs can be hit or miss.
* 5950X CPU (overkill, I just had it).
* 128GB DDR4.
* Motherboard with the X570 chipset and dual PCIe x16 slots. These will bifurcate to x8 PCIe 4.0 lanes per GPU, which is enough bandwidth to push the GPUs to max IME (see the link-check sketch after this list).
* 1200W+ ATX power supply.
* An eBay "u.2 pcie 3.84TB" drive plus an adapter for an m.2 NVMe slot (again, what I had, and it's cheap).
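If you want to sanity-check the bifurcation, nvidia-smi can report each card's live PCIe link state. A minimal sketch, assuming the standard NVIDIA driver utilities are installed (note the link can downtrain to a lower gen at idle, so check under load):

```python
import csv
import io
import subprocess

# Query each GPU's name and current PCIe link generation/width.
# Under load, the 3090s should report gen 4 at width x8 with the
# X570 bifurcation setup described above.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for row in csv.reader(io.StringIO(out)):
    index, name, gen, width = (field.strip() for field in row)
    print(f"GPU {index} ({name}): PCIe gen {gen} x{width}")
```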
If you're going to really hammer the thing, I would power-limit the 3090s to 320W (down from 350W). The performance change isn't really noticeable, and it keeps temps in better shape.
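Setting the limit is just an nvidia-smi call per GPU (needs root); a sketch wrapping it in Python below. Persistence mode keeps the driver loaded so the setting sticks between jobs, but it still resets on reboot, so run this from a startup script or systemd unit:

```python
import subprocess

POWER_LIMIT_W = 320  # down from the 3090's 350 W default

# Enable persistence mode so per-GPU settings stick between jobs.
# The limit still resets on reboot, so re-run this at startup.
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)

# Cap each of the two 3090s (indices 0 and 1 assumed here).
for gpu_index in ("0", "1"):
    subprocess.run(
        ["nvidia-smi", "-i", gpu_index, "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
```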