zlacker

[return to "Fine-tune your own Llama 2 to replace GPT-3.5/4"]
1. ronyfa+wk 2023-09-12 18:29:55
>>kcorbi+(OP)
For translation jobs, I've experimented with Llama 2 70B (running on Replicate) vs. GPT-3.5.

For about 1,000 input tokens (and roughly 1,000 resulting output tokens), to my surprise, GPT-3.5 Turbo was about 100x cheaper than Llama 2.
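
For a rough sense of where a gap like that comes from, here's a back-of-envelope sketch. Every number in it is an assumption for illustration (roughly late-2023 list prices and a guessed 70B token rate), not my actual bills, and the exact multiple swings a lot with throughput and billing granularity:

    # Back-of-envelope cost of one ~1K-token-in / ~1K-token-out translation job.
    # All prices and speeds below are assumptions, not measured figures.

    gpt35_in_per_1k = 0.0015    # USD per 1K input tokens (assumed)
    gpt35_out_per_1k = 0.0020   # USD per 1K output tokens (assumed)
    gpt35_cost = gpt35_in_per_1k + gpt35_out_per_1k       # ~$0.0035/job

    a100_per_sec = 0.0023       # USD per second of A100 time (assumed)
    tokens_per_sec = 10         # assumed 70B generation speed, single A100
    llama_cost = (1000 / tokens_per_sec) * a100_per_sec   # ~$0.23/job

    print(f"GPT-3.5: ${gpt35_cost:.4f}/job, Llama 2 70B: ${llama_cost:.2f}/job "
          f"({llama_cost / gpt35_cost:.0f}x)")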

FYI, Llama 2 7B wasn't up to the task; it produced very poor translations.

I believe OpenAI priced GPT-3.5 aggressively cheap to make it a no-brainer to rely on them rather than on other vendors (even open-source models).

I'm curious to hear whether others have gotten different results.

2. brucet+3D 2023-09-12 19:24:52
>>ronyfa+wk
TBH, Replicate is not a great way to run a 7B model beyond experimentation. You want a host with cheap consumer GPUs (like vast.ai), since the 4-bit VRAM requirements are so modest.
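
To make "modest requirements" concrete, here's a minimal sketch of loading Llama 2 7B in 4-bit with transformers + bitsandbytes; the model ID, prompt, and settings are just examples (the gated Meta repo also needs a Hugging Face access token):

    # Load Llama 2 7B quantized to 4-bit on a single consumer GPU.
    # Needs: pip install transformers accelerate bitsandbytes
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",  # fits in roughly 5-6 GB of VRAM in 4-bit
    )

    prompt = "Translate to French: The weather is nice today."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(output[0], skip_special_tokens=True))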

You either need a backend with good batching support (vLLM), or, if you don't need much throughput, an extremely low-end GPU or no GPU at all with ExLlama/llama.cpp.
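
If you go the vLLM route, batched offline inference is only a few lines; vLLM does continuous batching internally, which is where most of the throughput comes from. Model and sampling settings here are illustrative:

    # Batched generation with vLLM. Needs: pip install vllm
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # example model
    params = SamplingParams(temperature=0.2, max_tokens=256)

    prompts = [f"Translate to German: {s}" for s in ("Hello.", "How are you?")]
    for out in llm.generate(prompts, params):  # one call, batched internally
        print(out.outputs[0].text)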

OpenAI benefits from quantization/batching, optimized kernels, and very high utilization on their end, so the huge price gap vs. a default HF Transformers instance is understandable. But even then, you're probably right about their aggressive pricing.

As for quality, you need a Llama model fine-tuned for the target language (many already exist on Hugging Face), and possibly a custom grammar if your backend supports it.
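
On the grammar point, here's a hypothetical sketch using llama-cpp-python, which exposes llama.cpp's GBNF grammars, to force a bare single-line translation with no chatty preamble; the grammar, model path, and prompt format are all made up for illustration:

    # Constrain output with a llama.cpp GBNF grammar.
    # Needs: pip install llama-cpp-python
    from llama_cpp import Llama, LlamaGrammar

    # One non-empty line followed by a newline: no preamble, no explanation.
    grammar = LlamaGrammar.from_string(r'root ::= [^\n]+ "\n"')

    llm = Llama(model_path="./llama-2-7b-translate.Q4_K_M.gguf")  # assumed local fine-tune
    out = llm(
        "Translate to Spanish: Good morning.\nTranslation:",
        grammar=grammar,
        max_tokens=128,
    )
    print(out["choices"][0]["text"].strip())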
