zlacker

1. kcorbi+(OP)[view] [source] 2023-09-12 18:39:03
Yes, if you're just using Llama 2 off the shelf (without fine-tuning), I don't think there are many workloads where it makes sense as a replacement for GPT-3.5. The one exception is organizations where data security is non-negotiable and they really need to host on-prem. The calculus changes drastically, though, when you bring fine-tuning in, which lets a much smaller model outperform a larger one on many classes of tasks.
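
To make that concrete: the usual recipe is parameter-efficient fine-tuning, e.g. LoRA via Hugging Face PEFT. A minimal sketch (the model choice and hyperparameters here are illustrative, not a recommendation):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Wrap a small base model with LoRA adapters so only a tiny
    # fraction of the weights are trained on your task data.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically <1% of params are trainable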

Also, it's worth noting that Replicate started out with a focus on image generation, and their current inference stack for LLMs is extremely inefficient. A significant fraction of the 100x cost difference you mentioned can be made up by using an optimized inference server like vLLM. Replicate knows about this and is working hard on improving their stack; it's just really early for all of us. :)
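
For a sense of what "optimized" means here, vLLM's offline API looks roughly like this (model name and parallelism settings are illustrative):

    from vllm import LLM, SamplingParams

    # vLLM gets most of its throughput win from continuous batching
    # and PagedAttention management of the KV cache.
    llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain KV caching in one paragraph."], params)
    print(outputs[0].outputs[0].text)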

replies(1): >>bfirsh+DJ
2. bfirsh+DJ[view] [source] 2023-09-12 21:13:49
>>kcorbi+(OP)
Founder of Replicate here. It's early indeed.

OpenAI aren't doing anything magical. We're optimizing Llama inference at the moment, and it looks like we'll be able to roughly match GPT-3.5's price for Llama 2 70B.

Running a fine-tuned GPT-3.5 is surprisingly expensive. That's where using Llama makes a ton of sense: once we've optimized inference, it'll be much cheaper to run a fine-tuned Llama.
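
Back-of-envelope with OpenAI's published September 2023 per-1K-token rates (the traffic mix below is an illustrative assumption):

    # USD per 1K tokens, as published by OpenAI in Sept 2023
    base_gpt35 = {"input": 0.0015, "output": 0.002}
    ft_gpt35   = {"input": 0.012,  "output": 0.016}

    def daily_cost(rates, input_tokens, output_tokens):
        return (rates["input"] * input_tokens
                + rates["output"] * output_tokens) / 1000

    # e.g. 1M input + 250K output tokens per day (illustrative workload)
    print(daily_cost(base_gpt35, 1_000_000, 250_000))  # $2.00
    print(daily_cost(ft_gpt35,   1_000_000, 250_000))  # $16.00, ~8x base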

replies(2): >>yixu34+1R1 >>Dowwie+Eu3
3. yixu34+1R1[view] [source] [discussion] 2023-09-13 06:05:28
>>bfirsh+DJ
We're working on LLM Engine (https://llm-engine.scale.com) at Scale, our open source, self-hostable framework for LLM inference and fine-tuning. Our findings are similar to Replicate's: Llama 2 70B can be comparable in price to GPT-3.5, etc. Would be great to discuss this further!
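
A quickstart-style sketch with the Python client (field names here follow the quickstart docs; check llm-engine.scale.com for the current API):

    from llmengine import Completion

    # Hosted usage; self-hosting points the client at your own cluster.
    response = Completion.create(
        model="llama-2-7b",
        prompt="Summarize why fine-tuned small models can win: ",
        max_new_tokens=100,
        temperature=0.2,
    )
    print(response.output.text)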
4. Dowwie+Eu3[view] [source] [discussion] 2023-09-13 17:14:34
>>bfirsh+DJ
How heavy of a lift is it to optimize inference?