zlacker

[parent] [thread] 6 comments
1. ramesh+(OP)[view] [source] 2023-09-12 19:30:14
>So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.

Yes, and it doesn't even come close. Llama2-70b can run inference at 300+ tokens/s on a single V100 instance at ~$0.50/hr. Anyone who can should be switching away from OpenAI right now.
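
Rough math, assuming those throughput and price figures hold (real numbers will depend heavily on batching and quantization):

    # cost per million tokens at 300 tok/s and $0.50/hr
    tokens_per_hour = 300 * 3600                      # 1,080,000 tokens/hr
    cost_per_million = 0.50 / tokens_per_hour * 1e6   # ~$0.46 per 1M tokens

Compare that against whatever you're currently paying OpenAI per million tokens for your use case.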

replies(2): >>thewat+je >>chepts+VEs
2. thewat+je[view] [source] 2023-09-12 20:16:10
>>ramesh+(OP)
What's the best way to use Llama2-70b without existing infrastructure for orchestrating it?
replies(3): >>ramesh+kg >>mjirv+MB >>pdntsp+zZ
◧◩
3. ramesh+kg[view] [source] [discussion] 2023-09-12 20:23:08
>>thewat+je
>What's the best way to use LLama2-70b without existing infrastructure for orchestrating it?

That's an exercise left to the reader for now, and is where your value/moat lies.

replies(1): >>thewat+Jj
◧◩◪
4. thewat+Jj[view] [source] [discussion] 2023-09-12 20:37:21
>>ramesh+kg
> That's an exercise left to the reader for now, and is where your value/moat lies.

Hopefully more on-demand services enter the space. Where I am, we don't currently have the resources for any kind of self-orchestration, and our usage is so low-volume and sporadic that we can't justify a dedicated instance.

Last I saw, the existing services were rather expensive, but I should recheck.

◧◩
5. mjirv+MB[view] [source] [discussion] 2023-09-12 21:50:45
>>thewat+je
I stumbled upon OpenRouter[0] a few days ago. Easiest I’ve seen by far (if you want SaaS, not hosting it yourself).

[0] https://openrouter.ai
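
It's an OpenAI-style API, so a minimal call looks roughly like this (the model id is from memory and the key name is just my placeholder; check their docs and model list):

    import os, requests

    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "meta-llama/llama-2-70b-chat",   # verify against the current model list
            "messages": [{"role": "user", "content": "Hello"}],
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])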

◧◩
6. pdntsp+zZ[view] [source] [discussion] 2023-09-13 00:13:32
>>thewat+je
I bought an old server off ServerMonkey for like $700 with a stupid amount of RAM and CPUs, and it runs Llama2-70b fine, if a little slowly. Good for experimenting.
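
If anyone wants to try the same thing, a CPU-only setup with a GGUF quant through llama-cpp-python is roughly this (file name and thread count are placeholders for whatever you download and whatever your box has):

    from llama_cpp import Llama

    # a 4-bit GGUF quant of Llama2-70b is ~40GB on disk and needs a similar amount of RAM
    llm = Llama(model_path="./llama-2-70b.Q4_K_M.gguf", n_ctx=2048, n_threads=16)
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    print(out["choices"][0]["text"])
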
7. chepts+VEs[view] [source] 2023-09-21 12:47:26
>>ramesh+(OP)
How do you fit Llama2-70b into a V100? A V100 is 16GB. Llama2-70b at 4-bit would require up to 40GB. Also, what do you use for inference to get 300+ tokens/s?
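
Rough weight-only math I'm going by (ignoring KV cache and activation overhead):

    params = 70e9
    fp16_gib = params * 2 / 2**30    # ~130 GiB
    int4_gib = params * 0.5 / 2**30  # ~33 GiB, plus KV cache / overhead

So even at 4-bit you're well past a single 16GB card.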
[go to top]