zlacker

I thought Llama was opensource/free and you could run it yourself?

replies(4): >>kuchen+d2 >>loudma+33 >>thewat+ue >>axpy90+wX

>>Muffin+(OP)
Compute costs money.

>>Muffin+(OP)
You can run the smaller Llama variants on consumer-grade hardware, but people typically rent GPUs from the cloud to run the larger variants. It is possible to run even the larger variants on a beefy workstation or gaming rig, but the performance on consumer hardware usually makes this impractical.

So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.

replies(1): >>ramesh+ac

>>loudma+33
>So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.

Yes, and it doesn't even come close. Llama2-70b can run inference at 300+ tokens/s on a single V100 instance at ~$0.50/hr. Anyone who can should be switching away from OpenAI right now.
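For anyone wanting to sanity-check that comparison, the arithmetic is just the hourly price divided by the tokens generated per hour. A minimal sketch in Python, taking the ~$0.50/hr and 300 tokens/s figures above as given (they are this comment's claim, not measured numbers); the function name is made up for illustration:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on a rented instance.

    Assumes the instance is fully utilised for the whole hour.
    """
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Figures claimed in the comment above, not benchmarks.
print(round(cost_per_million_tokens(0.50, 300), 2))  # 0.46
```

The result only holds at full utilisation; an instance that sits idle for most of the hour costs proportionally more per token.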

replies(2): >>thewat+tq >>chepts+5Rs

>>Muffin+(OP)
You (currently) need a GPU to run any of the useful models. I haven't really seen a business use case that runs it on the user's computer, and given the hardware requirements it wouldn't be very feasible to expect that.

So you'll have to figure out how to run/scale the model inference. Cloud GPU instances are generally very expensive, and once you start needing to horizontally scale it'll get messy fast.

At least at the moment it's expensive, especially if it's either very light usage or very intensive usage - you either need just a few seconds of compute occasionally, or lots of compute all the time requiring scaling.

The "lucky" ones in this scenario are small-medium businesses that can use one or a few cards on-site for their traffic. Even then, when you factor in the cost of an A100 plus maintaining it, etc., OpenAI's offering still looks attractive.

I know there are a few services that try to provide an API similar to OpenAI's, and some software to self-orchestrate it. I'm curious how those compare...

replies(1): >>hereon+kD

>>ramesh+ac
What's the best way to use Llama2-70b without existing infrastructure for orchestrating it?

replies(3): >>ramesh+us >>mjirv+WN >>pdntsp+Jb1

>>thewat+tq
>What's the best way to use Llama2-70b without existing infrastructure for orchestrating it?

That's an exercise left to the reader for now, and is where your value/moat lies.

replies(1): >>thewat+Tv

>>ramesh+us
> That's an exercise left to the reader for now, and is where your value/moat lies.

Hopefully more on-demand services enter the space. Where I am currently, we don't have the resources for any type of self-orchestration, and our use case is so low/sporadic that we can't simply have a dedicated instance.

Last I saw the current services were rather expensive but I should recheck.

>>thewat+ue
> once you start needing to horizontally scale it'll get messy fast.

It gets expensive fast, but not messy; these things scale horizontally really well. All the state is encapsulated in the request: no replication, synchronisation, or user data to worry about. I'd rather have the job of horizontally scaling llama2 than a relational database.

replies(1): >>thewat+4H

>>hereon+kD
For sure, and yeah, it wouldn't be terrible, you're right. You'd just need the API servers + a load balancer.

My thing is that doing that dynamically is still a lot compared to just calling a single endpoint where all of that is handled for you.

But for sure this is a very decent horizontal use-case.
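To make the "api servers + a load balancer" point concrete, here is a rough sketch of round-robin dispatch in Python. It only illustrates why stateless requests are easy to spread out; the backend URLs and function names are hypothetical, and a real deployment would use an actual load balancer rather than this:

```python
from itertools import cycle

# Hypothetical inference servers; each holds a full copy of the model.
BACKENDS = ["http://gpu-0:8000", "http://gpu-1:8000", "http://gpu-2:8000"]


def make_round_robin(backends):
    """Return a function that hands out backends in rotation."""
    rotation = cycle(backends)
    return lambda: next(rotation)


pick_backend = make_round_robin(BACKENDS)

# Requests carry all their state, so any server can handle any request.
for request_id in range(4):
    print(request_id, pick_backend())
```

Because nothing is shared between servers, adding capacity is just adding another entry to the list.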

>>thewat+tq
I stumbled upon OpenRouter[0] a few days ago. Easiest I’ve seen by far (if you want SaaS, not hosting it yourself).

[0] https://openrouter.ai

>>Muffin+(OP)
Unfortunately, Llama 2 is not released under a fully open source license.

>>thewat+tq
I bought an old server off ServerMonkey for like $700 with a stupid amount of RAM and CPUs and it runs Llama2-70b fine, if a little slowly. Good for experimenting.

>>ramesh+ac
How do you fit Llama2-70b into a V100? A V100 has 16GB (32GB in the larger variant). Llama2-70b at 4-bit would require up to 40GB. Also, what do you use for inference to get 300+ tokens/s?
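The 40GB figure can be sanity-checked with back-of-the-envelope arithmetic: parameters times bits per weight, divided by 8 to get bytes. A rough sketch in Python (weights only; the KV cache and activations add more on top, which is presumably where "up to 40GB" comes from; the function name is illustrative):

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


weights_gb = weight_memory_gb(70e9, 4)  # 70B parameters at 4-bit
print(weights_gb)                        # 35.0
print(weights_gb <= 16)                  # False: does not fit in 16GB
```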