zlacker

We need a new word, not "local model" but "my own computers model" CapEx based

This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.

That's not my definition of local. Mine is "local network". so call it the "LAN model" until we come up with something better. "Self-host" exists but this usually means more "open-weights" as opposed to clamping the performance of the model.

It should be defined as ~sub-$10k, using Steve Jobs megapenny unit.

Essentially classify things as how many megapennies of spend a machine is that won't OOM on it.

That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that's at most single digit thousands of dollars. And if I was feeling fancy, could potentially fine-tune on the days scale.

A modern 5090 build-out with a threadripper, nvme, 256GB RAM, this will run you about 10k +/- 1k. The MLX route is about $6000 out the door after tax (m3-ultra 60 core with 256GB).

Lastly it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters and the active parameter count + quantization is becoming a poorer approximation given the SOTA innovations.

What might be needed is some standardized eval benchmark against standardized hardware classes with basic real world tasks like toolcalling, code generation, and document procesing. There's plenty of "good enough" models out there for a large category of every day tasks, now I want to find out what runs the best

Take a gen6 thinkpad P14s/macbook pro and a 5090/mac studio, run the benchmark and then we can say something like "time-to-first-token/token-per-second/memory-used/total-time-of-test" and rate this as independent from how accurate the model was.

replies(7): >>bigyab+x7 >>echelo+ua >>christ+Ac >>zozbot+Cf >>opencl+Mr >>mrklol+Aq1 >>estima+p42

>>kristo+(OP)
OOM is a pretty terrible benchmark too, though. You can build a DDR4 machine that "technically" loads 256gb models for maybe $1000 used, but then you've got to account for the compute aspect and that's constrained by a number of different variables. A super-sparse model might run great on that DDR4 machine, whereas a 32b model would cause it to chug.

There's just not a good way to visualize the compute needed, with all the nuance that exists. I think that trying to create these abstractions are what leads to people impulse buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.

replies(1): >>French+tc

>>kristo+(OP)
I don't even need "open weights" to run on hardware I own.

I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running.

I do not want my career to become dependent upon Anthropic.

Honestly, the best thing for "open" might be for us to build open pipes and services and models where we can rent cloud. Large models will outpace small models: LLMs, video models, "world" models, etc.

I'd even be fine time-sharing a running instance of a large model in a large cloud. As long as all the constituent pieces are open where I could (in theory) distill it, run it myself, spin up my own copy, etc.

I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones.

We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.

replies(1): >>Aurorn+un

>>bigyab+x7
> time-to-first-token/token-per-second/memory-used/total-time-of-test

Would it not help with the DDR4 example though if we had more "real world" tests?

replies(1): >>bigyab+ne

>>kristo+(OP)
I won't need a heater with that running in my room.

replies(2): >>hedora+jM >>wincy+xh1

>>French+tc
Maybe, but even that fourth-order metric is missing key performance details like context length and model size/sparsity.

The bigger takeaway (IMO) is that there will never really be hardware that scales like Claude or ChatGPT does. I love local AI, but it stresses the fundamental limits of on-device compute.

>>kristo+(OP)
You can run plenty of models on a $10K machine or even a lot less than that, it all depends how much you want to wait for results. Streaming weights from SSD storage using mmap() is already a reality when running the largest and sparsest models. You can save even more on memory by limiting KV caching at the cost of extra compute, and there may be ways to push RAM savings even higher simply by tweaking the extent to which model activations are recomputed as needed.

replies(1): >>kristo+Fi

>>zozbot+Cf
Yeah there's a lot of people that advocate for really slow inference on cheap infra. That's something else that should be expressed in this fidelity

Because honestly I don't care about 0.2 tps for my use cases although I've spoken with many who are fine with numbers like that.

At least the people I've talked to they talk about how if they have a very high confidence score that the model will succeed they don't mind the wait.

Essentially a task failure is 1 in 10, I want to monitor and retry.

If it's 1 in 1000, then I can walk away.

The reality is most people don't have a bearing on what this order of magnitude actually is for a given task. So unless you have high confidence in your confidence score, slow is useless

But sometimes you do...

replies(1): >>zozbot+jk

>>kristo+Fi
If you launch enough tasks in parallel you aren't going to care that 1 in 10 failed, as long as the other 9 are good. Just rerun the failed job whenever you get around to it, the infra will still be getting plenty of utilization on the rest.

>>echelo+ua
> I do not want my career to become dependent upon Anthropic

As someone who switches between Anthropic and ChatGPT depending on the month and has dabbled with other providers and some local LLMs, I think this fear is unfounded.

It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

replies(3): >>airstr+np >>echelo+4r >>mrklol+Oq1

>>Aurorn+un
right, but ChatGPT might not exist at some point, and if we don't force feed the open inference ecosystem and infrastructure back into the mouths of the AI devourer that is this hype cycle, we'll simply be accepting our inevitable, painful death

replies(2): >>christ+yu >>Aurorn+fU

>>Aurorn+un
> It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

We have two cell phone providers. Google is removing the ability to install binaries, and the other one has never allowed freedom. All computing is taxed, defaults are set to the incumbent monopolies. Searching, even for trademarks, is a forced bidding war. Businesses have to shed customer relationships, get poached on brand relationships, and jump through hoops week after week. The FTC/DOJ do nothing, and the EU hasn't done much either.

I can't even imagine what this will be like for engineering once this becomes necessary to do our jobs. We've been spoiled by not needing many tools - other industries, like medical or industrial research, tie their employment to a physical location and set of expensive industrial tools. You lose your job, you have to physically move - possibly to another state.

What happens when Anthropic and OpenAI ban you? Or decide to only sell to industry?

This is just the start - we're going to become more dependent upon these tools to the point we're serfs. We might have two choices, and that's demonstrably (with the current incumbency) not a good world.

Computing is quickly becoming a non-local phenomenon. Google and the platforms broke the dream of the open web. We're about to witness the death of the personal computer if we don't do anything about it.

replies(1): >>pseudo+IM1

>>kristo+(OP)
For context on what cloud API costs look like when running coding agents:

With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.

At 1K tasks/day that's ~$1.5K-3K/month in API spend.

The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.

Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.

replies(3): >>taneq+BB >>pstuar+fQ >>jychan+J01

>>airstr+np
If they die there will be so much hardware released to do other tasks.

replies(1): >>echelo+zw

>>christ+yu
Perhaps not tasks you get the opportunity to do.

Your job might be assigned to some other legal entity renting some other compute.

If this goes as according to some of their plans, we might all be out of the picture one day.

If these systems are closed, you might not get the opportunity to hire them yourself to build something you have ownership in. You might be cut out.

>>opencl+Mr
At this point isn’t the marginal cost based on power consumption? At 30c/kWh and with a beefy desktop pc pulling up to half a kW, that’s 15c/hr. For true zero marginal cost, maybe get solar panels. :P

replies(1): >>EGreg+zR

>>christ+Ac
This looks like it’ll run easily on a Strix Halo (180W TDP), and be a little sluggish on previous gen AMDs (80W TDP).

I can’t be bothered to check TDPs on 64GB macbooks, but none of these devices really count as space heaters.

>>opencl+Mr
Might there be a way to leverage local models just to help minimize the retries -- doing the tool calling handling and giving the agent "perfect execution"?

I'm a noob and am asking as wishful thinking.

replies(1): >>jermau+fl2

>>taneq+BB
This is an interesting question actually!

Marginal cost includes energy usage but also I burned out a MacBook GPU with vanity-eth last year so wear-and-tear is also a cost.

>>airstr+np
> right, but ChatGPT might not exist at some point

There are multiple frontier models to choose from.

They’re not all going to disappear.

replies(3): >>airstr+MZ >>hahajk+H71 >>Bukhma+Tb1

>>Aurorn+fU
right, and the less we rely on ChatGPT and Claude, the more we give power to "all other frontier models", which right now have very, very little market share

>>opencl+Mr
On the other hand, Deepseek V3.2 is $0.38 per million tokens output. And on openrouter, most providers serve it at 20 tokens/sec.

At 20t/s over 1 month, that's... $19something running literally 24/7. In reality it'd be cheaper than that.

I bet you'd burn more than $20 in electricity with a beefy machine that can run Deepseek.

The economics of batch>1 inference does not go in favor of consumers.

replies(1): >>selcuk+9p1

>>Aurorn+fU
the companies could merge or buy each other

>>Aurorn+fU
This seems absurdly naive to me with the path big tech has taken in the last 5 years. There’s literally infinite upside and almost no downside to constraining the ecosystem for the big players.

You don’t think that eventually Google/OpenAI are going to go to the government and say, “it’s really dangerous to have all these foreign/unreglated models being used everywhere could you please get rid of them?”. Suddenly they have an oligopoly on the market.

>>christ+Ac
Haha running OSS-120B on my 5090 with most of the layers in video memory, some in RAM with LM Studio, I was hard pressed to get it to actually use anywhere near the full 600W. Gaming in 4K playing a modern game generates substantially more sustained heat.

>>jychan+J01
> At 20t/s over 1 month, that's... $19something running literally 24/7.

You can run agents in parallel, but yeah, that's a fair comparison.

>>kristo+(OP)
I mean if it’s running in your lan, isn’t it local? :D

>>Aurorn+un
Because they make it easy. Imagine they limit their models to their tooling and suddenly it’s introducing work.

>>echelo+4r
I just don’t see it.

I mean, the long arch of computing history has had us wobble back and forth in regards to how closed down it all was, but it seems we are almost at a golden age again with respect to good enough (if not popular) hardware.

On the software front, we definitely swung back from the age of Microsoft. Sure, Linux is a lot more corporate than people admit, but it’s a lot more open than Microsoft’s offerings and it’s capable of running on practically everything except the smallest IOT device.

As for LLMs. I know people have hyped themselves up to think that if you aren’t chasing the latest LLM release and running swarms of agents, you are next in the queues for the soup kitchens, but again, I don’t see why it HAS to play out that way, partly because of history (as referenced), partly because open models are already so impressive and I don’t see any reason why they wouldn’t continue to do well.

In fact, I do my day-to-day work using an open weight model. Beyond that, can only say I know employers who will probably never countenance using commercially hosted LLMs, but who are already setting up self-hosted ones based on open weight releases.

replies(1): >>Orygin+992

>>kristo+(OP)
Local as in localhost

replies(1): >>helly2+cH3

>>pseudo+IM1
> but it seems we are almost at a golden age again with respect to good enough (if not popular) hardware.

I don't think we're in any golden age since the GPU shortages started, and now memory and disks are becoming super expensive too.

Hardware vendors have shown they don't have an interest in serving consumers and will sell out to hyperscalers the moment they show some green bills. I fear a day where you won't be able to purchase powerful (enough) machines and will be forced to subscribe to a commercial provider to get some compute to do your job.

>>pstuar+fQ
> I'm a noob and am asking as wishful thinking.

Don't minimize your thoughts! Outside voices and naive questions sometimes provide novel insights that might be dismissed, but someone might listen.

I've not done this exactly, but I have setup "chains" that create a fresh context for tool calls so their call chains don't fill the main context. There is no reason why the Tool Calls couldn't be redirected to another LLM endpoint (local for instance). Especially with something like gpt-oss-20b, where I've found executing tools happens at a higher success than claude sonnet via openrouter.

>>estima+p42
Local!

I do not mind the cost honestly. And a bit slower also works. I just use one older mac ultra 2/192G ram and another with an rtx5060/16G and an and r9700/32G. Between those I get my models working fine.

That also gives me full privacy. And that is worth way way way more than any cost.