zlacker

This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher end laptops.

I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.

Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next

replies(14): >>vessen+m >>daniel+N1 >>1dom+43 >>embedd+w4 >>dust42+dd >>dehrma+wd >>organs+Zl >>segmon+1q >>dcastm+Bz >>kristo+w21 >>codazo+Ui1 >>mark_l+0s1 >>kristi+EN1 >>brianj+hc2

>>simonw+(OP)
I'm thinking the next step would be to include this as a 'junior dev' and let Opus farm simple stuff out to it. It could be local, but also if it's on cerebras, it could be realllly fast.

replies(1): >>ttoino+Y

>>vessen+m
Cerebras already has GLM 4.7 in the code plans

replies(1): >>vessen+E1

>>ttoino+Y
Yep. But this is like 10x faster; 3B active parameters.

replies(1): >>ttoino+p4

>>simonw+(OP)
It works reasonably well for general tasks, so we're definitely getting there! Probably Qwen3 CLI might be better suited, but haven't tested it yet.

>>simonw+(OP)
I run Qwen3-Coder-30B-A3B-Instruct gguf on a VM with 13gb RAM and a 6gb RTX 2060 mobile GPU passed through to it with ik_llama, and I would describe it as usable, at least. It's running on an old (5 years, maybe more) Razer Blade laptop that has a broken display and 16gb RAM.

I use opencode and have done a few toy projects and little changes in small repositories and can get pretty speedy and stable experience up to a 64k context.

It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller project, scaffolding, basic bug fixes, extra UI tweaks etc.

I don't think "usable" a binary thing though. I know you write lot about this, but it'd be interesting to understand what you're asking the local models to do, and what is it about what they do that you consider unusable on a relative monster of a laptop?

replies(3): >>regula+gn >>simonw+0E >>codedo+Hc3

>>vessen+E1
Cerebras is already 200-800 tps, do you need even faster ?

replies(1): >>overfe+Fh

>>simonw+(OP)
> I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful

I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.

I'm wondering if maybe one could crowdsource chat logs for GPT-OSS-120b running with Codex, then seed another post-training run to fine-tune the 20b variant with the good runs from 120b, if that'd make a big difference. Both models with the reasoning_effort set to high are actually quite good compared to other downloadable models, although the 120b is just about out of reach for 64GB so getting the 20b better for specific use cases seems like it'd be useful.

replies(3): >>gigate+t8 >>andai+iM >>pocksu+1f1

>>embedd+w4
I’ve a 128GB m3 max MacBook Pro. Running the gpt oss model on it via lmstudio once the context gets large enough the fans spin to 100 and it’s unbearable.

replies(2): >>pixelp+wl >>embedd+Dq

>>simonw+(OP)
Unfortunately Qwen3-next is not well supported on Apple silicon, it seems the Qwen team doesn't really care about Apple.

On M1 64GB Q4KM on llama.cpp gives only 20Tok/s while on MLX it is more than twice as fast. However, MLX has problems with kv cache consistency and especially with branching. So while in theory it is twice as fast as llama.cpp it often does the PP all over again which completely trashes performance especially with agentic coding.

So the agony is to decide whether to endure half the possible speed but getting much better kv-caching in return. Or to have twice the speed but then often you have again to sit through prompt processing.

But who knows, maybe Qwen gives them a hand? (hint,hint)

replies(2): >>ttoino+pe >>cgearh+ka1

>>simonw+(OP)
I wonder if the future in ~5 years is almost all local models? High-end computers and GPUs can already do it for decent models, but not sota models. 5 years is enough time to ramp up memory production, consumers to level-up their hardware, and models to optimize down to lower-end hardware while still being really good.

replies(5): >>manbit+Sg >>infini+Wh >>regula+On >>johnsm+hL >>enlyth+bD1

>>dust42+dd
I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tps using LM Studio. What did you try ? What benefit do you get from KV Caching ?

replies(1): >>dust42+Ji

>>dehrma+wd
Plus a long queue of yet-undiscovered architectural improvements

replies(1): >>vercae+cI

>>ttoino+p4
Yes! I don't try to read agent tokens as they are generated, so if code generation decreases from 1 minute to 6 seconds, I'll be delighted. I'll even accept 10s -> 1s speedups. Considering how often I've seen agents spin wheels with different approaches, faster is always better, until models can 1-shot solutions without the repeated "No, wait..." / "Actually..." thinking loops

replies(1): >>pqtyw+0a1

>>dehrma+wd
A lot of manufacturers are bailing on consumer lines to focus on enterprise from what I've read. Not great.

>>ttoino+pe
KV caching means that when you have 10k prompt, all follow up questions return immediately - this is standard with all inference engines.

Now if you are not happy with the last answer, you maybe want to simply regenerate it or change your last question - this is branching of the conversation. Llama.cpp is capable of re-using the KV cache up to that point while MLX does not (I am using MLX server from MLX community project). I haven't tried with LMStudio. Maybe worth a try, thanks for the heads-up.

>>gigate+t8
Laptops are fundamentally a poor form factor for high performance computing.

>>simonw+(OP)
They run fairly well for me on my 128GB Framework Desktop.

replies(1): >>mitter+xS

>>1dom+43
I've had usable results with qwen3:30b, for what I was doing. There's definitely a knack to breaking the problem down enough for it.

What's interesting to me about this model is how good it allegedly is with no thinking mode. That's my main complaint about qwen3:30b, how verbose its reasoning is. For the size it's astonishing otherwise.

>>dehrma+wd
Even without leveling up hardware, 5 years is a loooong time to squeeze the juice out of lower-end model capability. Although in this specific niche we do seem to be leaning on Qwen a lot.

>>simonw+(OP)
you do realize claude opus/gpt5 are probably like 1000B-2000B models? So trying to have a model that's < 60B offer the same level of performance will be a miracle...

replies(3): >>jrop+lt >>regula+yl1 >>epolan+Jz1

>>gigate+t8
Yeah, Apple hardware don't seem ideal for LLMs that are large, give it a go with a dedicated GPU if you're inclined and you'll see a big difference :)

replies(2): >>polite+E21 >>marci+fG2

>>segmon+1q
I don't buy this. I've long wondered if the larger models, while exhibiting more useful knowledge, are not more wasteful as we greedily explore the frontier of "bigger is getting us better results, make it bigger". Qwen3-Coder-Next seems to be a point for that thought: we need to spend some time exploring what smaller models are capable of.

Perhaps I'm grossly wrong -- I guess time will tell.

replies(2): >>bityar+eH >>segmon+nM

>>simonw+(OP)
I have the same experience with local models. I really want to use them, but right now, they're not on par with propietary models on capabilities nor speed (at least if you're using a Mac).

replies(1): >>bityar+EE

>>1dom+43
Honestly I've been completely spoiled by Claude Code and Codex CLI against hosted models.

I'm hoping for an experience where I can tell my computer to do a thing - write a code, check for logged errors, find something in a bunch of files - and I get an answer a few moments later.

Setting a task and then coming back to see if it worked an hour later is too much friction for me!

>>dcastm+Bz
Local models on your laptop will never be as powerful as the ones that take up a rack of datacenter equipment. But there is still a surprising amount of overlap if you are willing to understand and accept the limitations.

>>jrop+lt
You are not wrong, small models can be trained for niche use cases and there are lots of people and companies doing that. The problem is that you need one of those for each use case whereas the bigger models can cover a bigger problem space.

There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.

replies(3): >>abraae+9c1 >>dagss+Rl1 >>sally_+4M3

>>manbit+Sg
I'm suprised there isn't more "hope" in this area. Even things like the GPT Pro models; surely that sort of reasoning/synthesis will eventually make its way into local models. And that's something that's already been discovered.

Just the other day I was reading a paper about ANNs whose connections aren't strictly feedforward but, rather, circular connections proliferate. It increases expressiveness at the (huge) cost of eliminating the current gradient descent algorithms. As compute gets cheaper and cheaper, these things will become feasible (greater expressiveness, after all, equates to greater intelligence).

replies(1): >>bigfud+Oa1

>>dehrma+wd
Opensource or local models will always heavily lag frontier.

Who pays for a free model? GPU training isn't free!

I remember early on people saying 100B+ models will run on your phone like nowish. They were completely wrong and I don't think it's going to ever really change.

People always will want the fastest, best, easiest setup method.

"Good enough" massively changes when your marketing team is managing k8s clusters with frontier systems in the near future.

replies(6): >>margal+661 >>kybern+o71 >>__Matr+od1 >>Vinnl+gi1 >>torgin+nj1 >>bee_ri+0k1

>>embedd+w4
Are you running 120B agentic? I tried using it in a few different setups and it failed hard in every one. It would just give up after a second or two every time.

I wonder if it has to do with the message format, since it should be able to do tool use afaict.

replies(1): >>nekita+xe2

>>jrop+lt
eventually we will have smarter smaller models, but as of now, larger models are smarter by far. time and experience has already answered that.

replies(1): >>adastr+la1

>>organs+Zl
what do you run this on if I may ask? lmstudio, ollama, lama? which cli?

replies(2): >>MrDrMc+uF1 >>redwoo+GW1

>>simonw+(OP)
We need a new word, not "local model" but "my own computers model" CapEx based

This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.

That's not my definition of local. Mine is "local network". so call it the "LAN model" until we come up with something better. "Self-host" exists but this usually means more "open-weights" as opposed to clamping the performance of the model.

It should be defined as ~sub-$10k, using Steve Jobs megapenny unit.

Essentially classify things as how many megapennies of spend a machine is that won't OOM on it.

That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that's at most single digit thousands of dollars. And if I was feeling fancy, could potentially fine-tune on the days scale.

A modern 5090 build-out with a threadripper, nvme, 256GB RAM, this will run you about 10k +/- 1k. The MLX route is about $6000 out the door after tax (m3-ultra 60 core with 256GB).

Lastly it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters and the active parameter count + quantization is becoming a poorer approximation given the SOTA innovations.

What might be needed is some standardized eval benchmark against standardized hardware classes with basic real world tasks like toolcalling, code generation, and document procesing. There's plenty of "good enough" models out there for a large category of every day tasks, now I want to find out what runs the best

Take a gen6 thinkpad P14s/macbook pro and a 5090/mac studio, run the benchmark and then we can say something like "time-to-first-token/token-per-second/memory-used/total-time-of-test" and rate this as independent from how accurate the model was.

replies(7): >>bigyab+3a1 >>echelo+0d1 >>christ+6f1 >>zozbot+8i1 >>opencl+iu1 >>mrklol+6t2 >>estima+V63

>>embedd+Dq
What are some good GPUs to look for if you're getting started?

replies(1): >>wincy+Dk2

>>johnsm+hL
I don't think this is as true as you think.

People do not care about the fastest and best past a point.

Let's use transportation as an analogy. If all you have is a horse, a car is a massive improvement. And when cars were just invented, a car with a 40mph top speed was a massive improvement over one with a 20mph top speed and everyone swapped.

While cars with 200mph top speeds exist, most people don't buy them. We all collectively decided that for most of us, most of the time, a top speed of 110-120 was plenty, and that envelope stopped being pushed for consumer vehicles.

If what currently takes Claude Opus 10 minutes to do can be done is 30ms, then making something that can do it in 20ms isn't going to be enough to get everyone to pay a bunch of extra money for.

Companies will buy the cheapest thing that meets their needs. SOTA models right now are much better than the previous generation but we have been seeing diminishing returns in the jump sizes with each of the last couple generations. If the gap between current and last gen shrinks enough, then people won't pay extra for current gen if they don't need it. Just like right now you might use Sonnet or Haiku if you don't think you need Opus.

replies(1): >>johnsm+pB1

>>johnsm+hL
Gpt3.5 as used in the first commercially available chat gpt is believed to be hundreds of billions of parameters. There are now models I can run on my phone that feel like they have similar levels of capability.

Phones are never going to run the largest models locally because they just don't have the size, but we're seeing improvements in capability at small sizes over time that mean that you can run a model on your phone now that would have required hundreds of billions of parameters less than 6 years ago.

replies(2): >>onion2+1d1 >>johnsm+YC1

>>overfe+Fh
> until models can 1-shot solutions without the repeated "No, wait..." / "Actually..." thinking loops

That would imply they'd have to be actually smarter than humans, not just faster and be able to scale infinitely. IMHO that's still very far away..

>>kristo+w21
OOM is a pretty terrible benchmark too, though. You can build a DDR4 machine that "technically" loads 256gb models for maybe $1000 used, but then you've got to account for the compute aspect and that's constrained by a number of different variables. A super-sparse model might run great on that DDR4 machine, whereas a 32b model would cause it to chug.

There's just not a good way to visualize the compute needed, with all the nuance that exists. I think that trying to create these abstractions are what leads to people impulse buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.

replies(1): >>French+Ze1

>>dust42+dd
Any notes on the problems with MLX caching? I’ve experimented with local models on my MacBook and there’s usually a good speedup from MLX, but I wasn’t aware there’s an issue with prompt caching. Is it from MLX itself or LMstudio/mlx-lm/etc?

replies(2): >>dust42+mu1 >>anon37+kE1

>>segmon+nM
Eventually we might have smaller but just as smart models. There is no guarantee. There are information limits to smaller models of course.

>>vercae+cI
It seems like a lot of the benefits of SOTA models are from data though, not architecture? Won't the moat of the big 3/4 players in getting data only grow as they are integrated deeper into businesses workflows?

replies(1): >>vercae+Pc1

>>bityar+eH
Is that counterintuitive? If I had a model trained on 10 different programming languages, including my target language, I would expect it to do better than a model trained only on my target language, simply because it has access to so much more code/algorithms/examples then my language alone.

i.e. there is a lot of commonality between programming languages just as there is between human languages, so training on one language would be beneficial to competency in other languages.

replies(1): >>dagss+Hk1

>>bigfud+Oa1
That's a good point. I'm not familiar enough with the various moats to comment.

I was just talking at a high level. If transformers are HDD technology, maybe there's SSD right around the corner that's a paradigm shift for the whole industry (but for the average user just looks like better/smarter models). It's a very new field, and it's not unrealistic that major discoveries shake things up in the next decade or less.

>>kristo+w21
I don't even need "open weights" to run on hardware I own.

I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running.

I do not want my career to become dependent upon Anthropic.

Honestly, the best thing for "open" might be for us to build open pipes and services and models where we can rent cloud. Large models will outpace small models: LLMs, video models, "world" models, etc.

I'd even be fine time-sharing a running instance of a large model in a large cloud. As long as all the constituent pieces are open where I could (in theory) distill it, run it myself, spin up my own copy, etc.

I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones.

We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.

replies(1): >>Aurorn+0q1

>>kybern+o71
The G in GPT stands for Generalized. You don't need that for specialist models, so the size can be much smaller. Even coding models are quite general as they don't focus on a language or a domain. I imagine a model specifically for something like React could be very effective with a couple of billion parameters, especially if it was a distill of a more general model.

replies(2): >>Mzxgck+Vh1 >>christ+Px1

>>johnsm+hL
I think we'll eventually find a way to make the cycle smaller, so instead of writing a stackoverflow post in 2024 and using a model trained on it in 2025 I'll be contributing to the expertise of a distributed-model-ish-thing on Monday and benefitting from that contribution on Tuesday.

When that happens, the most powerful AI will be whichever has the most virtuous cycles going with as wide a set of active users as possible. Free will be hard to compete with because raising the price will exclude the users that make it work.

Until then though, I think you're right that open will lag.

>>bigyab+3a1
> time-to-first-token/token-per-second/memory-used/total-time-of-test

Would it not help with the DDR4 example though if we had more "real world" tests?

replies(1): >>bigyab+Tg1

>>embedd+w4
You are describing distillation, there are better ways to do it, and it was done in the past, Deepseek distilled onto Qwen.

>>kristo+w21
I won't need a heater with that running in my room.

replies(2): >>hedora+PO1 >>wincy+3k2

>>French+Ze1
Maybe, but even that fourth-order metric is missing key performance details like context length and model size/sparsity.

The bigger takeaway (IMO) is that there will never really be hardware that scales like Claude or ChatGPT does. I love local AI, but it stresses the fundamental limits of on-device compute.

>>onion2+1d1
I'll be that guy: the "G" in GPT stands for "Generative".

>>kristo+w21
You can run plenty of models on a $10K machine or even a lot less than that, it all depends how much you want to wait for results. Streaming weights from SSD storage using mmap() is already a reality when running the largest and sparsest models. You can save even more on memory by limiting KV caching at the cost of extra compute, and there may be ways to push RAM savings even higher simply by tweaking the extent to which model activations are recomputed as needed.

replies(1): >>kristo+bl1

>>johnsm+hL
> People always will want the fastest, best, easiest setup method

When there are no other downsides, sure. But when the frontier companies start tightening the thumbscrews, price will influence what people consider good enough.

>>simonw+(OP)
I can't get Codex CLI or Claude Code to use small local models and to use tools. This is because those tools use XML and the small local models have JSON tool use baked into them. No amount of prompting can fix it.

In a day or two I'll release my answer to this problem. But, I'm curious, have you had a different experience where tool use works in one of these CLIs with a small local model?

replies(2): >>regula+Sk1 >>zackif+fn1

>>johnsm+hL
I don't know about frontier, I code nowadays a lot using Opus 4.5, in a way that I instruct it to do something (like complex refactor etc) - I like that it's really good at actually doing what its told and only occasionally do I have to fight it when it goes off the rails. It also does not hallucinate all that much in my experience (Im writing Js, YMMV with other languages), and is good at spotting dumb mistakes.

That said, I'm not sure if this capability is only achievable in huge frontier models, I would be perfectly content using a model that can do this (acting as a force multiplier), and not much else.

>>johnsm+hL
The calculation will probably get better for locally hosted models once investor generosity runs out for the remotely hosted models.

>>abraae+9c1
> simply because it has access to so much more code/algorithms/examples then my language alone

I assumed that is what was catered for with "even when controlling for the size of the training set".

I.e. assuming I am reading it right: That it is better to get the same data as 25% in 4 languages, than 100% in one language.

>>codazo+Ui1
Surely the answer is a very small proxy server between the two?

replies(1): >>codazo+sl1

>>zozbot+8i1
Yeah there's a lot of people that advocate for really slow inference on cheap infra. That's something else that should be expressed in this fidelity

Because honestly I don't care about 0.2 tps for my use cases although I've spoken with many who are fine with numbers like that.

At least the people I've talked to they talk about how if they have a very high confidence score that the model will succeed they don't mind the wait.

Essentially a task failure is 1 in 10, I want to monitor and retry.

If it's 1 in 1000, then I can walk away.

The reality is most people don't have a bearing on what this order of magnitude actually is for a given task. So unless you have high confidence in your confidence score, slow is useless

But sometimes you do...

replies(1): >>zozbot+Pm1

>>regula+Sk1
That might work, but I keep seeing people talk about this, so there must be a simple solution that I'm over-looking. My solution is to write my own minimal and experimental CLI that talks JSON tools.

>>segmon+1q
There is (must be - information theory) a size/capacity efficiency frontier. There is no particular reason to think we're anywhere near it right now.

>>bityar+eH
Not an AI researcher and I don't really know, but intuitively it makes a lot of sense to me.

To do well as an LLM you want to end up with the weights that gets furthest in the direction of "reasoning".

So assume that with just one language there's a possibility to get stuck in local optima of weights that do well on the English test set but which doesn't reason well.

If you then take the same model size but it has to manage to learn several languages, with the same number of weights, this would eliminate a lot of those local optima because if you don't manage to get the weights into a regime where real reasoning/deeper concepts is "understood" then it's not possible to do well with several languages with the same number of weights.

And if you speak several languages that would naturally bring in more abstraction, that the concept of "cat" is different from the word "cat" in a given language, and so on.

>>kristo+bl1
If you launch enough tasks in parallel you aren't going to care that 1 in 10 failed, as long as the other 9 are good. Just rerun the failed job whenever you get around to it, the infra will still be getting plenty of utilization on the rest.

>>codazo+Ui1
I'm using this model right now in claude code with LM Studio perfectly, on a macbook pro

replies(1): >>codazo+Wo1

>>zackif+fn1
You mean Qwen3-Coder-Next? I haven't tried that model itself, yet, because I assume it's too big for me. I have a modest 16GB MacBook Air so I'm restricted to really small stuff. I'm thinking about buying a machine with a GPU to run some of these.

Anywayz, maybe I should try some other models. The ones that haven't worked for tool calling, for me are:

Llama3.1

Llama3.2

Qwen2.5-coder

Qwen3-coder

All these in 7b, 8b, or sometimes 30b (painfully) models.

I should also note that I'm typically using Ollama. Maybe LM Studio or llama.cpp somehow improve on this?

replies(1): >>vessen+tt2

>>echelo+0d1
> I do not want my career to become dependent upon Anthropic

As someone who switches between Anthropic and ChatGPT depending on the month and has dabbled with other providers and some local LLMs, I think this fear is unfounded.

It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

replies(3): >>airstr+Tr1 >>echelo+At1 >>mrklol+kt2

>>Aurorn+0q1
right, but ChatGPT might not exist at some point, and if we don't force feed the open inference ecosystem and infrastructure back into the mouths of the AI devourer that is this hype cycle, we'll simply be accepting our inevitable, painful death

replies(2): >>christ+4x1 >>Aurorn+LW1

>>simonw+(OP)
I configured Claude Code to use a local model (ollama run glm-4.7-flash) that runs really well on a 32G M2Pro macmini. Maybe my standards are too low, but I was using that combination to clean up the code, make improvements, and add docs and tests to a bunch of old git repo experiment projects.

replies(1): >>redund+GT1

>>Aurorn+0q1
> It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

We have two cell phone providers. Google is removing the ability to install binaries, and the other one has never allowed freedom. All computing is taxed, defaults are set to the incumbent monopolies. Searching, even for trademarks, is a forced bidding war. Businesses have to shed customer relationships, get poached on brand relationships, and jump through hoops week after week. The FTC/DOJ do nothing, and the EU hasn't done much either.

I can't even imagine what this will be like for engineering once this becomes necessary to do our jobs. We've been spoiled by not needing many tools - other industries, like medical or industrial research, tie their employment to a physical location and set of expensive industrial tools. You lose your job, you have to physically move - possibly to another state.

What happens when Anthropic and OpenAI ban you? Or decide to only sell to industry?

This is just the start - we're going to become more dependent upon these tools to the point we're serfs. We might have two choices, and that's demonstrably (with the current incumbency) not a good world.

Computing is quickly becoming a non-local phenomenon. Google and the platforms broke the dream of the open web. We're about to witness the death of the personal computer if we don't do anything about it.

replies(1): >>pseudo+eP2

>>kristo+w21
For context on what cloud API costs look like when running coding agents:

With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.

At 1K tasks/day that's ~$1.5K-3K/month in API spend.

The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.

Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.

replies(3): >>taneq+7E1 >>pstuar+LS1 >>jychan+f32

>>cgearh+ka1
It is the buffer implementation. [u1 10kTok]->[a1]->[u2]->[a2]. If you branch between the assistant1 and user2 answers then MLX does reprocess the u1 prompt of let's say 10k tokens while llama.cpp does not.

I just tested with GGUF and MLX of Qwen3-Coder-Next with llama.cpp and now with LMStudio. As I do branching very often, it is highly annoying for me to the point of being unusable. Q3-30B is much more usable then on Mac - but by far not as powerful.

>>airstr+Tr1
If they die there will be so much hardware released to do other tasks.

replies(1): >>echelo+5z1

>>onion2+1d1
Thats what i want and orchestrator model that operates with a small context and then very specialized small models for react etc

>>christ+4x1
Perhaps not tasks you get the opportunity to do.

Your job might be assigned to some other legal entity renting some other compute.

If this goes as according to some of their plans, we might all be out of the picture one day.

If these systems are closed, you might not get the opportunity to hire them yourself to build something you have ownership in. You might be cut out.

>>segmon+1q
Aren't both latest opus and sonnet smaller than the previous versions?

>>margal+661
This is the assumption of a hard plateu we can effectively optimize forever towards while possible we havn't seen it.

Again my point is "good enough" changes as possibilities open. Marketing teams running entire infra stacks is an insane idea today but may not be in the future.

You could easily code with a local model similar to gpt 4 or 3 now but I will 10-100x your performance with a frontier model and that will fundamentally not change.

Hmmm but maybe there's an argument of a static task. Once a model hits that ability of that specific task you can optimize it into a smaller model. So I guess I buy the argument for people working on statically capped conplexity tasks?

PII detection for example, a <500M model will outperform a 1-8B param model on that narrow task. But at the same time just a pii detection bot is not a product anymore. So yes a opensource one does it but as a result its fundamentally less valuable and I need to build higher and larger products for the value?

>>kybern+o71
Sure but the moment you can use that small model locally its capabilities are no longer differntiated or valuable no?

I supose the future will look exacrly like now. Some mixture of local and non local.

I guess my argument is that market dominated by local doesn't seem right and I think the balance will look similar to what it is right now

>>dehrma+wd
I'm hoping so. What's amazing is that with local models you don't suffer from what I call "usage anxiety" where I find myself saving my Claude usage for hypothetical more important things that may come up, or constantly adjusting prompts and doing some manual work myself to spare token usage.

Having this power locally means you can play around and experiment more without worries, it sounds like a wonderful future.

>>opencl+iu1
At this point isn’t the marginal cost based on power consumption? At 30c/kWh and with a beefy desktop pc pulling up to half a kW, that’s 15c/hr. For true zero marginal cost, maybe get solar panels. :P

replies(1): >>EGreg+5U1

>>cgearh+ka1
There’s this issue/outstanding PR: https://github.com/lmstudio-ai/mlx-engine/pull/188#issuecomm...

>>mitter+xS
Can't speak for parent, but I've had decent luck with llama.cpp on my triple Ryzen AI Pro 9700 XTs.

>>simonw+(OP)
Why don't you try it out in Opencode? It's possible to hook up the openrouter api, and some providers have started to host it there [1]. It's not yet available in opencode's model list [2].

Opencode's /connect command has a big list of providers, openrouter is on there.

[1] https://openrouter.ai/qwen/qwen3-coder-next

[2] https://opencode.ai/docs/zen/#endpoints

replies(1): >>simonw+P92

>>christ+6f1
This looks like it’ll run easily on a Strix Halo (180W TDP), and be a little sluggish on previous gen AMDs (80W TDP).

I can’t be bothered to check TDPs on 64GB macbooks, but none of these devices really count as space heaters.

>>opencl+iu1
Might there be a way to leverage local models just to help minimize the retries -- doing the tool calling handling and giving the agent "perfect execution"?

I'm a noob and am asking as wishful thinking.

replies(1): >>jermau+Ln3

>>mark_l+0s1
Did you have to do anything special to get it to work? I tried and it would just bug out, things like respond with JSON strings summarizing what I asked of it or just outright getting things wrong entirely. For example, I asked it to summarize what a specific .js file did and it provided me with new code it made up based on the file name...

replies(1): >>mark_l+oU1

>>taneq+7E1
This is an interesting question actually!

Marginal cost includes energy usage but also I burned out a MacBook GPU with vanity-eth last year so wear-and-tear is also a cost.

>>redund+GT1
Yes, I had to set the Ollama context size to 32K

replies(1): >>redund+s62

>>mitter+xS
I run Qwen3-Coder-Next (Qwen3-Coder-Next-UD-Q4_K_XL) on the Framework ITX board (Max+ 395 - 128GB) custom build. Avg. eval at 200-300 t/s and output at 35-40 t/s running with llama.cpp using rocm. Prefer Claude Code for cli.

>>airstr+Tr1
> right, but ChatGPT might not exist at some point

There are multiple frontier models to choose from.

They’re not all going to disappear.

replies(4): >>airstr+i22 >>hahajk+da2 >>Bukhma+pe2 >>kristo+qq6

>>Aurorn+LW1
right, and the less we rely on ChatGPT and Claude, the more we give power to "all other frontier models", which right now have very, very little market share

>>opencl+iu1
On the other hand, Deepseek V3.2 is $0.38 per million tokens output. And on openrouter, most providers serve it at 20 tokens/sec.

At 20t/s over 1 month, that's... $19something running literally 24/7. In reality it'd be cheaper than that.

I bet you'd burn more than $20 in electricity with a beefy machine that can run Deepseek.

The economics of batch>1 inference does not go in favor of consumers.

replies(1): >>selcuk+Fr2

>>mark_l+oU1
Thank you, it's working as expected now!

>>kristi+EN1
Oh good! OpenRouter didn't have it this morning when I first checked.

>>Aurorn+LW1
the companies could merge or buy each other

>>simonw+(OP)
TFW 48gb M4 Pro isn't going to run it.

>>Aurorn+LW1
This seems absurdly naive to me with the path big tech has taken in the last 5 years. There’s literally infinite upside and almost no downside to constraining the ecosystem for the big players.

You don’t think that eventually Google/OpenAI are going to go to the government and say, “it’s really dangerous to have all these foreign/unreglated models being used everywhere could you please get rid of them?”. Suddenly they have an oligopoly on the market.

>>andai+iM
This is a common problem for people trying to run the GPT-oss models themselves. Reposting my comment here:

GPT-oss-120B was also completely failing for me, until someone on reddit pointed out that you need to pass back in the reasoning tokens when generating a response. One way to do this is described here:

https://openrouter.ai/docs/guides/best-practices/reasoning-t...

Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.

Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there's so many broken implementations floating around.

replies(1): >>andai+wH5

>>christ+6f1
Haha running OSS-120B on my 5090 with most of the layers in video memory, some in RAM with LM Studio, I was hard pressed to get it to actually use anywhere near the full 600W. Gaming in 4K playing a modern game generates substantially more sustained heat.

>>polite+E21
If you want to actually run models on a computer at home? The RTX 6000 Blackwell Pro Workstation, hands down. 96GB of VRAM, fits into a standard case (I mean, it’s big, as it’s essentially the same form factor as an RTX 5090 just with a lot denser VRAM).

My RTX 5090 can fit OSS-20B but it’s a bit underwhelming, and for $3000 if I didn’t also use it for gaming I’d have been pretty disappointed.

replies(1): >>gigate+Tx4

>>jychan+f32
> At 20t/s over 1 month, that's... $19something running literally 24/7.

You can run agents in parallel, but yeah, that's a fair comparison.

>>kristo+w21
I mean if it’s running in your lan, isn’t it local? :D

>>Aurorn+0q1
Because they make it easy. Imagine they limit their models to their tooling and suddenly it’s introducing work.

>>codazo+Wo1
I’m mostly out of the local model game, but I can say confidently that Llama will be a waste of time for agentic workflows - it was trained before agentic fine tuning was a thing, as far as I know. It’s going to be tough for tool calling, probably regardless of format you send the request in. Also 8b models are tiny. You could significantly upgrade your inference quality and keep your privacy with say a machine at lambda labs, or some cheaper provider, though. Probably for $1/hr - where an hour is a many times more inference than an hour on your MBA.

>>embedd+Dq
Their issue with the mac was the sound of fans spinning. I doubt a dedicated gpu will resolved that.

>>echelo+At1
I just don’t see it.

I mean, the long arch of computing history has had us wobble back and forth in regards to how closed down it all was, but it seems we are almost at a golden age again with respect to good enough (if not popular) hardware.

On the software front, we definitely swung back from the age of Microsoft. Sure, Linux is a lot more corporate than people admit, but it’s a lot more open than Microsoft’s offerings and it’s capable of running on practically everything except the smallest IOT device.

As for LLMs. I know people have hyped themselves up to think that if you aren’t chasing the latest LLM release and running swarms of agents, you are next in the queues for the soup kitchens, but again, I don’t see why it HAS to play out that way, partly because of history (as referenced), partly because open models are already so impressive and I don’t see any reason why they wouldn’t continue to do well.

In fact, I do my day-to-day work using an open weight model. Beyond that, can only say I know employers who will probably never countenance using commercially hosted LLMs, but who are already setting up self-hosted ones based on open weight releases.

replies(1): >>Orygin+Fb3

>>kristo+w21
Local as in localhost

replies(1): >>helly2+IJ4

>>pseudo+eP2
> but it seems we are almost at a golden age again with respect to good enough (if not popular) hardware.

I don't think we're in any golden age since the GPU shortages started, and now memory and disks are becoming super expensive too.

Hardware vendors have shown they don't have an interest in serving consumers and will sell out to hyperscalers the moment they show some green bills. I fear a day where you won't be able to purchase powerful (enough) machines and will be forced to subscribe to a commercial provider to get some compute to do your job.

>>1dom+43
30-A3B model gives 13 t/s without GPU (I noticed that token/sec * # of params matches memory bandwidth).

>>pstuar+LS1
> I'm a noob and am asking as wishful thinking.

Don't minimize your thoughts! Outside voices and naive questions sometimes provide novel insights that might be dismissed, but someone might listen.

I've not done this exactly, but I have setup "chains" that create a fresh context for tool calls so their call chains don't fill the main context. There is no reason why the Tool Calls couldn't be redirected to another LLM endpoint (local for instance). Especially with something like gpt-oss-20b, where I've found executing tools happens at a higher success than claude sonnet via openrouter.

>>bityar+eH
Cool, I didn't know about this phenomenon. Reading up a little it seems like training multilingual forces the model to optimize it's internal "conceptual layer" weights better instead of relying solely on English linguistics. Papers also mention issues arising from overdoing it, so my guess is even credentialed AI researchers are currently limited to empirical methods here.

>>wincy+Dk2
At anywhere from 9-12k euros [1] I’d be better off paying 200 a month for the super duper lots of tokens tier at 2400 a year and get model improvements and token improvements etc etc for “free” than buy up such a card and it be obsolete on purchase as newer better cards are always coming out.

[1] https://www.idealo.de/preisvergleich/OffersOfProduct/2063285...

>>estima+V63
Local!

I do not mind the cost honestly. And a bit slower also works. I just use one older mac ultra 2/192G ram and another with an rtx5060/16G and an and r9700/32G. Between those I get my models working fine.

That also gives me full privacy. And that is worth way way way more than any cost.

>>nekita+xe2
I used it with OpenAI's Codex, which had official support for it, and it was still ass. (Maybe they screwed up this part too? Haha)

>>Aurorn+LW1
Yes they are.

It'll all be open weights commodity just like all Unix vendors disappeared