I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.
That's not my definition of local. Mine is "local network". so call it the "LAN model" until we come up with something better. "Self-host" exists but this usually means more "open-weights" as opposed to clamping the performance of the model.
It should be defined as ~sub-$10k, using Steve Jobs megapenny unit.
Essentially classify things as how many megapennies of spend a machine is that won't OOM on it.
That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that's at most single digit thousands of dollars. And if I was feeling fancy, could potentially fine-tune on the days scale.
A modern 5090 build-out with a threadripper, nvme, 256GB RAM, this will run you about 10k +/- 1k. The MLX route is about $6000 out the door after tax (m3-ultra 60 core with 256GB).
Lastly it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters and the active parameter count + quantization is becoming a poorer approximation given the SOTA innovations.
What might be needed is some standardized eval benchmark against standardized hardware classes with basic real world tasks like toolcalling, code generation, and document procesing. There's plenty of "good enough" models out there for a large category of every day tasks, now I want to find out what runs the best
Take a gen6 thinkpad P14s/macbook pro and a 5090/mac studio, run the benchmark and then we can say something like "time-to-first-token/token-per-second/memory-used/total-time-of-test" and rate this as independent from how accurate the model was.
With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.
At 1K tasks/day that's ~$1.5K-3K/month in API spend.
The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.
Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.
At 20t/s over 1 month, that's... $19something running literally 24/7. In reality it'd be cheaper than that.
I bet you'd burn more than $20 in electricity with a beefy machine that can run Deepseek.
The economics of batch>1 inference does not go in favor of consumers.