I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
Hope they update the model page soon https://chat.qwen.ai/settings/model
That's not the product you get when you buy a Claude Code token, though.
The video is sped up. I ran it through LM Studio and then OpenCode. I wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
Please list what capabilities you would like our local model to have and how you would like to have it served to you.
[1] A sovereign digital nation built on a national framework rather than a for-profit or even non-profit one; it will be available at https://stateofutopia.com (you can see some of my recent posts and comments here on HN).
[2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi
I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality/speed tradeoff you're willing to accept on your hardware.
The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
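If you go that route, llama-server exposes an OpenAI-compatible HTTP API, so it's easy to sanity-check throughput on your own machine before committing to a model size. A rough sketch (the port, model file, and prompt below are placeholders, not anything from the guide):

    import time
    from openai import OpenAI

    # Assumes llama-server is already running locally, e.g. something like:
    #   llama-server -m <model>.gguf -c 32768 --port 8080
    # (model file and context size are placeholders)
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    start = time.time()
    resp = client.chat.completions.create(
        model="local",  # llama-server serves whatever model it was started with
        messages=[{"role": "user", "content": "Write a function that reverses a linked list."}],
        max_tokens=256,
    )
    elapsed = time.time() - start

    usage = resp.usage
    print(resp.choices[0].message.content)
    print(f"{usage.completion_tokens} tokens in {elapsed:.1f}s "
          f"(~{usage.completion_tokens / elapsed:.1f} tok/s, prompt processing included)")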
FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8
Sequential (single request)

    Prompt   Gen      Prompt Processing   Token Gen
    Tokens   Tokens   (tokens/sec)        (tokens/sec)
    ------   ------   -----------------   -----------
       521       49               3,157          44.2
     1,033       83               3,917          43.7
     2,057       77               3,937          43.6
     4,105       77               4,453          43.2
     8,201       77               4,710          42.2
Parallel (concurrent requests)

pp4096+tg128 (4K context, 128 gen):

     n    t/s
    --   ----
     1   28.5
     2   39.0
     4   50.4
     8   57.5
    16   61.4
    32   62.0

pp8192+tg128 (8K context, 128 gen):

     n    t/s
    --   ----
     1   21.6
     2   27.1
     4   31.9
     8   32.7
    16   33.7
    32   31.7

Opencode's /connect command has a big list of providers; OpenRouter is on there.
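Regarding the concurrency numbers above: a rough way to see that kind of scaling yourself is to fire n simultaneous requests at a llama-server started with multiple slots. A sketch (model, context size, slot count, and the crude prompt are placeholders, and this measures aggregate generation tok/s, which isn't necessarily the exact metric in the tables):

    import time
    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    # Assumes a local llama-server started with parallel slots, e.g.
    #   llama-server -m <model>.gguf -c 65536 -np 8 --port 8080
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    PROMPT = "word " * 4096  # crude stand-in for a ~4K-token prompt

    def one_request(_):
        r = client.chat.completions.create(
            model="local",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=128,
        )
        return r.usage.completion_tokens

    for n in (1, 2, 4, 8):
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            generated = sum(pool.map(one_request, range(n)))
        print(f"n={n}: {generated / (time.time() - start):.1f} gen tok/s aggregate")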
GPT-oss-120B was also completely failing for me, until someone on Reddit pointed out that you need to pass the reasoning tokens back in when generating a response. One way to do this is described here:
https://openrouter.ai/docs/guides/best-practices/reasoning-t...
Once I did that, it started functioning extremely well, and it's the main model I use for my homemade agents.
Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there are so many broken implementations floating around.
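For anyone who hits the same wall: the fix is to carry the model's reasoning output along in the assistant message you send back on the next turn, instead of only passing back the visible content. A minimal sketch against an OpenAI-compatible endpoint; the field names are assumptions (some llama.cpp builds call it reasoning_content, OpenRouter calls it reasoning), so check the docs for your backend and the OpenRouter guide above:

    from openai import OpenAI

    # Hypothetical local endpoint and model name.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    messages = [{"role": "user", "content": "Plan a refactor of util.py, step by step."}]
    resp = client.chat.completions.create(model="gpt-oss-120b", messages=messages)
    msg = resp.choices[0].message

    # The crucial part: when appending the assistant turn for the next request,
    # keep the reasoning instead of silently dropping it.
    assistant_turn = {"role": "assistant", "content": msg.content}
    reasoning = getattr(msg, "reasoning_content", None) or getattr(msg, "reasoning", None)
    if reasoning:
        assistant_turn["reasoning_content"] = reasoning  # assumed field name, varies by backend
    messages.append(assistant_turn)
    messages.append({"role": "user", "content": "Now apply step 1."})

    followup = client.chat.completions.create(model="gpt-oss-120b", messages=messages)
    print(followup.choices[0].message.content)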
But as a non-native English speaker, I do use AI to help me formulate my thoughts more clearly. Maybe this is off-putting? :)
The non-native speaker point is understandable, of course, but you're much better off writing in your own voice, even if a few mistakes sneak in (who cares, that's fine!). Non-native speakers are more than welcome on HN.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
[1] https://www.idealo.de/preisvergleich/OffersOfProduct/2063285...
Comment 2: >>46873809 2026-02-03T17:13:40 1770138820
Comment 3: >>46873820 2026-02-03T17:14:25 1770138865
All of these detailed comments, in different threads, were posted exactly 45 seconds apart, unless the HN timestamps aren't accurate.
That's very impressive if the account is not posting generated comments, even with AI speech-to-text. I'll leave it at that.
https://old.reddit.com/r/unsloth/comments/1qvt6qy/qwen3coder...