I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
Hope they update the model page soon https://chat.qwen.ai/settings/model
That's not the product you get when you buy a Claude Code token, though.
The video is sped up. I ran it through LM Studio and then OpenCode. I wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
Please list what capabilities you would like our local model to have and how you would like to have it served to you.
[1] A sovereign digital nation built on a national framework rather than a for-profit or even non-profit one; it will be available at https://stateofutopia.com (you can see some of my recent posts and comments here on HN).
[2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi
I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality/speed tradeoff you're willing to accept on your hardware.
The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
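If you go that route, llama-server exposes an OpenAI-compatible HTTP API, so it's easy to sanity-check throughput on your own machine before committing to a model size. A rough sketch (the port, model file, and prompt below are placeholders, not anything from the guide):

    import time
    from openai import OpenAI

    # Assumes llama-server is already running locally, e.g. something like:
    #   llama-server -m <model>.gguf -c 32768 --port 8080
    # (model file and context size are placeholders)
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    start = time.time()
    resp = client.chat.completions.create(
        model="local",  # llama-server serves whatever model it was started with
        messages=[{"role": "user", "content": "Write a function that reverses a linked list."}],
        max_tokens=256,
    )
    elapsed = time.time() - start

    usage = resp.usage
    print(resp.choices[0].message.content)
    print(f"{usage.completion_tokens} tokens in {elapsed:.1f}s "
          f"(~{usage.completion_tokens / elapsed:.1f} tok/s, prompt processing included)")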
FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8
Sequential (single request)

    Prompt   Gen      Prompt Processing   Token Gen
    Tokens   Tokens   (tokens/sec)        (tokens/sec)
    ------   ------   -----------------   -----------
       521       49               3,157          44.2
     1,033       83               3,917          43.7
     2,057       77               3,937          43.6
     4,105       77               4,453          43.2
     8,201       77               4,710          42.2
Parallel (concurrent requests)

pp4096+tg128 (4K context, 128 gen):

     n    t/s
    --   ----
     1   28.5
     2   39.0
     4   50.4
     8   57.5
    16   61.4
    32   62.0

pp8192+tg128 (8K context, 128 gen):

     n    t/s
    --   ----
     1   21.6
     2   27.1
     4   31.9
     8   32.7
    16   33.7
    32   31.7

Opencode's /connect command has a big list of providers; OpenRouter is on there.
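Regarding the concurrency numbers above: a rough way to see that kind of scaling yourself is to fire n simultaneous requests at a llama-server started with multiple slots. A sketch (model, context size, slot count, and the crude prompt are placeholders, and this measures aggregate generation tok/s, which isn't necessarily the exact metric in the tables):

    import time
    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    # Assumes a local llama-server started with parallel slots, e.g.
    #   llama-server -m <model>.gguf -c 65536 -np 8 --port 8080
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    PROMPT = "word " * 4096  # crude stand-in for a ~4K-token prompt

    def one_request(_):
        r = client.chat.completions.create(
            model="local",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=128,
        )
        return r.usage.completion_tokens

    for n in (1, 2, 4, 8):
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            generated = sum(pool.map(one_request, range(n)))
        print(f"n={n}: {generated / (time.time() - start):.1f} gen tok/s aggregate")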
GPT-oss-120B was also completely failing for me, until someone on Reddit pointed out that you need to pass the reasoning tokens back in when generating a response. One way to do this is described here:
https://openrouter.ai/docs/guides/best-practices/reasoning-t...
Once I did that, it started functioning extremely well, and it's the main model I use for my homemade agents.
Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there are so many broken implementations floating around.
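For anyone who hits the same wall: the fix is to carry the model's reasoning output along in the assistant message you send back on the next turn, instead of only passing back the visible content. A minimal sketch against an OpenAI-compatible endpoint; the field names are assumptions (some llama.cpp builds call it reasoning_content, OpenRouter calls it reasoning), so check the docs for your backend and the OpenRouter guide above:

    from openai import OpenAI

    # Hypothetical local endpoint and model name.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    messages = [{"role": "user", "content": "Plan a refactor of util.py, step by step."}]
    resp = client.chat.completions.create(model="gpt-oss-120b", messages=messages)
    msg = resp.choices[0].message

    # The crucial part: when appending the assistant turn for the next request,
    # keep the reasoning instead of silently dropping it.
    assistant_turn = {"role": "assistant", "content": msg.content}
    reasoning = getattr(msg, "reasoning_content", None) or getattr(msg, "reasoning", None)
    if reasoning:
        assistant_turn["reasoning_content"] = reasoning  # assumed field name, varies by backend
    messages.append(assistant_turn)
    messages.append({"role": "user", "content": "Now apply step 1."})

    followup = client.chat.completions.create(model="gpt-oss-120b", messages=messages)
    print(followup.choices[0].message.content)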
But as a non-native English speaker, I do use AI to help me formulate my thoughts more clearly. Maybe this is off-putting? :)
The non-native speaker point is understandable, of course, but you're much better off writing in your own voice, even if a few mistakes sneak in (who cares, that's fine!). Non-native speakers are more than welcome on HN.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
[1] https://www.idealo.de/preisvergleich/OffersOfProduct/2063285...
Comment 2: >>46873809 2026-02-03T17:13:40 1770138820
Comment 3: >>46873820 2026-02-03T17:14:25 1770138865
All of these detailed comments, in different threads, were posted exactly 45 seconds apart, unless the HN timestamps aren't accurate.
That's very impressive if the account is not posting generated comments, even with AI speech-to-text. I'll leave it at that.
https://old.reddit.com/r/unsloth/comments/1qvt6qy/qwen3coder...