zlacker

[return to "Qwen3-Coder-Next"]
1. simonw+l3[view] [source] 2026-02-03 16:15:21
>>daniel+(OP)
This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher end laptops.

I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough to be useful.

Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next

◧◩
2. dust42+yg[view] [source] 2026-02-03 17:07:21
>>simonw+l3
Unfortunately Qwen3-Next is not well supported on Apple silicon; it seems the Qwen team doesn't really care about Apple.

On an M1 64GB, Q4_K_M on llama.cpp gives only 20 tok/s, while on MLX it is more than twice as fast. However, MLX has problems with KV-cache consistency, especially with branching. So while in theory it is twice as fast as llama.cpp, it often redoes the prompt processing (PP) all over again, which completely trashes performance, especially with agentic coding.

So the agony is deciding whether to endure half the possible speed in return for much better KV caching, or to have twice the speed but then often sit through prompt processing again.
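To put rough numbers on that trade-off, here is a back-of-the-envelope turnaround-time calculation for one agent turn. Only the ~20 tok/s llama.cpp generation figure comes from the thread; the prompt-processing speeds and the 2x MLX generation multiplier are illustrative assumptions, not benchmarks:

```python
def turn_seconds(prompt_tokens, gen_tokens, pp_speed, tg_speed, cached_tokens=0):
    """Seconds for one turn: prompt processing of uncached tokens + generation."""
    return (prompt_tokens - cached_tokens) / pp_speed + gen_tokens / tg_speed

context = 10_000  # tokens already in the conversation
reply = 500       # tokens generated this turn

# llama.cpp: ~20 tok/s generation (from the thread), context fully cached
# across the branch. PP speed of 150 tok/s is an assumed placeholder.
llama = turn_seconds(context, reply, pp_speed=150, tg_speed=20,
                     cached_tokens=context)

# MLX: assume ~2x the generation and PP speed, but the branch invalidated
# the cache, so all 10k context tokens go through prompt processing again.
mlx = turn_seconds(context, reply, pp_speed=300, tg_speed=40,
                   cached_tokens=0)

print(f"llama.cpp (cache hit): {llama:.0f}s")  # 25s
print(f"MLX (cache miss):      {mlx:.0f}s")    # 46s
```

Under these assumptions the nominally faster backend loses the turn outright whenever the long context has to be re-processed, which is the "agony" above.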

But who knows, maybe Qwen gives them a hand? (hint,hint)

◧◩◪
3. cgearh+Fd1[view] [source] 2026-02-03 21:15:33
>>dust42+yg
Any notes on the problems with MLX caching? I’ve experimented with local models on my MacBook and there’s usually a good speedup from MLX, but I wasn’t aware there’s an issue with prompt caching. Is it from MLX itself or LMstudio/mlx-lm/etc?
◧◩◪◨
4. dust42+Hx1[view] [source] 2026-02-03 23:00:30
>>cgearh+Fd1
It is the cache buffer implementation. Consider a conversation [u1 10kTok]->[a1]->[u2]->[a2]. If you branch between the assistant answer a1 and the user prompt u2, MLX reprocesses the u1 prompt of, let's say, 10k tokens, while llama.cpp does not.
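A toy sketch of the difference (not either runtime's actual code): prefix-based cache reuse keeps every token up to the first divergence, while a naive cache only helps when the new prompt exactly extends the old one and otherwise reprocesses everything:

```python
def tokens_to_reprocess(cached, prompt):
    """Prefix reuse: only tokens after the first divergence need PP."""
    common = 0
    for c, p in zip(cached, prompt):
        if c != p:
            break
        common += 1
    return len(prompt) - common

def tokens_naive(cached, prompt):
    """Naive cache: reprocess everything unless the prompt extends the cache."""
    if prompt[:len(cached)] == cached:
        return len(prompt) - len(cached)
    return len(prompt)

u1 = list(range(10_000))    # stand-in for a 10k-token first user prompt
a1 = [90001, 90002]         # assistant reply a1
u2 = [80001, 80002]         # original follow-up u2
u2_branch = [70001, 70002]  # edited follow-up: branch after a1

cache = u1 + a1 + u2                 # KV cache after the original exchange
branch = u1 + a1 + u2_branch         # new prompt after branching

print(tokens_to_reprocess(cache, branch))  # -> 2 (only the edited turn)
print(tokens_naive(cache, branch))         # -> 10004 (the whole prompt)
```

With prefix reuse, branching after a1 costs only the edited turn; without it, the full 10k-token context goes through prompt processing again on every branch.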

I just tested Qwen3-Coder-Next both ways: the GGUF with llama.cpp and the MLX build with LM Studio. As I branch very often, it is highly annoying for me, to the point of being unusable. Q3-30B is much more usable than that on Mac - but by far not as powerful.
