zlacker

[return to "Qwen3-Coder-Next"]
1. simonw+TH 2026-02-03 18:51:23
>>daniel+(OP)
I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this:

  brew upgrade llama.cpp # or brew install if you don't have it yet
Then:

  llama-cli \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja
That opened an interactive CLI. For a web UI on port 8080, along with an OpenAI-compatible chat completions endpoint, run this instead:

  llama-server \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja
It's using about 28GB of RAM.
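That also gives you the standard llama-server HTTP API, so you can hit the chat completions endpoint directly. A quick way to check it (the prompt is just a placeholder, and you can omit the "model" field because llama-server answers with whichever model it has loaded):

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
      ]
    }'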
2. techno+mb1 2026-02-03 21:03:08
>>simonw+TH
what are your impressions?
3. simonw+wy1 2026-02-03 23:05:28
>>techno+mb1
I got Codex CLI running against it and was sadly very unimpressed - it got stuck in a loop running "ls" for some reason when I asked it to create a new file.
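(For anyone trying to reproduce this: wiring Codex CLI up to the local server takes a custom provider entry in ~/.codex/config.toml, roughly like the sketch below. The key names are from memory and may differ between Codex releases, and the model name is just a placeholder, so check the Codex config docs before copying it.)

  # ~/.codex/config.toml (sketch; key names may vary by Codex release)
  model = "qwen3-coder-next"      # placeholder; llama-server ignores the name
  model_provider = "llama-server"

  [model_providers.llama-server]
  name = "llama.cpp server"
  base_url = "http://localhost:8080/v1"
  wire_api = "chat"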
4. mhitza+Ly3 2026-02-04 14:42:54
>>simonw+wy1
I would recommend fiddling with the repeat penalty flags. I use local models often, and almost every one I've tried has needed them to prevent loops.

I'd also recommend dropping the temperature to 0. Any high temperature value feels like instructing the model to "copy this homework from me, but don't make it obvious".
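For example, something along these lines as a starting point (the values are generic defaults I'd try first, not tuned for this model):

  llama-server \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --jinja \
    --temp 0 \
    --repeat-penalty 1.05 \
    --repeat-last-n 256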
