I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality/speed tradeoff you're willing to accept on your hardware.
The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
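To make that concrete, here's a rough sketch of pulling a quant straight from Hugging Face with llama-server's `-hf` flag (the repo name and quant tag below are illustrative placeholders, not a specific recommendation -- substitute whichever Unsloth repo and quant you settle on):

```shell
# Download (if not cached) and serve a GGUF quant from Hugging Face.
# Repo and :quant tag are placeholders -- swap in the one you want.
llama-server \
  -hf unsloth/SomeModel-GGUF:Q4_K_M \
  --ctx-size 16384 \
  --port 8080
# Then point any OpenAI-compatible client at http://localhost:8080/v1
```

Start small, check tokens/sec and output quality, then step up a size and repeat.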
One more thing, that guide says:
> You can choose UD-Q4_K_XL or other quantized versions.
I see eight different 4-bit quants (I assume 4-bit is the size I want?). How do I pick which one to use?
IQ4_XS, Q4_K_S, Q4_1, IQ4_NL, MXFP4_MOE, Q4_0, Q4_K_M, Q4_K_XL