I'll add on: https://unsloth.ai/docs/models/qwen3-coder-next
The full model is supposedly comparable to Sonnet 4.5. But you can run the 4-bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. The 8-bit quant needs 85GB.
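As a rough sanity check on those numbers (assuming the model is around 80B parameters, which I haven't verified):

    # back-of-envelope: weight memory = params * bits / 8, plus a few GB of
    # runtime overhead (KV cache, buffers); 80B params is my assumption
    params = 80e9
    for bits in (4, 8):
        weights_gb = params * bits / 8 / 1e9
        print(f"{bits}-bit weights: ~{weights_gb:.0f}GB + overhead")
    # 4-bit: ~40GB, 8-bit: ~80GB, roughly consistent with the 46GB / 85GB figures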
Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
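For a back-of-envelope on the weight memory alone (context/KV cache doesn't shrink when you quantize the weights):

    # weight-memory savings from quantizing down to 4 bits per weight
    for src_bits in (32, 16):
        saved = 1 - 4 / src_bits
        print(f"f{src_bits} -> q4: ~{saved:.0%} less weight memory")
    # f32 -> q4: ~88% less, f16 -> q4: ~75% less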
Kimi K2.5 is a trillion-parameter model. You can't run it locally on anything other than extremely well-equipped hardware. Even heavily quantized, you'd still need 512GB of unified memory, and the quantization would impact performance.
Also, the proprietary models a year ago were not that good for anything beyond basic tasks.
Now, as the other replies say, you should very likely run a quantized version anyway.
No one's running Sonnet/Gemini/GPT-5 locally though.
Number of params == “variables” in memory
VRAM footprint ~= number of params * size of a param
A 4B model at 8 bits works out to roughly 4GB of VRAM, give or take; the GB figure matches the param count. At 4 bits it's ~2GB, and so on. Kimi is about 512GB at 4 bits.
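As a tiny calculator version of that rule of thumb (weights only; context and activations come on top):

    def est_weight_gb(params_billion, bits):
        # weight memory only; KV cache and activations are extra
        return params_billion * bits / 8

    print(est_weight_gb(4, 8))     # 4B @ 8-bit  -> 4.0 GB
    print(est_weight_gb(4, 4))     # 4B @ 4-bit  -> 2.0 GB
    print(est_weight_gb(1000, 4))  # ~1T @ 4-bit -> 500.0 GB (Kimi ballpark)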