Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
Now, as the other replies say, you should very likely run a quantized version anyway.
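To make that concrete, here's a rough weight-only sketch (the 7B size and bit widths are my own illustrative assumptions, not from the parent comment), ignoring KV cache and activations:

```python
# Rough weight-only VRAM and savings from quantization (ignores KV cache / activations).
# The 7B parameter count and bit widths are illustrative assumptions.
params = 7e9
bits = {"f32": 32, "f16": 16, "q8": 8, "q4": 4}

f32_gb = params * bits["f32"] / 8 / 1e9
for name, b in bits.items():
    gb = params * b / 8 / 1e9
    saving = (1 - gb / f32_gb) * 100
    print(f"{name}: {gb:.1f} GB of weights, {saving:.1f}% less than f32")
# q4 is 87.5% smaller than f32 and 75% smaller than f16.
# Context (KV cache) adds on top of this, so real usage is higher.
```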
Number of params == “variables” in memory
VRAM footprint ~= number of params * size of a param
A 4B model at 8 bits comes out to roughly 4GB of VRAM, give or take (one byte per param, so the number matches the param count). At 4 bits it's ~2GB, and so on. Kimi (roughly 1T params) is about 512GB at 4 bits.
https://buildai.substack.com/i/181542049/the-mac-mini-moment
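A quick sketch of that rule of thumb (purely illustrative; the ~1T param count for Kimi is my assumption):

```python
# Weight-only VRAM estimate: number of params * bytes per param.
# Model sizes below are assumptions for illustration; real usage also
# includes KV cache, activations, and runtime overhead.
def weight_vram_gb(num_params: float, bits_per_param: float) -> float:
    return num_params * bits_per_param / 8 / 1e9

print(weight_vram_gb(4e9, 8))   # 4B model at 8 bits -> ~4 GB
print(weight_vram_gb(4e9, 4))   # 4B model at 4 bits -> ~2 GB
print(weight_vram_gb(1e12, 4))  # ~1T-param model at 4 bits -> ~500 GB
```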