zlacker

[parent] [thread] 2 comments
1. bigyab+(OP)[view] [source] 2026-02-05 01:33:54
> Are there a lot of options for "how far" you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, VRAM use is dominated by the size of the weights (plus the KV cache for context). Quantizing a model to q4/mxfp4 won't necessarily cut VRAM by 92-95% — from f16 it's roughly a 4x (~75%) reduction, from f32 roughly 8x (~87%) — but for smaller contexts the total footprint gets pretty close to those numbers.

replies(1): >>Muffin+z3
2. Muffin+z3[view] [source] 2026-02-05 02:01:59
>>bigyab+(OP)
Thank you. Could you give a tl;dr rough estimate: "the full model needs ____ this much VRAM, and if you do ____ (the most common quantization method) it will run in ____ this much VRAM"?
replies(1): >>omneit+Nm
3. omneit+Nm[view] [source] [discussion] 2026-02-05 05:07:23
>>Muffin+z3
It’s a back-of-the-envelope calculation (give or take 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits (1 byte per param) takes about 4GB of VRAM for the weights — the GB count matches the param count. At 4 bits it's ~2GB, and so on. Kimi (~1T params) comes out around 512GB at 4 bits.
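The rule of thumb above can be sketched in a few lines of Python (a rough sketch — `vram_gb` is a made-up helper name, and this counts weights only, ignoring context/KV-cache and runtime overhead):

```python
def vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough VRAM for the weights alone: params * bytes per param.

    1e9 params at 1 byte each is ~1 GB, so billions-of-params
    times bytes-per-param gives GB directly.
    """
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param

print(vram_gb(4, 8))     # 4B model at 8 bits  -> 4.0 GB
print(vram_gb(4, 4))     # 4B model at 4 bits  -> 2.0 GB
print(vram_gb(1000, 4))  # ~1T params at 4 bits -> 500.0 GB
```

Real checkpoints land a bit above these numbers (embedding tables, un-quantized layers, KV cache), which is where the +/- 10% comes from.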
