zlacker

[parent] [thread] 4 comments
1. 0xbadc+(OP)[view] [source] 2026-02-05 00:46:00
Most benchmarks show very little improvement from the "full quality" model over a quantized lower-bit one. You can shrink the model to a fraction of its "full" size and keep 92-95% of the performance, with less VRAM use.
replies(1): >>Muffin+v4
2. Muffin+v4[view] [source] 2026-02-05 01:23:17
>>0xbadc+(OP)
> You can shrink the model to a fraction of its "full" size and keep 92-95% of the performance, with less VRAM use.

Are there a lot of options for how far you quantize? How much VRAM does it take to get the 92-95% you are speaking of?

replies(1): >>bigyab+M5
3. bigyab+M5[view] [source] [discussion] 2026-02-05 01:33:54
>>Muffin+v4
> Are there a lot of options for how far you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
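
A rough sketch of that arithmetic, assuming the usual sizes (f32 = 4 bytes, f16 = 2 bytes, 4-bit formats ~= 0.5 bytes per weight) and ignoring the KV cache / context:

    # Fraction of weight VRAM saved when quantizing (weights only, no KV cache).
    BYTES_PER_PARAM = {"f32": 4.0, "f16": 2.0, "q4/mxfp4": 0.5}

    def savings(src, dst):
        return 1 - BYTES_PER_PARAM[dst] / BYTES_PER_PARAM[src]

    print(f"f32 -> q4/mxfp4: {savings('f32', 'q4/mxfp4'):.0%} less weight VRAM")  # ~88%
    print(f"f16 -> q4/mxfp4: {savings('f16', 'q4/mxfp4'):.0%} less weight VRAM")  # ~75%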

replies(1): >>Muffin+l9
4. Muffin+l9[view] [source] [discussion] 2026-02-05 02:01:59
>>bigyab+M5
Thank you. Could you give a tl;dr rough estimate along the lines of "the full model needs ____ this much VRAM, and if you use ____, the most common quantization method, it will run in ____ this much VRAM"?
replies(1): >>omneit+zs
5. omneit+zs[view] [source] [discussion] 2026-02-05 05:07:23
>>Muffin+l9
It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits works out to about 4GB of VRAM give or take (at 1 byte per param, the GB figure roughly matches the param count in billions). At 4 bits it's ~2GB, and so on. Kimi is about 512GB at 4 bits.
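
A minimal sketch of that estimate, assuming Kimi at roughly 1T params (which gives the ~512GB ballpark above); weights only, no KV cache or runtime overhead, so treat it as +/- 10%:

    # Weight-only VRAM estimate: number of params * bytes per param.
    def weights_vram_gb(params_billion, bits_per_param):
        return params_billion * (bits_per_param / 8)  # 1B params * 1 byte ~= 1 GB

    print(weights_vram_gb(4, 8))     # 4B model, 8-bit   -> ~4 GB
    print(weights_vram_gb(4, 4))     # 4B model, 4-bit   -> ~2 GB
    print(weights_vram_gb(1000, 4))  # ~1T params, 4-bit -> ~500 GB (Kimi ballpark)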
