zlacker

[return to "Claude Code: connect to a local model when your quota runs out"]
1. paxys+c7c[view] [source] 2026-02-04 21:59:44
>>fugu2+(OP)
> Reduce your expectations about speed and performance!

Wildly understating this part.

Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst, these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.

◧◩
2. zozbot+i8c[view] [source] 2026-02-04 22:05:11
>>paxys+c7c
The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago. That's not "nothing" and is plenty good enough for everyday work.
◧◩◪
3. reilly+A9c[view] [source] 2026-02-04 22:11:11
>>zozbot+i8c
Which takes a $20k Thunderbolt cluster of two Mac Studio Ultras with 512GB RAM each to run at full quality…
◧◩◪◨
4. 0xbadc+yyc[view] [source] 2026-02-05 00:46:00
>>reilly+A9c
Most benchmarks show very little difference between "full quality" and a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and keep 92-95% of the performance, with far less VRAM use.
◧◩◪◨⬒
5. Muffin+3Dc[view] [source] 2026-02-05 01:23:17
>>0xbadc+yyc
> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you are speaking of?

◧◩◪◨⬒⬓
6. bigyab+kEc[view] [source] 2026-02-05 01:33:54
>>Muffin+3Dc
> Are there a lot of options for "how far" you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model down to q4/mxfp4 won't necessarily cut VRAM use by a full 92-95%, but for smaller contexts it gets pretty close.
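
Back-of-the-envelope, something like this (the model shape below is hypothetical, picked just to make the arithmetic concrete, not any particular model):

    # Rough VRAM estimate. Quantizing shrinks the weight term, but the
    # KV cache (which grows with context length) usually stays at 16-bit,
    # so total savings only approach the weight-only savings when the
    # context is small.

    def weight_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
        # 2x for keys + values, one entry per layer / KV head / position
        return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    # Hypothetical 70B dense model: 80 layers, 8 KV heads, head_dim 128
    kv = kv_cache_gb(80, 8, 128, context_len=8192)
    for bits in (32, 16, 4):
        w = weight_gb(70, bits)
        print(f"{bits:>2}-bit weights: ~{w:.0f} GB weights + ~{kv:.1f} GB KV cache at 8k context")

So f32 -> q4 cuts the weight term by ~87.5% and f16 -> q4 by 75%; it's the fixed-size KV cache that erodes those savings as the context grows.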

◧◩◪◨⬒⬓⬔
7. Muffin+THc[view] [source] 2026-02-05 02:01:59
>>bigyab+kEc
Thank you. Could you give a tl;dr rough estimate along the lines of "the full model needs ____ this much VRAM, and if you use ____, the most common quantization method, it will run in ____ this much VRAM" please?
◧◩◪◨⬒⬓⬔⧯
8. omneit+71d[view] [source] 2026-02-05 05:07:23
>>Muffin+THc
It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits takes roughly 4GB of VRAM, give or take, the same number as the param count. At 4 bits it's ~2GB, and so on. Kimi is about 512GB at 4 bits.
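
That rule of thumb in a couple of lines of Python, if it helps (the ~1T parameter count for Kimi is my assumption, ballparked back from the 512GB figure; real usage adds KV cache and runtime overhead on top):

    # VRAM for weights ~= number of params * bytes per param.

    def weight_vram_gb(num_params, bits_per_param):
        return num_params * bits_per_param / 8 / 1e9

    print(weight_vram_gb(4e9, 8))    # 4B model at 8-bit   -> ~4 GB
    print(weight_vram_gb(4e9, 4))    # 4B model at 4-bit   -> ~2 GB
    print(weight_vram_gb(1e12, 4))   # ~1T params at 4-bit -> ~500 GB (Kimi-scale)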

[go to top]