Wildly understating this part.
Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst, these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.
Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
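Rough back-of-the-envelope (Python), assuming weights dominate and ignoring KV cache and runtime overhead, since VRAM for weights is roughly parameter count times bits per weight. The function and the 70B example are just illustrative:

    # Weights-only VRAM estimate; ignores KV cache, activations, and runtime overhead.
    def estimate_weight_vram_gib(params_billions: float, bits_per_weight: float) -> float:
        bytes_per_weight = bits_per_weight / 8
        return params_billions * 1e9 * bytes_per_weight / 1024**3

    for label, bits in [("f16", 16), ("q8", 8), ("q4", 4)]:
        print(f"70B @ {label}: ~{estimate_weight_vram_gib(70, bits):.0f} GiB")
    # 70B @ f16: ~130 GiB, q8: ~65 GiB, q4: ~33 GiB (weights only)

So dropping f16 to q4 cuts the weights by roughly 4x; the context (KV cache) is what grows on top of that as you push longer prompts.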