I'll add on: https://unsloth.ai/docs/models/qwen3-coder-next
The full model is supposedly comparable to Sonnet 4.5. But you can run the 4-bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. The 8-bit quant needs 85GB.
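As a rough sanity check on those numbers (assuming the model is around 80B parameters, which I haven't verified):

    # back-of-envelope: weight memory = params * bits / 8, plus a few GB of
    # runtime overhead (KV cache, buffers); 80B params is my assumption
    params = 80e9
    for bits in (4, 8):
        weights_gb = params * bits / 8 / 1e9
        print(f"{bits}-bit weights: ~{weights_gb:.0f}GB + overhead")
    # 4-bit: ~40GB, 8-bit: ~80GB, roughly consistent with the 46GB / 85GB figures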
Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
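For a back-of-envelope on the weight memory alone (context/KV cache doesn't shrink when you quantize the weights):

    # weight-memory savings from quantizing down to 4 bits per weight
    for src_bits in (32, 16):
        saved = 1 - 4 / src_bits
        print(f"f{src_bits} -> q4: ~{saved:.0%} less weight memory")
    # f32 -> q4: ~88% less, f16 -> q4: ~75% less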
Kimi K2.5 is a trillion-parameter model. You can't run it locally on anything other than extremely well-equipped hardware. Even heavily quantized, you'd still need 512GB of unified memory, and the quantization would impact performance.
Also, the proprietary models a year ago were not that good for anything beyond basic tasks.
Now, as the other replies say, you should very likely run a quantized version anyway.
No one's running Sonnet/Gemini/GPT-5 locally though.
Number of params == “variables” in memory
VRAM footprint ~= number of params * size of a param
A 4B model at 8 bits works out to roughly 4GB of VRAM, give or take; the GB figure matches the param count. At 4 bits it's ~2GB, and so on. Kimi is about 512GB at 4 bits.
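As a tiny calculator version of that rule of thumb (weights only; context and activations come on top):

    def est_weight_gb(params_billion, bits):
        # weight memory only; KV cache and activations are extra
        return params_billion * bits / 8

    print(est_weight_gb(4, 8))     # 4B @ 8-bit  -> 4.0 GB
    print(est_weight_gb(4, 4))     # 4B @ 4-bit  -> 2.0 GB
    print(est_weight_gb(1000, 4))  # ~1T @ 4-bit -> 500.0 GB (Kimi ballpark)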