zlacker

[parent] [thread] 11 comments
1. embedd+(OP)[view] [source] 2026-02-03 16:33:40
> I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful

I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.

I'm wondering whether one could crowdsource chat logs from GPT-OSS-120b running with Codex, then use the good runs to seed a post-training pass that fine-tunes the 20b variant, and whether that would make a big difference. Both models with reasoning_effort set to high are actually quite good compared to other downloadable models, but the 120b is just about out of reach for 64GB, so getting the 20b better for specific use cases seems like it'd be useful.
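Very roughly, the data-prep side could be as small as filtering the crowdsourced logs and writing the good runs out as a chat-format JSONL file for fine-tuning. A sketch in Python; the file layout, success flag, and filter criteria are all made up for illustration:

    import json
    from pathlib import Path

    # Hypothetical layout: one JSON file per Codex session with the chat
    # messages plus a "success" flag added by whoever reviewed the run.
    LOG_DIR = Path("codex_logs_120b")       # assumption: crowdsourced 120b runs
    OUT_FILE = Path("finetune_20b.jsonl")   # chat-format dataset for the 20b run

    def is_good(run: dict) -> bool:
        # Keep runs that were marked successful and actually used tools;
        # these criteria are placeholders, not a fixed recipe.
        msgs = run.get("messages", [])
        return bool(run.get("success")) and any(m.get("role") == "tool" for m in msgs)

    with OUT_FILE.open("w") as out:
        for path in sorted(LOG_DIR.glob("*.json")):
            run = json.loads(path.read_text())
            if is_good(run):
                # Standard chat-style SFT record: the 20b model learns to
                # reproduce the 120b model's assistant/tool turns.
                out.write(json.dumps({"messages": run["messages"]}) + "\n")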

replies(3): >>gigate+X3 >>andai+MH >>pocksu+va1
2. gigate+X3[view] [source] 2026-02-03 16:48:18
>>embedd+(OP)
I have a 128GB M3 Max MacBook Pro. Running the GPT-OSS model on it via LM Studio, once the context gets large enough the fans spin up to 100% and it's unbearable.
replies(2): >>pixelp+0h >>embedd+7m
3. pixelp+0h[view] [source] [discussion] 2026-02-03 17:41:37
>>gigate+X3
Laptops are fundamentally a poor form factor for high performance computing.
4. embedd+7m[view] [source] [discussion] 2026-02-03 18:01:00
>>gigate+X3
Yeah, Apple hardware doesn't seem ideal for large LLMs. Give it a go with a dedicated GPU if you're inclined and you'll see a big difference :)
replies(2): >>polite+8Y >>marci+JB2
5. andai+MH[view] [source] 2026-02-03 19:23:25
>>embedd+(OP)
Are you running 120B agentic? I tried using it in a few different setups and it failed hard in every one. It would just give up after a second or two every time.

I wonder if it has to do with the message format, since it should be able to do tool use afaict.

replies(1): >>nekita+1a2
6. polite+8Y[view] [source] [discussion] 2026-02-03 20:37:14
>>embedd+7m
What are some good GPUs to look for if you're getting started?
replies(1): >>wincy+7g2
7. pocksu+va1[view] [source] 2026-02-03 21:40:14
>>embedd+(OP)
You're describing distillation. There are better ways to do it, and it's been done before: DeepSeek distilled onto Qwen.
8. nekita+1a2[view] [source] [discussion] 2026-02-04 04:35:41
>>andai+MH
This is a common problem for people trying to run the GPT-oss models themselves. Reposting my comment here:

GPT-OSS-120B was also completely failing for me until someone on Reddit pointed out that you need to pass the reasoning tokens back in when generating a response. One way to do this is described here:

https://openrouter.ai/docs/guides/best-practices/reasoning-t...

Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.

Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there are so many broken implementations floating around.
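For anyone wiring this up by hand, a minimal sketch of the idea against a local OpenAI-compatible server. The name of the reasoning field varies by backend ("reasoning_content" below is an assumption), so check what yours actually returns:

    import requests

    # Local OpenAI-compatible endpoint (llama.cpp's llama-server, LM Studio, etc.)
    URL = "http://localhost:8080/v1/chat/completions"

    messages = [{"role": "user", "content": "List the files in the repo."}]

    resp = requests.post(URL, json={"model": "gpt-oss-120b", "messages": messages})
    msg = resp.json()["choices"][0]["message"]

    # The crucial step: keep the reasoning attached to the assistant turn so it
    # is sent back to the model on the next request instead of being dropped.
    assistant_turn = {"role": "assistant", "content": msg.get("content", "")}
    if msg.get("reasoning_content"):
        assistant_turn["reasoning_content"] = msg["reasoning_content"]
    messages.append(assistant_turn)

    # ...then append the tool result / next user turn and POST the full
    # messages list (reasoning included) again.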

replies(1): >>andai+0D5
9. wincy+7g2[view] [source] [discussion] 2026-02-04 05:36:43
>>polite+8Y
If you want to actually run models on a computer at home? The RTX 6000 Blackwell Pro Workstation, hands down. 96GB of VRAM, and it fits into a standard case (I mean, it's big, but it's essentially the same form factor as an RTX 5090, just with much denser VRAM).

My RTX 5090 can fit OSS-20B, but it's a bit underwhelming; at $3000, if I didn't also use it for gaming, I'd have been pretty disappointed.

replies(1): >>gigate+nt4
10. marci+JB2[view] [source] [discussion] 2026-02-04 08:45:34
>>embedd+7m
Their issue with the Mac was the sound of the fans spinning. I doubt a dedicated GPU will resolve that.
11. gigate+nt4[view] [source] [discussion] 2026-02-04 19:23:08
>>wincy+7g2
At anywhere from 9-12k euros [1], I'd be better off paying 200 a month for the super-duper lots-of-tokens tier (2400 a year) and getting model improvements, token improvements, etc. for "free" than buying such a card, which would be obsolete on purchase since newer, better cards are always coming out.

[1] https://www.idealo.de/preisvergleich/OffersOfProduct/2063285...

12. andai+0D5[view] [source] [discussion] 2026-02-05 02:19:33
>>nekita+1a2
I used it with OpenAI's Codex, which had official support for it, and it was still ass. (Maybe they screwed up this part too? Haha)