zlacker

[parent] [thread] 7 comments
1. embedd+(OP)[view] [source] 2026-01-27 23:57:25
Yeah, no way I'd do this if I paid per token. The next experiment will probably be local-only, with GPT-OSS-120B, which according to my own benchmarks still seems to be the strongest local model I can run myself. It'll be even cheaper then (as long as we don't count the money it took to acquire the hardware).
replies(1): >>mercut+3n
2. mercut+3n[view] [source] 2026-01-28 02:48:56
>>embedd+(OP)
What toolchain are you going to use with the local model? I agree it's a strong model, but it's so slow for me with large contexts that I've stopped using it for coding.
replies(1): >>embedd+2W
3. embedd+2W[view] [source] [discussion] 2026-01-28 08:36:11
>>mercut+3n
I have my own agent harness, and the inference backend is vLLM.
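
For context, a minimal sketch of what driving GPT-OSS-120B through vLLM's offline Python API looks like. This is not the poster's harness (which isn't shown in the thread), just the basic shape of a vLLM-backed call; the model name is the public Hugging Face checkpoint, and the prompt and settings are illustrative:

    # Minimal sketch: offline vLLM inference against GPT-OSS-120B.
    # Not the poster's agent harness, just the basic call shape.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="openai/gpt-oss-120b",  # public MXFP4 checkpoint on Hugging Face
        max_model_len=32768,          # illustrative context budget
    )
    params = SamplingParams(temperature=0.2, max_tokens=512)
    outputs = llm.generate(["Write a Python function that reverses a string."], params)
    print(outputs[0].outputs[0].text)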
replies(2): >>storys+zb1 >>mercut+wt3
4. storys+zb1[view] [source] [discussion] 2026-01-28 10:32:51
>>embedd+2W
Curious how you handle sharding and KV cache pressure for a 120b model. I guess you are doing tensor parallelism across consumer cards, or is it a unified memory setup?
replies(1): >>embedd+zd1
5. embedd+zd1[view] [source] [discussion] 2026-01-28 10:49:58
>>storys+zb1
I don't; it fits on my card with the full context. I think the native MXFP4 weights take ~70GB of VRAM (out of the 96GB available on the RTX Pro 6000), so I still have room to spare to run GPT-OSS-20B alongside for smaller tasks, plus Wayland+GNOME :)
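
To make the fit concrete, here is a back-of-envelope VRAM budget using the numbers from this comment; the 20B weight size and the desktop allowance are assumptions, not figures from the thread:

    # Rough VRAM budget for one 96GB card running both models plus a desktop.
    total_vram_gb   = 96   # RTX Pro 6000
    weights_120b_gb = 70   # native MXFP4 weights of GPT-OSS-120B, as reported above
    weights_20b_gb  = 12   # approximate MXFP4 size of GPT-OSS-20B (assumption)
    desktop_gb      = 2    # Wayland/GNOME and miscellaneous allocations (assumption)

    kv_headroom_gb = total_vram_gb - weights_120b_gb - weights_20b_gb - desktop_gb
    print(f"Roughly {kv_headroom_gb} GB left for KV cache across both models.")

When co-locating two engines like this, vLLM's gpu_memory_utilization setting is the usual way to cap how much of the card each instance is allowed to claim.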
replies(1): >>storys+rp1
6. storys+rp1[view] [source] [discussion] 2026-01-28 12:24:54
>>embedd+zd1
I thought the RTX 6000 Ada was 48GB? If you have 96GB available that implies a dual setup, so you must be relying on tensor parallelism to shard the model weights across the pair.
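
For reference, the dual-card setup being hypothesized here would normally be expressed through vLLM's tensor parallelism; a sketch of that configuration follows (the reply below clarifies it is actually a single 96GB card, so this only illustrates the assumption):

    # Sketch of the hypothesized dual-card setup: shard the 120B weights
    # across two GPUs with vLLM tensor parallelism. Illustrative only; the
    # next reply says the model actually runs on a single 96GB card.
    from vllm import LLM

    llm = LLM(
        model="openai/gpt-oss-120b",
        tensor_parallel_size=2,  # split weights and attention heads across 2 GPUs
    )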
replies(1): >>embedd+Er1
7. embedd+Er1[view] [source] [discussion] 2026-01-28 12:40:08
>>storys+rp1
RTX Pro 6000 (the Blackwell-generation card, not the RTX 6000 Ada): 96GB of VRAM on a single card.
8. mercut+wt3[view] [source] [discussion] 2026-01-28 22:16:04
>>embedd+2W
Can you tell me more about your agent harness? If it’s open source, I’d love to take it for a spin.

I would happily use local models if I could get them to perform, but they're super slow once I crank the context window up, and I haven't seen good orchestrators that keep the context small enough.
