zlacker

What toolchain are you going to use with the local model? I agree that’s a strong model, but it’s so slow for me with large contexts that I’ve stopped using it for coding.

replies(1): >>embedd+Zy

>>mercut+(OP)
I have my own agent harness, and the inference backend is vLLM.

replies(2): >>storys+wO >>mercut+t63

>>embedd+Zy
Curious how you handle sharding and KV cache pressure for a 120b model. I guess you are doing tensor parallelism across consumer cards, or is it a unified memory setup?

replies(1): >>embedd+wQ

>>storys+wO
I don't. It fits on my card with the full context. I think the native MXFP4 weights take ~70GB of VRAM (out of 96GB available, RTX Pro 6000), so I still have room to spare to run GPT-OSS-20B alongside for smaller tasks too, and Wayland+Gnome :)
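
As a rough back-of-the-envelope check of that headroom, here is a minimal sketch in Python. It uses only the two figures quoted above (~70GB for the 120B model, 96GB total); the function name `vram_headroom_gb` is made up for illustration, and the footprints of the 20B model and the desktop session are not stated here, so they are left out rather than guessed.

```python
def vram_headroom_gb(total_gb: float, *footprints_gb: float) -> float:
    """Return the VRAM (in GB) left after subtracting each resident footprint."""
    return total_gb - sum(footprints_gb)


TOTAL_GB = 96.0         # single RTX Pro 6000, as stated in the comment
GPT_OSS_120B_GB = 70.0  # approximate figure quoted in the comment

# Whatever remains has to cover the second model, its KV cache, and the desktop.
headroom = vram_headroom_gb(TOTAL_GB, GPT_OSS_120B_GB)
print(f"~{headroom:.0f} GB left for everything else")
```

Additional footprints can be passed as extra arguments once you have measured them on your own setup.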

replies(1): >>storys+o21

>>embedd+wQ
I thought the RTX 6000 Ada was 48GB? If you have 96GB available that implies a dual setup, so you must be relying on tensor parallelism to shard the model weights across the pair.

replies(1): >>embedd+B41

>>storys+o21
RTX Pro 6000 - 96GB VRAM - Single card

>>embedd+Zy
Can you tell me more about your agent harness? If it’s open source, I’d love to take it for a spin.

I would happily use local models if I could get them to perform, but they’re super slow if I bump their context window high, and I haven’t seen good orchestrators that keep context limited enough.