zlacker

[parent] [thread] 1 comment
1. ttoino+(OP)[view] [source] 2026-02-03 17:13:29
I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tps using LM Studio. What did you try? What benefit do you get from KV caching?
replies(1): >>dust42+k4
2. dust42+k4[view] [source] 2026-02-03 17:31:53
>>ttoino+(OP)
KV caching means that when you have a 10k-token prompt, all follow-up questions return immediately - this is standard with all inference engines.

Now if you are not happy with the last answer, you may want to simply regenerate it or change your last question - this is branching the conversation. Llama.cpp is capable of re-using the KV cache up to the branch point, while MLX does not (I am using the MLX server from the MLX community project). I haven't tried it with LM Studio. Maybe worth a try, thanks for the heads-up.
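The branching case above comes down to prefix matching: when the user edits or regenerates a turn, the engine can keep the keys/values for the shared prefix and only run the forward pass on the changed tail. A minimal sketch of that bookkeeping (all names are illustrative, not llama.cpp's or MLX's actual API):

```python
# Hypothetical sketch of KV-cache prefix reuse on conversation branching.
# "Processing" a token here stands in for computing its keys/values.

def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached and new sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

class KVCache:
    def __init__(self):
        self.tokens = []  # tokens whose keys/values are already cached

    def tokens_to_process(self, prompt):
        """Return (suffix needing a forward pass, count of reused tokens)."""
        reused = common_prefix_len(self.tokens, prompt)
        self.tokens = list(prompt)  # cache now covers the new prompt
        return prompt[reused:], reused

cache = KVCache()
# First turn: the full prompt must be processed.
suffix, reused = cache.tokens_to_process([1, 2, 3, 4, 5])
assert (suffix, reused) == ([1, 2, 3, 4, 5], 0)
# Branch: the user edits the last question; only the changed tail is recomputed.
suffix, reused = cache.tokens_to_process([1, 2, 3, 9, 10, 11])
assert (suffix, reused) == ([9, 10, 11], 3)
```

An engine without this reuse has to re-process the entire edited prompt from token zero, which is why branching feels slow there even though plain follow-ups are fast.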
