zlacker

1. cgearh+(OP) 2026-02-03 21:15:33
Any notes on the problems with MLX caching? I’ve experimented with local models on my MacBook and there’s usually a good speedup from MLX, but I wasn’t aware there was an issue with prompt caching. Is it in MLX itself, or in LM Studio/mlx-lm/etc.?
replies(2): >>dust42+2k >>anon37+0u
2. dust42+2k 2026-02-03 23:00:30
>>cgearh+(OP)
It is the buffer implementation. Take a conversation [u1 10k tokens]->[a1]->[u2]->[a2]. If you branch between assistant1's answer and user2's message, MLX reprocesses the u1 prompt of, say, 10k tokens, while llama.cpp does not.
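Roughly, the difference looks like this (a toy Python sketch of the two caching behaviours, not the actual mlx-lm or llama.cpp code; all the names here are made up):

    # Hypothetical sketch: prefix-reuse cache vs. append-only cache.

    def common_prefix_len(a: list[int], b: list[int]) -> int:
        """Number of leading tokens shared by two token sequences."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    class PrefixReuseCache:
        """llama.cpp-style behaviour: keep the KV cache for the longest
        shared prefix and only process the tokens after the branch point."""
        def __init__(self):
            self.cached_tokens: list[int] = []

        def tokens_to_process(self, prompt: list[int]) -> list[int]:
            keep = common_prefix_len(self.cached_tokens, prompt)
            self.cached_tokens = list(prompt)
            return prompt[keep:]        # only the new suffix is evaluated

    class AppendOnlyCache:
        """Behaviour I see with the MLX buffer: the cache is only reused if
        the new prompt extends the previous one, so branching before the end
        throws everything away."""
        def __init__(self):
            self.cached_tokens: list[int] = []

        def tokens_to_process(self, prompt: list[int]) -> list[int]:
            if prompt[:len(self.cached_tokens)] == self.cached_tokens:
                new = prompt[len(self.cached_tokens):]
            else:
                new = list(prompt)      # branch -> full reprocess of u1 etc.
            self.cached_tokens = list(prompt)
            return new

    if __name__ == "__main__":
        u1 = list(range(10_000))        # stand-in for a 10k-token first prompt
        a1, u2 = [1, 2, 3], [4, 5, 6]
        branch = u1 + a1 + [7, 8, 9]    # branch right after a1

        for cache in (PrefixReuseCache(), AppendOnlyCache()):
            cache.tokens_to_process(u1 + a1 + u2)      # first request
            redo = cache.tokens_to_process(branch)     # branched request
            print(type(cache).__name__, "reprocesses", len(redo), "tokens")

The first strategy only re-evaluates the few branched tokens; the append-only one redoes the whole ~10k-token prefix.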

I just tested Qwen3-Coder-Next both as GGUF with llama.cpp and as MLX with LMStudio. Since I branch very often, it is highly annoying for me, to the point of being unusable. Q3-30B is then much more usable on Mac, but by far not as powerful.

3. anon37+0u 2026-02-03 23:56:56
>>cgearh+(OP)
There’s this issue/outstanding PR: https://github.com/lmstudio-ai/mlx-engine/pull/188#issuecomm...