zlacker

[parent] [thread] 3 comments
1. jychan+(OP)[view] [source] 2025-12-06 23:04:35
The catch that you're missing is that Deepseek did this ages ago.

They're just using MLA, which is well known to reduce KV cache size by roughly 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.
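
Back-of-the-envelope, for anyone who wants the arithmetic behind that ~90% figure. The config values below are DeepSeek-V3-style and quoted from memory, so treat them as illustrative; the exact reduction depends on what baseline you compare against.

    # Per-token, per-layer KV cache in element counts (not bytes).
    # Config values are DeepSeek-V3-style, used here only for illustration.
    n_heads      = 128   # attention heads
    head_dim     = 128   # per-head dimension
    kv_lora_rank = 512   # MLA compressed KV latent width
    rope_dim     = 64    # decoupled RoPE key cached alongside the latent

    mha_kv = 2 * n_heads * head_dim   # vanilla MHA: full K and V for every head
    mla_kv = kv_lora_rank + rope_dim  # MLA: one shared latent + one RoPE key

    print(mha_kv)                         # 32768 elements/token/layer
    print(mla_kv)                         # 576 elements/token/layer
    print(f"{1 - mla_kv / mha_kv:.1%}")   # ~98% vs vanilla MHA; less vs a GQA baseline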

Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.

replies(2): >>ericho+I >>storus+j1
2. ericho+I[view] [source] 2025-12-06 23:10:57
>>jychan+(OP)
Kimi K2 also uses MLA, and Kimi Linear runs Kimi Delta Attention (it's SSM-like) for three out of every four layers (the fourth uses MLA).
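
If it helps, the "hybrid" wiring is just an interleave at the layer-stack level: every fourth block is full (MLA) attention, the rest are the linear/delta layers. Toy sketch below; the class names are placeholders, not the actual Kimi Linear or DeepSeek module names.

    # Toy 3:1 hybrid stack: three linear-attention blocks for every one
    # full-attention (MLA) block. Class names are placeholders only.
    import torch.nn as nn

    class LinearAttnBlock(nn.Module):   # stand-in for a delta/SSM-style layer
        def forward(self, x):
            return x

    class MLABlock(nn.Module):          # stand-in for a full latent-attention layer
        def forward(self, x):
            return x

    def build_hybrid_stack(n_layers, full_attn_every=4):
        """Every `full_attn_every`-th block is full attention; the rest are linear."""
        return nn.ModuleList(
            MLABlock() if (i + 1) % full_attn_every == 0 else LinearAttnBlock()
            for i in range(n_layers)
        )

    print([type(b).__name__ for b in build_hybrid_stack(8)])
    # ['LinearAttnBlock', 'LinearAttnBlock', 'LinearAttnBlock', 'MLABlock', ...]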
replies(1): >>jychan+a2
3. storus+j1[view] [source] 2025-12-06 23:14:54
>>jychan+(OP)
Linear attention is really bad. It's only good for benchmaxing, and it leads to a loss of valuable granularity, which you can feel in the latest DeepSeek randomly forgetting, ignoring, or "correcting" facts stated explicitly in the prompt.
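
There's a concrete mechanical reason for that loss of granularity: softmax attention keeps one KV entry per token, while linear attention folds the whole history into a fixed-size state, so individual facts get smeared together as context grows. Minimal illustration of generic (unnormalized) linear attention below; this is not DeepSeek's or Kimi's actual kernel.

    # Generic (unnormalized) linear attention: the recurrent state S stays
    # d x d no matter how many tokens have been seen, while a softmax KV
    # cache would grow by one (k, v) pair per token.
    import numpy as np

    d = 64
    S = np.zeros((d, d))                  # fixed-size state

    rng = np.random.default_rng(0)
    for _ in range(10_000):               # 10k tokens later...
        k, v, q = rng.normal(size=(3, d))
        S += np.outer(k, v)               # fold this token into the state
        out = q @ S                       # read out with the query

    print(S.shape)                        # (64, 64) -- unchanged after 10k tokens
    # Exact recall of any single earlier token has to survive inside those
    # 64*64 numbers; a softmax cache keeps all 10,000 (k, v) pairs instead.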
4. jychan+a2[view] [source] [discussion] 2025-12-06 23:21:11
>>ericho+I
Kimi K2 is literally a "copy Deepseek's homework" model. Seriously. It even has exactly 61 layers, the same as Deepseek V3/R1.