> Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that...
Hmmm
I don't know what's so special about this paper.
- They claim to use MLA to reduce KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and also V3 and Deepseek R1 etc)
- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2 and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.
- They claimed to save a lot of money not doing a full pre-train run for millions of dollars. Well, so did Deepseek V3.2... Deepseek hasn't done a full $5.6mil full pretraining run since Deepseek V3 in 2024. Deepseek R1 is just a $294k post train on top of the expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run - i don't know the exact price, but it's probably just a few hundred thousand dollars as well.
Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run for gpt-4o in 2024. That's why they all have the same information cutoff date.
I don't really see anything new or interesting in this paper that isn't already something Deepseek V3.2 has already sort of done (just on a bigger scale). Not exactly the same, but is there anything amazingly new that's not in Deepseek V3.2?
From Zebra-Llama's arXiv page: Submitted on 22 May 2025
Don't forget the billion dollars or so of GPU's they had access to that they left out of that accounting. Also, the R&D cost of the Meta model they originally used. Then, they added $5.6 million on top of that.