That being said, I do very much believe that the computational efficiency of models is going to go up [correction] drastically over the coming months, which raises interesting questions about Nvidia's throne
*previously miswrote and said computational efficiency will go down
If computational efficiency goes up (thanks for the correction) and CPU inference becomes viable for most practical applications, dedicated GPUs (or accelerators) may become unnecessary for most workloads
We already see models becoming more and more capable per weight and per unit of compute. I don't expect a step-change breakthrough; I expect more of the same. A SOTA 30B model from 2026 is going to be ~30% better than one from 2025.
Now, expecting that to hurt Nvidia? Delusional.
No one is going to stop and say "oh wow, we got more inference efficiency - now we're going to use less compute". A lot of people are going to say "now we can use larger and more powerful models for the same price" or "with cheaper inference for the same quality, we can afford to use more inference".
Right now, Claude is good enough. If LLM development hit a magical wall and never got any better, Claude would still be good enough to be terrifically useful, and there are diminishing returns on how much additional good we get out of it being at $benchmark.
Say we're satisfied with that... well, how many years until efficiency gains from one side and consumer hardware from the other meet in the middle, so that "good enough for everybody" open models are available to anyone willing to pay for a $4,000 MacBook (and after another couple of years, a $1,000 MacBook, and several more, a fancy wristwatch)?
Point being, unless we get to a point where we start developing "models" that deserve civil rights and citizenship, the days are numbered for NEEDING cloud infrastructure and datacenters full of racks and racks of $x0,000 hardware.
I strongly believe the top of the S-curve is nigh, and with it these trillion-dollar ambitions are going to crumble. Everybody is going to want a big-ass GPU and a ton of RAM, but that's going to quickly become boring: open models will exist that eat everybody's lunch, and the trillion-dollar companies trying to beat them with a premium product won't stack up outside of niche cases and much more ordinary cloud-compute motivations.
People said that "good enough" about GPT-4. Now you say that about Claude Opus 4.5. How long before the treadmill turns, and the very same Opus 4.5 becomes "the bare minimum" - the least capable AI you would actually consider using for simple and unimportant tasks?
We have miles and miles of AI advancements ahead of us. The end of that road isn't "good enough". It's "too powerful to be survivable".
> Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that...
Hmmm
I don't know what's so special about this paper.
- They claim to use MLA to reduce the KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and kept it for V3, Deepseek R1, etc.)
- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2, and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.
- They claim to save a lot of money by not doing a full multi-million-dollar pre-training run. Well, so did Deepseek V3.2... Deepseek hasn't done a full $5.6M pretraining run since Deepseek V3 in 2024. Deepseek R1 is just a $294K post-train on top of the expensive V3 pretrain run, and Deepseek V3.2 is just a hybrid-linear-attention post-train run. I don't know the exact price, but it's probably just a few hundred thousand dollars as well.
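For a sense of where that ~90% KV-cache number comes from, here's a back-of-envelope sketch. The dimensions are illustrative, loosely modeled on Deepseek V2's published shapes (latent dim 512, decoupled RoPE dim 64), not anything from Zebra-Llama itself:

```python
# Rough KV-cache comparison: standard multi-head attention vs.
# MLA-style latent compression. Numbers are illustrative only.

BYTES = 2  # fp16/bf16

def mha_kv_bytes_per_token(n_layers, n_heads, head_dim, bytes_per=BYTES):
    # Standard attention caches a full key AND value vector per head.
    return n_layers * 2 * n_heads * head_dim * bytes_per

def mla_kv_bytes_per_token(n_layers, d_latent, d_rope, bytes_per=BYTES):
    # MLA caches one shared compressed latent per layer plus a small
    # decoupled RoPE key; full K/V are re-projected from the latent.
    return n_layers * (d_latent + d_rope) * bytes_per

# Deepseek-V2-like shapes: 60 layers, 128 heads of dim 128,
# latent dim 512, decoupled RoPE key dim 64.
mha = mha_kv_bytes_per_token(60, 128, 128)
mla = mla_kv_bytes_per_token(60, 512, 64)
print(f"MHA: {mha/1e6:.2f} MB/token")   # ~3.93 MB/token
print(f"MLA: {mla/1e6:.2f} MB/token")   # ~0.07 MB/token
print(f"cache reduction: {1 - mla/mha:.1%}")
```

With these toy numbers the reduction lands north of 90%, which is why long-context serving gets so much cheaper; the real figure depends on the baseline (MHA vs. GQA) and the exact dims.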
Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run that produced gpt-4o in 2024. That's why they all have the same knowledge cutoff date.
I don't really see anything new or interesting in this paper that Deepseek V3.2 hasn't already sort of done (just at a bigger scale). Not exactly the same, but is there anything amazingly new here that's not in Deepseek V3.2?
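The reason all these labs converge on linear attention is the same: dropping the softmax lets you replace the quadratic attention matrix with a fixed-size running state. A toy sketch of the core identity (no feature maps or normalization tricks, which real models do add):

```python
# Toy demo: causal "linear attention" computed two ways.
# Parallel form is O(T^2); recurrent form is O(T) with a d*d state.
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel (quadratic) form: full T x T score matrix, causal mask.
scores = q @ k.T                      # (T, T)
mask = np.tril(np.ones((T, T)))       # causal mask
out_parallel = (scores * mask) @ v    # (T, d)

# Recurrent (linear) form: accumulate sum of k_t (outer) v_t and
# query the constant-size state instead of the growing KV history.
state = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    state += np.outer(k[t], v[t])
    out_recurrent[t] = q[t] @ state

assert np.allclose(out_parallel, out_recurrent)
```

Because inference cost per token is constant instead of growing with context length, a hybrid model can keep a few softmax layers for quality and make the rest linear, which is the pattern all the models listed above share.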
>Good enough? There's no such thing.
This is just wrong. Maybe you can't imagine good enough; I can. And I think "better" is going to start hitting diminishing returns as the velocity of improvements slows and each improvement becomes less meaningful. The "cost" of an LLM making mistakes is already pretty low; cutting it in half is better, sure, but it's so low already that I don't particularly care if mistakes get some multiple rarer.
From Zebra-Llama's arXiv page: Submitted on 22 May 2025
But the end state in my mind is telling an AI "build me XYZ", having it ask all the important questions over the course of a 30-minute chat while making reasonable decisions on all lower-level issues, then waking up the next morning to a live cloud-hosted test environment at a subdomain of the domain it said it would buy along with test builds of native apps for Android, iOS, Linux, macOS, and Windows, all with near-100% automated test coverage and passing tests. Coding agents feel like magic, but we're clearly not there yet.
And that's just coding. If someone wanted to generate a high-quality custom feature-length movie within the usage limits of a $20/mo AI plan, they'd be sorely disappointed.
Don't forget the billion dollars or so of GPUs they had access to that they left out of that accounting. Also the R&D cost of the Meta model they originally used. Then they added $5.6 million on top of that.