There's just no good way to visualize the compute needed, with all the nuance involved. I think that trying to create these abstractions is what leads to people impulse-buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.
Wouldn't more "real world" tests help with the DDR4 example, though?
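Even a rough back-of-the-envelope sketch would frame those tests. Assuming decode is purely memory-bandwidth bound (every generated token streams all the weights once, ignoring KV cache traffic and real-world efficiency losses), and using illustrative numbers rather than measurements (dual-channel DDR4-3200, a 7B model quantized to roughly 4 bits):

```python
# Rough ceiling on decode speed for a memory-bandwidth-bound LLM.
# All numbers here are illustrative assumptions, not measurements.

bandwidth_gb_s = 2 * 25.6   # dual-channel DDR4-3200, theoretical peak (GB/s)
params_billion = 7          # e.g. a 7B-parameter model
bytes_per_param = 0.5       # ~4-bit quantization

weights_gb = params_billion * bytes_per_param   # ~3.5 GB of weights
tokens_per_s = bandwidth_gb_s / weights_gb      # each token streams the full weights once

print(f"upper bound: ~{tokens_per_s:.1f} tok/s")  # ~14.6 tok/s; real runs land below this
```

Measured numbers will land under that ceiling, but seeing the benchmarks next to the bandwidth math is exactly the kind of real-world grounding I'd want before buying hardware.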
The bigger takeaway (IMO) is that there will never really be hardware you can run at home that scales the way Claude or ChatGPT does. I love local AI, but it pushes against the fundamental limits of on-device compute.