zlacker

1. noosph+(OP) 2026-02-04 00:09:04
Yes, and it works in theory.

Less so in practice. You saturate the memory of a B200 with a few dozen tokens once the attention order goes above 4. Training is even worse.
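Rough back-of-envelope sketch of why, under assumptions that are mine and not from any particular implementation: an order-k attention materializes a dense score tensor with n^k entries per head (so order 2 is the usual n x n matrix), 2 bytes per entry, 32 heads, and about 192 GB of device memory. The function names are made up for illustration, and weights, activations and everything else on the device are ignored.

```python
# Back-of-envelope memory estimate for order-k attention.
# Assumptions (not from the source): dense score tensor of n**k entries per
# head, 2 bytes per entry (fp16/bf16), nothing else resident on the device.

def score_tensor_bytes(n_tokens, order, n_heads=1, bytes_per_elem=2):
    """Bytes needed to hold one dense order-`order` score tensor per head."""
    return n_heads * n_tokens ** order * bytes_per_elem


def max_tokens(budget_bytes, order, n_heads=1, bytes_per_elem=2):
    """Largest token count whose score tensors still fit in budget_bytes."""
    n = 0
    while score_tensor_bytes(n + 1, order, n_heads, bytes_per_elem) <= budget_bytes:
        n += 1
    return n


# Assumed memory budget, roughly one B200-class GPU (approximate figure).
ASSUMED_BUDGET_BYTES = 192 * 10**9

if __name__ == "__main__":
    for order in range(2, 8):
        n = max_tokens(ASSUMED_BUDGET_BYTES, order, n_heads=32)
        print(f"order {order}: at most {n} tokens")
```

The limit falls off fast as the order goes up, since each extra order multiplies the tensor size by another factor of n. For training you would also have to keep whatever the backward pass needs, which only shrinks the budget further.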

To paraphrase Knuth: high-order polynomials are far more unimaginably large than mere infinity.
