1. storus+(OP) 2026-02-04 00:04:12
Aren't layers basically doing n^k attention? The attention block is n^2 because it allows one number per input/output pair. But nothing prevents you from stacking these on top of each other and getting a k-th order of "attentioness", with each layer encoding a different order.
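
To make the composition concrete, here is a minimal sketch, assuming each layer simply mixes its inputs by a softmax attention matrix (value/output projections and MLPs ignored): a single layer is pairwise (n^2 scores), and stacking layers multiplies those matrices, which is the higher-order chaining described above.

    % One attention layer: one mixing weight per (output i, input j) pair, i.e. n^2 scores.
    y_i = \sum_{j=1}^{n} A_{ij} x_j

    % Stacking a second layer composes the two mixing matrices:
    z_i = \sum_{k=1}^{n} B_{ik} y_k
        = \sum_{k=1}^{n} \sum_{j=1}^{n} B_{ik} A_{kj} x_j

    % After k layers the effective mixing is a product of k (input-dependent)
    % attention matrices, so each output reaches the input through chains of
    % k intermediate positions:
    z = A^{(k)} A^{(k-1)} \cdots A^{(1)} x

Each A^{(l)} in this sketch is still computed from pairwise scores within its own layer; the higher-order dependence comes from the chaining across layers.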