zlacker

[return to "FlashAttention-T: Towards Tensorized Attention"]
1. simian+gp 2026-02-03 23:33:23
>>matt_d+(OP)
OT, but instead of quadratic attention, can't we have n^10 or something crazier? I feel like we're limiting the intelligence just to save cost. But I can imagine there are some questions that might be worth paying a higher cost for.

I feel like n^10 attention can capture patterns that lower-complexity attention may not, so it seems arbitrary that we stop at n^2 attention.
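
To be clear about what that cost looks like (rough back-of-the-envelope, n = 4096 is just an illustrative context length):

    # number of scores you'd have to materialize for n tokens at interaction order k
    n = 4096
    print(n ** 2)    # pairwise attention:   ~1.7e7 scores
    print(n ** 10)   # 10th-order attention: ~1.3e36 scores, hopeless to store or compute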

2. storus+dv 2026-02-04 00:04:12
>>simian+gp
Aren't layers basically already doing n^k attention? The attention block is n^2 because it allows one number per query/key pair. But nothing prevents you from stacking these blocks on top of each other and getting a k-th order of "attentioness", with each layer encoding a different order.
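
A minimal numpy sketch of what I mean (single head, no learned projections, shapes picked just for illustration):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(x):
        # one n x n score matrix: a single number per (query, key) pair -> O(n^2)
        scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
        return scores @ x

    n, d = 8, 16
    x = np.random.randn(n, d)

    h1 = attention(x)   # layer 1: scores over pairs of raw tokens
    h2 = attention(h1)  # layer 2: scores over pairs of *mixtures* of tokens,
                        # so each score implicitly couples more than two original inputs

Each layer only ever materializes n^2 scores, so you never pay n^k memory; the higher-order coupling comes from depth. It's not literally the same thing as a dense n^k tensor of scores, but that's the intuition.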