zlacker

1. observ+(OP) 2026-02-04 15:39:05
This could turbocharge ByT5 and other token-free architectures, whose big downside has been the extra compute needed for their longer byte-level sequences. It's easy to imagine a range of strategies that vary the level of "focus" across the input while staying within a fixed compute budget, with that budget assigned on the fly and learned optimizers informing the distribution.
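To make the "fixed compute budget" idea a bit more concrete, here is a rough sketch, not anything taken from ByT5 or from a real library. It assumes the input is already split into segments and that something upstream has produced a relevance score per segment (the stand-in for the "learned" part). The names `byte_length` and `allocate_budget`, and the scores themselves, are hypothetical. The budget is split in proportion to a softmax over the scores, and rounding leftovers go to the segments with the largest remainders so the total always matches the budget.

```python
import math


def byte_length(text: str) -> int:
    """Sequence length a byte-level model would see (UTF-8 bytes)."""
    return len(text.encode("utf-8"))


def allocate_budget(scores: list[float], total_budget: int) -> list[int]:
    """Split an integer budget across segments in proportion to softmax(scores).

    Hypothetical helper: `scores` stands in for whatever a learned
    controller would output; higher means "focus more here".
    """
    if not scores:
        return []
    # Numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    shares = [e / norm * total_budget for e in exps]

    # Round down, then hand out the remaining units by largest remainder
    # so the allocation sums to total_budget.
    alloc = [math.floor(s) for s in shares]
    leftover = total_budget - sum(alloc)
    by_remainder = sorted(
        range(len(scores)), key=lambda i: shares[i] - alloc[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    segments = ["def f(x):", "    # héllo wörld", "    return x"]
    scores = [2.0, -1.0, 0.5]  # made-up relevance scores
    budget = 100               # made-up total compute units
    for seg, units in zip(segments, allocate_budget(scores, budget)):
        print(f"{byte_length(seg):3d} bytes -> {units:3d} units: {seg!r}")
```

The total stays fixed no matter how the scores shift, so "focus" only changes where the compute goes, not how much is spent.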