zlacker

1. eldenr+(OP) 2026-02-03 23:41:46
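This is a common way of thinking. In practice, this kind of thing is more like optimizing FLOP allocation: with a fixed budget, FLOPs spent on a more expensive operation are FLOPs not spent on extra layers or parameters. Surely with an infinite compute and parameter budget you could build a better model out of more intensive operations.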
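To make the allocation point concrete, here is a rough sketch using the usual back-of-the-envelope estimate of forward FLOPs per token for a standard transformer layer (2 FLOPs per multiply-accumulate). The function names, the `cost_multiplier` knob, and the example sizes are my own illustrative assumptions, not anything measured.

```python
def layer_flops_per_token(d_model: int, seq_len: int, mlp_ratio: int = 4) -> int:
    """Approximate forward FLOPs per token for one standard transformer layer."""
    attn_proj = 8 * d_model * d_model        # Q, K, V and output projections
    attn_mix = 4 * seq_len * d_model         # QK^T scores plus weighted sum over values
    mlp = 4 * mlp_ratio * d_model * d_model  # two matmuls: d -> r*d -> d
    return attn_proj + attn_mix + mlp


def layers_within_budget(budget: int, d_model: int, seq_len: int,
                         cost_multiplier: int = 1) -> int:
    """How many layers fit in a per-token FLOP budget.

    cost_multiplier is a hypothetical stand-in for swapping in a more
    intensive operation that costs that many times a standard layer.
    """
    per_layer = cost_multiplier * layer_flops_per_token(d_model, seq_len)
    return budget // per_layer


if __name__ == "__main__":
    budget = 2 ** 30  # assumed per-token FLOP budget, purely illustrative
    for mult in (1, 2, 4):
        depth = layers_within_budget(budget, d_model=1024, seq_len=2048,
                                     cost_multiplier=mult)
        print(f"op cost x{mult}: {depth} layers fit in the budget")
```

Under these assumptions, doubling the cost of the per-layer operation halves the depth you can afford, so the fancier operation has to beat the layers it displaces, not just the simpler operation it replaces.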

Another thing to consider is that transformers are very general computers. You can encode a great many more complex architectures in simpler, multi-layer transformers.
