Video presentation if someone prefers it: https://www.youtube.com/watch?v=PN3nYBowSvM
Linear attention can be viewed as a first-degree Taylor approximation of softmax attention (the exponential is replaced by its truncated series), and model performance gets better as you increase the degree of the approximation.
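To make the idea concrete, here is a minimal NumPy sketch, not the implementation from the video. It replaces exp(x) in the attention weights with its Taylor series truncated at a given degree and compares the result to ordinary softmax attention. The function names, tensor shapes, and the 0.5 input scale are my own assumptions for illustration. For readability it builds the full score matrix, so it shows only the approximation quality, not the linear-time factorization that makes linear attention cheap.

```python
import math
import numpy as np

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention for a single head."""
    scores = q @ k.T / math.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def taylor_attention(q, k, v, degree):
    """Attention with exp(x) replaced by sum_{n=0}^{degree} x^n / n!.

    Assumption: scores are small enough that the truncated series stays
    positive; odd degrees can go negative for large negative scores.
    """
    scores = q @ k.T / math.sqrt(q.shape[-1])
    approx = sum(scores**n / math.factorial(n) for n in range(degree + 1))
    weights = approx / approx.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical toy inputs: 8 tokens, head dimension 16, small magnitudes.
rng = np.random.default_rng(0)
q, k, v = (0.5 * rng.standard_normal((8, 16)) for _ in range(3))

reference = softmax_attention(q, k, v)
for degree in (1, 2, 4):
    err = np.abs(taylor_attention(q, k, v, degree) - reference).max()
    print(f"degree {degree}: max abs error vs softmax = {err:.2e}")
```

On inputs like these the error should shrink as the degree goes up. With larger score magnitudes the low-degree approximations degrade quickly, which is part of why higher degrees help.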
I'm thinking about adapting an existing model to Taylor-approximated attention. I think it should be possible with some model surgery (swapping out the softmax attention) followed by "rehabilitation" training to recover the performance lost in the swap.