My other concern would be that Taylor itself is fairly complex. I wonder how well GPU's handle this in comparison to good old fashioned softmax? The last time I used Taylor with a custom Triton kernel it was still very slow. That could just have been my own jank vibe-coded implementation though.
They also haven't' tried to write a high performance kernel for triton yet. If it goes the way my last experiment with Taylor did they're in for some bad news.
I'm just a hobbyist though, it's certainly possible that people with more time/resources could outperform me without much effort. I just want to see it tested on something familiar and benchmark-able.