My other concern would be that Taylor itself is fairly complex. I wonder how well GPU's handle this in comparison to good old fashioned softmax? The last time I used Taylor with a custom Triton kernel it was still very slow. That could just have been my own jank vibe-coded implementation though.