> Our key insight is to offload critical softmax primitives to idle tensor units, maximizing hardware utilization and throughput.

> … speedups of 1.05–1.17× across diverse attention configurations on Ampere and Hopper GPUs …
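
The quoted abstract does not say which softmax primitives get moved to the tensor units, so what follows is only a guess at the general idea, not the paper's method. One softmax step that can be phrased as a matrix product is the row sum of the exponentials: multiplying by a column of ones gives the same result as a reduction. The names `matmul` and `softmax_rows` are hypothetical, and the plain-Python `matmul` merely stands in for a tensor-unit GEMM.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product, a stand-in for a tensor-unit GEMM."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Row-wise softmax with the row-sum reduction written as a matmul."""
    exps = []
    for row in scores:
        row_max = max(row)  # subtract the row max for numerical stability
        exps.append([math.exp(s - row_max) for s in row])

    # Assumption for illustration: the row sum is computed as exps @ ones,
    # which is the kind of operation a matrix unit could absorb.
    ones = [[1.0] for _ in scores[0]]
    sums = matmul(exps, ones)  # shape: rows x 1

    return [[e / total[0] for e in row] for row, total in zip(exps, sums)]


probs = softmax_rows([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
```

Each row of `probs` sums to 1, same as with an ordinary reduction. Whether this particular rewrite is what produces the quoted 1.05–1.17× is not something the abstract confirms.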