I wonder if there's a connection to your Taylor truncation order. In RG terms, higher-order polynomial interactions are "irrelevant operators": they get suppressed as you flow toward the fixed point. If trained attention heads are sitting near this fixed point, that might explain why modest truncation orders work: the network has already learned to concentrate its computation in the lower-order terms. A testable prediction: layers with α closer to 2 (the power-law exponent weightwatcher fits to a layer's eigenvalue spectrum, https://github.com/CalculatedContent/WeightWatcher) might need fewer Taylor terms for accurate approximation than layers with α far from 2. If true, you could use the spectral statistics to adaptively choose the truncation order per head.
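
If anyone wants to poke at this, here's a rough sketch of the adaptive-order idea. The `WeightWatcher(model=...).analyze()` call and the per-layer `alpha` column are weightwatcher's documented API; the linear mapping from |α − 2| to a truncation order is entirely made up and is exactly the relationship the experiment would need to calibrate. Note also that weightwatcher reports α per layer, not per head, so a true per-head version would mean slicing the Q/K/V matrices yourself before fitting.

```python
import weightwatcher as ww

def suggest_truncation_orders(model, min_order=2, max_order=6):
    """Hypothetical per-layer Taylor truncation orders from spectral alpha.

    HTSR reading: alpha near 2 means a heavy-tailed, "well-trained" layer,
    so assign fewer terms; alpha far from 2 gets more. The linear schedule
    below is a placeholder, not a calibrated rule.
    """
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()                 # one row per analyzable layer
    details = details.dropna(subset=["alpha"])  # skip layers with no power-law fit

    orders = {}
    for _, row in details.iterrows():
        # Distance from the alpha ~ 2 "fixed point", clamped to [0, 2],
        # then interpolated linearly across [min_order, max_order].
        frac = min(abs(row["alpha"] - 2.0), 2.0) / 2.0
        orders[row["name"]] = round(min_order + frac * (max_order - min_order))
    return orders
```

The real test would be to correlate these suggestions with empirically measured truncation error, e.g. for each layer find the smallest order that keeps approximation error under some tolerance and see whether it tracks |α − 2|.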