It's quite telling that #pragma omp simd exists at all, just to hint to the compiler that it should vectorize the loop.
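For anyone who hasn't used it, here is a minimal sketch of what that hint looks like. The function name saxpy and its signature are my own choices for illustration, not from any particular codebase. As far as I know, GCC and Clang only honor the pragma when built with -fopenmp or -fopenmp-simd; otherwise it is ignored and the loop is left to the regular auto-vectorizer.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative example: y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Precondition: both pointers must be valid when there is work to do. */
    assert(n == 0 || (x != NULL && y != NULL));

    /* Tells the compiler the iterations may be executed in SIMD lanes.
       Takes effect only with -fopenmp or -fopenmp-simd. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Note that the pragma is a promise made by the programmer, not something the compiler verifies: if the iterations actually depend on each other, the result is simply wrong.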
Now I wonder what the state of polyhedral compilers is. It's been many years. And given the AI/LLM hype, they could really shine.
If you know how to rewrite the algorithm so that your SIMD code makes close-to-ideal use of the CPU's execution ports, it is practically impossible to beat. I haven't seen a compiler (GCC, Clang) do such a thing, at least not in the cases I have written. I've measured substantial improvements from exploiting this and similar CPU-level microarchitectural details. So I don't think it's only a matter of loop analysis; I do think it's a practically impossible task for the compiler. Perhaps with AI ...
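To make the kind of rewrite I mean a bit more concrete, here is a deliberately simple sketch. It is not the code I measured, just the textbook case: splitting a floating-point reduction across several independent accumulators so the adds no longer form a single dependency chain and more than one execution unit can be kept busy. The function names and the choice of four accumulators are assumptions for illustration; the right number depends on the add latency and port count of the target CPU. Without -ffast-math or -fassociative-math the compiler is not allowed to reassociate the additions, so it will not do this for you.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one accumulator, so every add waits for the previous one. */
float sum_serial(const float *x, size_t n)
{
    assert(n == 0 || x != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four independent accumulators: four dependency chains instead of one,
   which gives the out-of-order core independent adds to issue in parallel.
   The summation order changes, so rounding may differ from sum_serial. */
float sum_unrolled4(const float *x, size_t n)
{
    assert(n == 0 || x != NULL);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: consume four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Combine the partial sums, then handle the 0 to 3 leftover elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        s += x[i];
    return s;
}
```

Whether this actually helps, and by how much, is something to measure on the specific machine; the real cases are usually much less obvious than a plain sum.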