zlacker

It depends.

You need 2~3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't do it because it only creates those when the operation is associative, i.e. (a+b)+c = a+(b+c), which is true for integers but not for floats.

There is an escape hatch in -ffast-math.

I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...

replies(2): >>menaer+sq >>lifthr+OS

>>mratsi+(OP)
In my experience, compilers rarely know how to make use of ILP even in some for what you would expect to be the "simple" cases. Handwriting the SIMD, at least in my case, almost always proved to be several times faster than the auto-vectorized code generated by the compiler.

replies(1): >>mratsi+SC

>>menaer+sq
They do reorder instructions. I think the SIMD part has more to do with loop analysis than ILP.

It's quite telling that there is a #pragma omp simd to hint to a compiler to rewrite the loop.

Now I wonder what's the state of polyhedral compilers. It's been many years. And given the AI, LLMs hype they could really shine.

replies(1): >>menaer+1o1

>>mratsi+(OP)
You don't necessarily need `-ffast-math` if you are willing to tweak your code a bit, since you can just introduce accumulators by yourself. Such optimizer-friendly code is not hard to write if you know basic principles.

replies(1): >>mratsi+Df1

>>lifthr+OS
That's what I'm saying though. Either youbdo the accumulators yourself, or you need a compiler escape hatch.

replies(1): >>lifthr+Pi1

>>mratsi+Df1
I mean, you don't need SIMD to put accumulators.

replies(1): >>mratsi+qz4

>>mratsi+SC
> I think the SIMD part has more to do with loop analysis than ILP.

If you know how to rewrite the algorithm in such a way so that it makes close-to-ideal utilization of CPU ports through your SIMD then it is practically impossible to beat it. And I haven't seen a compiler (GCC, clang) doing such a thing or at least not in the instances I had written. I've measured substantial improvements from such and similar utilization of CPU-level microarchitectural details. So perhaps I don't think it's the loop analysis only but I do think it's practically an impossible task for the compiler. Perhaps with the AI ...

>>lifthr+Pi1
The compiler escape hatch is -ffast-math