zlacker

Compilers often under-generate conditional instructions. They implicitly assume (correctly) that most branches you write are 90/10 (i.e. very predictable), not 50/50. The branches that actually are 50/50 suffer from being treated as 90/10.
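
To make the distinction concrete, here is a minimal sketch of the same clamp written two ways. The function names are made up for illustration, and both assume lo <= hi. The ternary form is the shape compilers often (but not always) lower to conditional moves; which instructions you actually get depends on the compiler, flags, and target, so check the generated assembly.

```c
#include <stdint.h>

/* Branchy form: cheap when the comparisons are predictable (the 90/10 case). */
static inline int32_t clamp_branchy(int32_t x, int32_t lo, int32_t hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Select form: two ternaries that a compiler may turn into conditional
   instructions instead of jumps. This is a hint, not a guarantee. */
static inline int32_t clamp_select(int32_t x, int32_t lo, int32_t hi) {
    int32_t t = (x < lo) ? lo : x;  /* raise to the lower bound if needed */
    return (t > hi) ? hi : t;       /* then cap at the upper bound */
}
```

Both return the same values; the only question is whether a misprediction penalty or an always-paid dependency chain is cheaper for your data.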

replies(2): >>fooker+ma >>IainIr+rE1

>>pclmul+(OP)
The branches in this example are not 50/50.

Given a few million calls of clamp, most would be no-ops in practice. Modern CPUs are very good at dynamically observing this.

replies(1): >>pclmul+0M2

>>pclmul+(OP)
It's hard to predict statically which branches will be dynamically unpredictable.

A seasoned hardware architect once told me that Intel went all-in on predication for Itanium, under the assumption that a Sufficiently Smart Compiler could figure it out, and then discovered to their horror that their compiler team's best efforts were not Sufficiently Smart. He implied that this was why Intel pushed to get a profile-guided optimization step added to the SPEC CPU benchmark, since profiling was the only way to get sufficiently accurate data.

I've never gone back to see whether the timeline checks out, but it's a good story.

replies(1): >>fooker+7F1

>>IainIr+rE1
The compiler doesn't do much of the predicting; the CPU does it at runtime.

replies(1): >>kybore+4P1

>>fooker+7F1
Not prediction, predication: https://en.wikipedia.org/wiki/Predication_(computer_architec...

By avoiding conditional branches and essentially masking out some instructions, you can avoid stalls and mis-predictions and keep the pipeline full.
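
A rough software analogue of that masking idea, as a sketch only: compute both candidates and pick one with a bit mask, so there is no conditional jump to mispredict. The helper names select_u32 and max_u32 are hypothetical, and this is ordinary C rather than hardware predication as on Itanium.

```c
#include <stdint.h>

/* Returns a if cond is nonzero, otherwise b, without a conditional jump
   in the source. mask is all ones when cond holds and all zeros otherwise. */
static inline uint32_t select_u32(int cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* Unsigned max built on the masked select. */
static inline uint32_t max_u32(uint32_t a, uint32_t b) {
    return select_u32(a > b, a, b);
}
```

The cost is that both operands are always evaluated, which is the same trade-off predicated instructions make.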

Actually I think @IainIreland mis-remembers what the seasoned architect told him about Itanium. While Itanium did support predicated instructions, the problematic static scheduling was actually because Itanium was a VLIW machine: https://en.wikipedia.org/wiki/VLIW .

TL;DR: dynamic scheduling on superscalar out-of-order processors with vector units works great and the transistor overhead got increasingly cheap, but static scheduling stayed really hard.

>>fooker+ma
Do you know that for a fact? For all calls of clamp? I have definitely used min and max when they are true 50/50s and I assume clamp also gets some similar use.

replies(1): >>fooker+vY2

>>pclmul+0M2
Modern compilers generate code assuming all branches are highly predictable.

If your use case does not follow that pattern and you really care about performance, you have to pull out something like inline assembly.

Consider software like ffmpeg, which has to do this for the sake of performance.
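
For illustration, a minimal sketch of what forcing a conditional move with inline assembly can look like. It assumes GCC or Clang extended asm with the default AT&T syntax on x86-64, and falls back to plain C elsewhere. The name max_i64 is made up, and this is not taken from ffmpeg.

```c
#include <stdint.h>

/* Signed 64-bit max that pins the choice to a cmov on x86-64. */
static inline int64_t max_i64(int64_t a, int64_t b) {
#if defined(__GNUC__) && defined(__x86_64__)
    int64_t r = a;
    __asm__("cmpq   %1, %0\n\t"   /* set flags from r - b */
            "cmovlq %1, %0"       /* if r < b (signed), r = b */
            : "+r"(r)             /* r is read and written */
            : "r"(b)
            : "cc");              /* condition codes are clobbered */
    return r;
#else
    /* Portable fallback: the compiler decides between a branch and a select. */
    return (a > b) ? a : b;
#endif
}
```

The asm block is opaque to the optimizer, so it also blocks things like constant folding and vectorization. It only pays off when you have measured that the branch really is unpredictable.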