Using this microbenchmark on an Intel Sapphire Rapids CPU, compiled with march=k8 to get the older form, takes ~980ns, while compiling with march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
Given a few million calls of clamp, most would be no-ops in practice. Modern CPUs are very good at dynamically observing this.
On surrounding code - for sure.
A seasoned hardware architect once told me that Intel went all-in on predication for Itanium, under the assumption that a Sufficiently Smart Compiler could figure it out, and then discovered to their horror that their compiler team's best efforts were not Sufficiently Smart. He implied that this was why Intel pushed to get a profile-guided optimization step added to the SPEC CPU benchmark, since profiling was the only way to get sufficiently accurate data.
I've never gone back to see whether the timeline checks out, but it's a good story.
By avoiding conditional branches and essentially masking out some instructions, you can avoid stalls and mis-predictions and keep the pipeline full.
Actually I think @IainIreland mis-remembers what the seasoned architect told him about Itanium. While Itanium did support predicated instructions, the problematic static scheduling was actually because Itanium was a VLIW machine: https://en.wikipedia.org/wiki/VLIW .
TL;DR: dynamic scheduling on superscalar out-of-order processors with vector units works great and the transistor overhead got increasingly cheap, but static scheduling stayed really hard.
If your use case does not follow that pattern and you really care about performance, you have to pull out something like inline assembly.
Consider software like ffmpeg which have to do this for the sake of performance.