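Given a few million calls of clamp, most would be no-ops in practice: the value is already in range and comes back unchanged. Modern CPUs are very good at observing this dynamically through branch prediction, so the comparisons cost very little once the pattern is learned.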
If your use case does not follow that pattern (the inputs fall out of range often and unpredictably) and you really care about performance, you may have to pull out something like inline assembly.
Consider software like ffmpeg, which has to do this for the sake of performance.
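Before going as far as inline assembly, it may be worth comparing two portable formulations. The sketch below is only an illustration: the names `clamp_branchy` and `clamp_minmax` are made up for this answer, and it is not how ffmpeg or any particular library does it. The first version is written with explicit comparisons; the second is written as a min/max pair, which compilers can often lower to conditional moves or min/max instructions, though that is not guaranteed.

```cpp
#include <algorithm>

// Hypothetical helper: clamp written with explicit comparisons.
// If the compiler emits real branches here, they are cheap when
// most inputs are already in range and the predictor learns that.
template <typename T>
constexpr T clamp_branchy(T v, T lo, T hi) {
    return v < lo ? lo : (hi < v ? hi : v);
}

// Hypothetical helper: clamp written as a min/max pair.
// Compilers can often turn this into branch-free code, but whether
// they do depends on the compiler, flags, type and target.
template <typename T>
constexpr T clamp_minmax(T v, T lo, T hi) {
    return std::min(std::max(v, lo), hi);
}
```

Both functions return the same result for ordered types such as integers as long as `lo <= hi`. Which one is faster on unpredictable input is something to measure, and looking at the generated assembly will tell you whether the compiler already produced branch-free code for you.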