I believe this is "always" rather than often when it comes to the actual operations defined by the FP standard. gcc does play it fast and loose (-ffast-math is not yet enabled by default, but FMA contraction is), which is technically illegal, but at least it can easily be configured back into a standards-compliant mode.
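A minimal way to see the contraction in action (a sketch, assuming gcc on FMA-capable x86; the file name contract.c is made up):

    #include <stdio.h>

    /* Compare:
         gcc -O2 -mfma contract.c                      (contraction allowed by default)
         gcc -O2 -mfma -ffp-contract=off contract.c    (strict per-operation IEEE)   */
    int main(void) {
        volatile double va = 1.0 + 0x1p-27, vb = 1.0 - 0x1p-27;
        double a = va, b = vb;
        double c = va, d = vb;   /* separate volatile loads, so the products aren't merged */
        /* In exact arithmetic a*b - c*d == 0.  If the compiler contracts one
           product into an FMA, the two sides round differently and the result
           comes out as a small nonzero number. */
        double r = a * b - c * d;
        printf("%a\n", r);
        return 0;
    }

With contraction the difference should come out around +/-0x1p-54; with -ffp-contract=off it should be exactly zero, which is the strict per-operation answer.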
I think the bigger problem comes from what is _not_ documented by the standard, e.g. the transcendental functions. A program calling plain old sqrt(x) can find itself behaving differently _even between different steppings of the same core_, not to mention the well-known differences between AMD and Intel. This is all using the same binary.
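A quick way to check for yourself is to diff the exact bit pattern of a libm result across machines (just a sketch; you link whatever libm each machine ships):

    #include <inttypes.h>
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Large arguments stress the libm's range reduction, which is where
           implementations tend to disagree.  volatile keeps the compiler from
           folding sin() at compile time. */
        volatile double x = 1e22;
        double s = sin(x);
        uint64_t bits;
        memcpy(&bits, &s, sizeof bits);
        printf("sin(%g) = %.17g  bits=%016" PRIx64 "\n", (double)x, s, bits);
        return 0;
    }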
Unless of course we are talking about the 80-bit format. If that's not the case, I'd be interested to know where they differ.
Unfortunately, for the transcendental functions the accuracy still hasn't been pinned down, not least because it's still an ongoing research problem.
There have been some great strides in figuring out the worst cases for binary floating point up to doubles, so hopefully an upcoming revision of the standard will stipulate 0.5 ULP (i.e. correct rounding) for transcendentals. But decimal floating point still has a long way to go.
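For reference, a small sketch of the usual bit-twiddling for measuring error in ULPs (assumes finite doubles of the same sign; ulp_dist is just a name I made up):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Distance in ULPs between two finite doubles of the same sign: for
       same-sign IEEE 754 doubles, the ordering of the values matches the
       ordering of their bit patterns taken as integers, so the ULP distance
       is just the difference of those integers. */
    static uint64_t ulp_dist(double a, double b) {
        uint64_t ua, ub;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);
        return ua > ub ? ua - ub : ub - ua;
    }

    int main(void) {
        double x = 0.1;
        /* e.g. compare a truncated Taylor series against the libm result */
        double approx = 1.0 + x + x*x/2 + x*x*x/6;
        printf("%llu ulps apart\n", (unsigned long long)ulp_dist(exp(x), approx));
        return 0;
    }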
Every IEEE 754 architecture (including SSE) I've worked on has a correctly rounded sqrt(), which is what the standard requires.
I'm assuming you're talking about builds with "fast math" enabled? In which case all bets are off anyway!
Now, there are also often approximate rsqrt and reciprocal instructions, with varying degrees of accuracy, and those can be "fun."
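For example, rsqrtss is only specified to a relative error of at most 1.5 * 2^-12, and as far as I know the exact bits aren't pinned down, so different vendors can legitimately return different results. Quick sketch:

    #include <math.h>
    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void) {
        float x = 2.0f;
        /* Approximate reciprocal square root: fast, but only ~12 bits accurate. */
        float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
        float exact  = 1.0f / sqrtf(x);
        printf("rsqrtss: %.9g   1/sqrtf: %.9g   rel err: %g\n",
               approx, exact, fabsf(approx - exact) / exact);
        return 0;
    }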
Or maybe the library you use...
FMAs were difficult. The Visual Studio compiler in particular didn't support explicit FMAs for SSE code, so you had to rely on the compiler to recognise and fuse multiply-adds itself. Generally I want FMAs because they're more accurate, but I also want to control where they go.
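These days the most portable way I know of to place an FMA deliberately is fma() from <math.h>, or the FMA3 intrinsics directly. Rough sketch (assumes FMA3 hardware for the intrinsic path; build with -mfma or /arch:AVX2):

    #include <immintrin.h>   /* _mm_fmadd_ss (FMA3) */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
        /* fma() rounds a*b + c exactly once, so the low bits of the product
           survive; the plain multiply-add rounds twice (unless the compiler
           contracts it anyway -- -ffp-contract=off keeps it honest). */
        printf("fma:     %a\n", fma(a, b, c));
        printf("mul+add: %a\n", a * b + c);

        float fa = 1.5f, fb = 2.5f, fc = 0.25f;
        __m128 r = _mm_fmadd_ss(_mm_set_ss(fa), _mm_set_ss(fb), _mm_set_ss(fc));
        printf("fmadd_ss: %g\n", _mm_cvtss_f32(r));
        return 0;
    }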