That's a good way to address the noise on VMs! We do something different but in a similar spirit: when we compare to the main branch, we calculate the baseline based on 1-2 weeks worth of historical data on main (we identify the latest step change with a simple linear regression). This way we approximate the baseline based on ~100 data points which also helps to address the variance.
Of course re-running the code from main and the PR on the same VM side by side would be the best, and it would cost a lot more money (especially once you factor in GPUs). We considered it but opted to the strategy I outlined above, it's mainly a trade-off between accuracy vs costs