The default 20% margin of error is indeed pretty wide, and it is intended to catch large, obvious regressions (e.g. an algorithm accidentally going from linear to quadratic).
As we described in the blog post, we have a second system based on real hardware. That system is on-demand: if an engineer has a suspect commit or an experimental change that might affect performance, they schedule a 30min-1hr run on that queue, where we run selected benchmarks 9-15 times each on laptops of varying specs. In this configuration the margin of error is closer to 2-3%, from our observations so far. To get more confidence you can run even more trials, though we typically advise 9 iterations.
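For intuition, here is a minimal sketch of the arithmetic (not our actual harness; the run times below are made up) showing why more iterations tighten the interval:

    # Approximate 95% confidence-interval half-width as a fraction of the mean.
    # The half-width shrinks roughly with sqrt(n), so more runs = tighter bound.
    import statistics, math

    def margin_of_error(samples, z=1.96):
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        return z * stdev / math.sqrt(len(samples)) / mean

    # e.g. nine runs of the same benchmark, wall-clock seconds (made-up numbers)
    runs = [12.1, 12.4, 11.9, 12.2, 12.0, 12.3, 12.1, 12.5, 12.2]
    print(f"~{margin_of_error(runs) * 100:.1f}% margin of error")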
We also run all our daily benchmarking on those laptops.
Edit: in addition to preventative testing, we also track production metrics, in a similar way to what the sibling comment describes.
Apart from the admin overhead (things kept getting stuck on OS updates), we ended up abandoning the setup because the variance was too large to get anything useful out of running tests for every PR.
The most reliable way for us to monitor performance today is to count slow frames (>16ms) across all real users and divide by the total time spent in the app. It's a rough proxy, but pretty accurate at showing us when we mess up.
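In rough pseudo-form the metric looks something like this (field names are hypothetical, just to illustrate the ratio):

    # Slow frames per hour of in-app time, aggregated across all user sessions.
    def slow_frame_rate(sessions):
        slow_frames = sum(s["frames_over_16ms"] for s in sessions)
        total_hours = sum(s["seconds_in_app"] for s in sessions) / 3600
        return slow_frames / total_hours if total_hours else 0.0

    sessions = [
        {"frames_over_16ms": 42, "seconds_in_app": 1800},
        {"frames_over_16ms": 7,  "seconds_in_app": 600},
    ]
    print(f"{slow_frame_rate(sessions):.1f} slow frames per user-hour")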
I worked on a really perf-sensitive system, and for perf tests we would re-run the last x commits each time to get rid of the busy-VM syndrome.
It meant that the margin of error could be much smaller.
You might want to consider it as a midway step between VMs and scheduling on laptops (those poor laptop batteries!).
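Roughly, the loop looks something like this (script name and commit count are placeholders): because all commits run back to back on the same machine, any "busy VM" noise hits them roughly equally and only the relative numbers matter.

    import subprocess, time

    N_COMMITS = 5  # the "last x commits" window to compare

    def bench_once():
        start = time.perf_counter()
        subprocess.run(["./run_benchmark.sh"], check=True)  # placeholder workload
        return time.perf_counter() - start

    commits = subprocess.run(
        ["git", "rev-list", f"--max-count={N_COMMITS}", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    results = {}
    for sha in commits:
        subprocess.run(["git", "checkout", "--quiet", sha], check=True)
        results[sha] = bench_once()

    baseline = results[commits[-1]]  # oldest commit in the batch
    for sha, elapsed in results.items():
        print(f"{sha[:8]}  {elapsed:6.2f}s  ({elapsed / baseline - 1:+.1%} vs oldest)")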
Of course, re-running the code from main and from the PR side by side on the same VM would be best, but it would cost a lot more money (especially once you factor in GPUs). We considered it but opted for the strategy I outlined above; it's mainly a trade-off between accuracy and cost.
CPU steal time is often visible in cloud environments. It can be useful for spotting noisy-neighbor behavior and deciding whether to adjust expectations or rerun.
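On Linux you can sanity-check a run with something like the following sketch, which samples the aggregate steal counter from /proc/stat before and after the benchmark (what threshold triggers a rerun is up to you):

    # Fraction of CPU time taken away by the hypervisor during the run.
    def read_cpu_times():
        with open("/proc/stat") as f:
            fields = f.readline().split()    # aggregate "cpu" line
        values = list(map(int, fields[1:]))
        return sum(values), values[7]        # total jiffies, steal jiffies

    total_before, steal_before = read_cpu_times()
    # ... run the benchmark here ...
    total_after, steal_after = read_cpu_times()

    steal_pct = 100 * (steal_after - steal_before) / max(total_after - total_before, 1)
    print(f"CPU steal during run: {steal_pct:.2f}%")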
But things like IO, GPU or memory contention could also be responsible. There are some fancy, newish extensions for controlling memory throughput: Intel has Memory Bandwidth Allocation controls in its Resource Director Technology, a suite of capabilities designed for observing and managing shared system resources. There are also controls for setting up cache usage/allocation.
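On Linux those RDT knobs are exposed through the resctrl filesystem (mounted at /sys/fs/resctrl, needs root and hardware support). A rough illustration of capping memory bandwidth for a benchmark process, where the group name and the 30% figure are arbitrary:

    import os
    from pathlib import Path

    RESCTRL = Path("/sys/fs/resctrl")
    group = RESCTRL / "bench_group"
    group.mkdir(exist_ok=True)

    # Limit memory bandwidth on socket 0 to roughly 30% for this group (MBA),
    # then move the current process into the group. Cache allocation works
    # similarly via "L3:..." lines in the same schemata file.
    (group / "schemata").write_text("MB:0=30\n")
    (group / "tasks").write_text(str(os.getpid()))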
I.e. I'm curious whether there's a cloud provider managing them for you, or you guys keep them in a closet somewhere.
For things like system updates and taking care of the hardware, we do it manually today. The fleet is still small, so it's manageable, but in the future we would like to consider a vendor if we can find one.