The default 20% margin of error is indeed pretty wide; it is intended to catch large, obvious regressions (e.g. an algorithm accidentally becoming quadratic instead of linear).
As we described in the blog post, we have a second system based on real hardware. This system is on-demand: if an engineer has a suspect commit or an experimental change that might affect performance, they can schedule a 30min-1hr run on that queue, where we run selected benchmarks 9-15 times each on laptops of varying strength. In this configuration, the margin of error is closer to 2-3% from our observations so far. To get more confidence you could run even more trials, though we typically advise 9 iterations.
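For a rough illustration of what that margin of error means in practice, you can treat it as a ~95% confidence interval around the mean of the repeated runs, expressed as a percentage of the mean. A minimal sketch (made-up numbers, not our actual tooling):

    import statistics

    def margin_of_error_pct(samples_ms, z=1.96):
        """Half-width of an approximate 95% confidence interval, as % of the mean.
        (Normal approximation; with only ~9 samples a Student's t value
        would widen this a little.)"""
        mean = statistics.mean(samples_ms)
        sem = statistics.stdev(samples_ms) / len(samples_ms) ** 0.5  # std error of the mean
        return 100.0 * z * sem / mean

    # e.g. 9 iterations of one benchmark on one laptop (made-up times, in ms)
    runs = [395.0, 428.0, 410.0, 441.0, 403.0, 418.0, 399.0, 432.0, 415.0]
    print(f"mean = {statistics.mean(runs):.1f} ms, "
          f"margin of error ~ {margin_of_error_pct(runs):.1f}%")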
We also do all our daily benchmarking on those laptops.
Edit: in addition to preventative testing, we also track production metrics, similar to what the sibling comment describes.
I worked on a really perf-sensitive system, and for perf tests we would re-run the last x commits each time to get rid of the busy-VM syndrome: since every commit in the window is measured in the same session on the same machine, the noise hits them all roughly equally and you compare relative numbers.
It meant that the margin of error could be much lower.
You might want to consider it as a midway step between VMs and scheduling on laptops (those poor laptop batteries!).
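For anyone curious, the idea looks roughly like this; a minimal sketch with hypothetical build/benchmark commands, not the actual scripts we used:

    import subprocess, time

    X = 10  # how many recent commits to re-measure in one session

    def run(*cmd):
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    orig = run("git", "rev-parse", "--abbrev-ref", "HEAD").strip()
    commits = run("git", "rev-parse", *[f"HEAD~{i}" for i in range(X)]).split()

    results = {}
    for sha in commits:
        run("git", "checkout", "--quiet", sha)
        run("make", "build")                 # hypothetical build step
        start = time.perf_counter()
        run("./bench")                       # hypothetical benchmark binary
        results[sha] = time.perf_counter() - start
    run("git", "checkout", "--quiet", orig)  # restore the original branch

    baseline = results[commits[-1]]          # oldest commit in the window
    for sha in commits:
        delta = results[sha] / baseline - 1
        print(f"{sha[:10]}  {results[sha]:.3f}s  ({delta:+.1%} vs oldest)")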
Ed