The default 20% margin of error is indeed pretty wide, and it is intended to catch large, obvious regressions (e.g. an algorithm accidentally going from linear to quadratic).
As we described in the blog post, we have a second system based on real hardware. That system is on-demand: if an engineer has a suspect commit or an experimental change that might affect performance, they schedule a 30min-1hr run on that queue, where we run selected benchmarks 9-15 times each on laptops of varying specs. In this configuration the margin of error is closer to 2-3%, from our observations so far. To get more confidence you can run even more trials, though we typically advise 9 iterations.
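For intuition, here is a minimal sketch of the arithmetic (not our actual harness; the run times below are made up) showing why more iterations tighten the interval:

    # Approximate 95% confidence-interval half-width as a fraction of the mean.
    # The half-width shrinks roughly with sqrt(n), so more runs = tighter bound.
    import statistics, math

    def margin_of_error(samples, z=1.96):
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        return z * stdev / math.sqrt(len(samples)) / mean

    # e.g. nine runs of the same benchmark, wall-clock seconds (made-up numbers)
    runs = [12.1, 12.4, 11.9, 12.2, 12.0, 12.3, 12.1, 12.5, 12.2]
    print(f"~{margin_of_error(runs) * 100:.1f}% margin of error")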
We also run all our daily benchmarking on those laptops.
Edit: in addition to preventative testing, we also track production metrics, in a similar way to what the sibling comment describes.
Apart from the admin overhead (things kept getting stuck on OS updates), we ended up abandoning the setup because the variance was too large to get anything useful out of running tests for every PR.
The most reliable way for us to monitor performance today is to count slow frames (>16ms) across all real users and divide by the total time spent in the app. It's a rough proxy, but pretty accurate at showing us when we mess up.
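In rough pseudo-form the metric looks something like this (field names are hypothetical, just to illustrate the ratio):

    # Slow frames per hour of in-app time, aggregated across all user sessions.
    def slow_frame_rate(sessions):
        slow_frames = sum(s["frames_over_16ms"] for s in sessions)
        total_hours = sum(s["seconds_in_app"] for s in sessions) / 3600
        return slow_frames / total_hours if total_hours else 0.0

    sessions = [
        {"frames_over_16ms": 42, "seconds_in_app": 1800},
        {"frames_over_16ms": 7,  "seconds_in_app": 600},
    ]
    print(f"{slow_frame_rate(sessions):.1f} slow frames per user-hour")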
I worked on a really perf-sensitive system, and for perf tests we would re-run the last x commits each time to get rid of the busy-VM syndrome.
It meant that the margin of error could be much smaller.
You might want to consider it as a midway step between VMs and scheduling on laptops (those poor laptop batteries!).
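Roughly, the loop looks something like this (script name and commit count are placeholders): because all commits run back to back on the same machine, any "busy VM" noise hits them roughly equally and only the relative numbers matter.

    import subprocess, time

    N_COMMITS = 5  # the "last x commits" window to compare

    def bench_once():
        start = time.perf_counter()
        subprocess.run(["./run_benchmark.sh"], check=True)  # placeholder workload
        return time.perf_counter() - start

    commits = subprocess.run(
        ["git", "rev-list", f"--max-count={N_COMMITS}", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    results = {}
    for sha in commits:
        subprocess.run(["git", "checkout", "--quiet", sha], check=True)
        results[sha] = bench_once()

    baseline = results[commits[-1]]  # oldest commit in the batch
    for sha, elapsed in results.items():
        print(f"{sha[:8]}  {elapsed:6.2f}s  ({elapsed / baseline - 1:+.1%} vs oldest)")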
Of course, re-running the code from main and from the PR side by side on the same VM would be best, but it would cost a lot more money (especially once you factor in GPUs). We considered it but opted for the strategy I outlined above; it's mainly a trade-off between accuracy and cost.
CPU steal time is often visible in cloud environments. It can be useful for spotting noisy-neighbor behavior and deciding whether to adjust expectations or rerun.
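On Linux you can sanity-check a run with something like the following sketch, which samples the aggregate steal counter from /proc/stat before and after the benchmark (what threshold triggers a rerun is up to you):

    # Fraction of CPU time taken away by the hypervisor during the run.
    def read_cpu_times():
        with open("/proc/stat") as f:
            fields = f.readline().split()    # aggregate "cpu" line
        values = list(map(int, fields[1:]))
        return sum(values), values[7]        # total jiffies, steal jiffies

    total_before, steal_before = read_cpu_times()
    # ... run the benchmark here ...
    total_after, steal_after = read_cpu_times()

    steal_pct = 100 * (steal_after - steal_before) / max(total_after - total_before, 1)
    print(f"CPU steal during run: {steal_pct:.2f}%")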
But things like IO, GPU or memory contention could also be responsible. There are some fancy, newish extensions for controlling memory throughput: Intel has Memory Bandwidth Allocation controls in its Resource Director Technology, a suite of capabilities designed for observing and managing shared system resources. There are also controls for setting up cache usage/allocation.
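On Linux those RDT knobs are exposed through the resctrl filesystem (mounted at /sys/fs/resctrl, needs root and hardware support). A rough illustration of capping memory bandwidth for a benchmark process, where the group name and the 30% figure are arbitrary:

    import os
    from pathlib import Path

    RESCTRL = Path("/sys/fs/resctrl")
    group = RESCTRL / "bench_group"
    group.mkdir(exist_ok=True)

    # Limit memory bandwidth on socket 0 to roughly 30% for this group (MBA),
    # then move the current process into the group. Cache allocation works
    # similarly via "L3:..." lines in the same schemata file.
    (group / "schemata").write_text("MB:0=30\n")
    (group / "tasks").write_text(str(os.getpid()))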
I.e. I'm curious whether there's a cloud provider managing them for you, or you guys keep them in a closet somewhere.
For things like system updates and taking care of the hardware, we do it manually today. The fleet is still small, so it's manageable, but in the future we would like to consider a vendor if we can find one.