Apart from the admin overhead (things got stuck on OS updates), we ended up abandoning the setup because the variance was too high to get anything useful out of running tests for every PR.
The most reliable way for us to monitor performance today is to count slow frames (>16 ms) across all actual users and divide that count by the total time spent in the app. It’s a rough proxy, but pretty accurate at showing us when we mess up.
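
To make the metric concrete, here is a minimal sketch of how it could be computed. It assumes each user session reports a list of frame durations and its time in the app. The `Session` type, the field names, and the per-minute normalization are hypothetical choices for illustration, not our actual pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List

SLOW_FRAME_MS = 16.0  # threshold from the text: frames longer than 16 ms count as slow


@dataclass
class Session:
    """Hypothetical per-session telemetry record (names are assumptions)."""
    frame_durations_ms: List[float]
    time_in_app_s: float


def slow_frames_per_minute(
    sessions: Iterable[Session], threshold_ms: float = SLOW_FRAME_MS
) -> float:
    """Total slow frames across all sessions divided by total minutes in app."""
    slow_frames = 0
    total_seconds = 0.0
    for session in sessions:
        slow_frames += sum(1 for d in session.frame_durations_ms if d > threshold_ms)
        total_seconds += session.time_in_app_s
    if total_seconds <= 0:
        return 0.0  # no usage recorded, so there is nothing to report
    return slow_frames / (total_seconds / 60.0)


if __name__ == "__main__":
    sessions = [
        Session(frame_durations_ms=[10, 20, 30], time_in_app_s=60),
        Session(frame_durations_ms=[5, 17], time_in_app_s=60),
    ]
    print(slow_frames_per_minute(sessions))
```

Summing slow frames and time across all sessions before dividing, rather than averaging per-session rates, keeps very short sessions from skewing the number.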