For one, they could consider using equivalence testing to compare models, instead of significance testing. With 10,000 eval questions I'd be surprised if their significance tests came back non-significant, and I don't see why they couldn't ask the competing models 10,000 eval questions as well.
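To make that concrete, here's a minimal paired TOST (two one-sided tests) sketch in Python. It assumes per-question 0/1 correctness scores for two models on the same questions; the data is simulated and the ±0.02 equivalence margin is a value I've picked arbitrarily for illustration, not anything from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-question correctness (0/1) for two models on the same
# 10,000 eval questions; in practice these would be the actual eval results.
n = 10_000
model_a = rng.binomial(1, 0.80, size=n)
model_b = rng.binomial(1, 0.79, size=n)

# Paired TOST: conclude equivalence if the mean accuracy difference lies
# within +/- margin, i.e. both one-sided tests reject their null.
margin = 0.02  # equivalence margin -- a judgement call, not from the paper
diff = model_a.astype(float) - model_b.astype(float)
se = diff.std(ddof=1) / np.sqrt(n)

t_lower = (diff.mean() + margin) / se      # H0: true difference <= -margin
t_upper = (diff.mean() - margin) / se      # H0: true difference >= +margin
p_lower = stats.t.sf(t_lower, df=n - 1)
p_upper = stats.t.cdf(t_upper, df=n - 1)
p_tost = max(p_lower, p_upper)             # TOST p-value

print(f"mean difference: {diff.mean():.4f}, TOST p-value: {p_tost:.4g}")
```

The point being that with this much data a plain significance test will almost always reject "no difference", whereas the TOST actually answers whether the models are close enough to matter.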
My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume they know what they're doing.
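For what I mean by that, here's a toy sketch, again in Python with simulated data. I'm assuming the clustering comes from questions being nested in topics (my guess, not something stated in the paper), and the column names are made up. A random-intercept model absorbs the within-topic correlation directly, rather than patching it up afterwards with cluster-robust standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical long-format eval results: one row per (model, question),
# with questions nested in topics that share a difficulty level.
n_topics, questions_per_topic = 50, 200
records = []
for topic in range(n_topics):
    topic_effect = rng.normal(0, 0.5)  # shared topic difficulty
    for _ in range(questions_per_topic):
        for model, base in [("A", 1.4), ("B", 1.3)]:
            p_correct = 1 / (1 + np.exp(-(base + topic_effect)))
            records.append({
                "model": model,
                "topic": topic,
                "correct": rng.binomial(1, p_correct),
            })
df = pd.DataFrame(records)

# Multilevel option: linear probability model with a random intercept per
# topic, so the topic-level variance is modelled explicitly.
ml_fit = smf.mixedlm("correct ~ C(model)", df, groups=df["topic"]).fit()
print(ml_fit.summary())

# Alternative the paper's approach suggests: plain OLS with standard errors
# clustered by topic, for comparison.
ols_fit = smf.ols("correct ~ C(model)", df).fit(
    cov_type="cluster", cov_kwds={"groups": df["topic"]}
)
print(ols_fit.summary())
```

Either route should give sensible uncertainty on the model contrast; the multilevel version just makes the clustering assumption explicit in the model rather than in the covariance estimator.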