For one, they could consider using equivalence testing to compare models, instead of significance testing. With 10,000 eval questions I'd be surprised if their significance tests came back non-significant, and I don't see why they couldn't ask the competing models 10,000 eval questions as well.
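To make that concrete, here's a minimal paired TOST (two one-sided tests) sketch in Python. It assumes per-question 0/1 correctness scores for two models on the same questions; the data is simulated and the ±0.02 equivalence margin is a value I've picked arbitrarily for illustration, not anything from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-question correctness (0/1) for two models on the same
# 10,000 eval questions; in practice these would be the actual eval results.
n = 10_000
model_a = rng.binomial(1, 0.80, size=n)
model_b = rng.binomial(1, 0.79, size=n)

# Paired TOST: conclude equivalence if the mean accuracy difference lies
# within +/- margin, i.e. both one-sided tests reject their null.
margin = 0.02  # equivalence margin -- a judgement call, not from the paper
diff = model_a.astype(float) - model_b.astype(float)
se = diff.std(ddof=1) / np.sqrt(n)

t_lower = (diff.mean() + margin) / se      # H0: true difference <= -margin
t_upper = (diff.mean() - margin) / se      # H0: true difference >= +margin
p_lower = stats.t.sf(t_lower, df=n - 1)
p_upper = stats.t.cdf(t_upper, df=n - 1)
p_tost = max(p_lower, p_upper)             # TOST p-value

print(f"mean difference: {diff.mean():.4f}, TOST p-value: {p_tost:.4g}")
```

The point being that with this much data a plain significance test will almost always reject "no difference", whereas the TOST actually answers whether the models are close enough to matter.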
My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume they know what they're doing.
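For what I mean by that, here's a toy sketch, again in Python with simulated data. I'm assuming the clustering comes from questions being nested in topics (my guess, not something stated in the paper), and the column names are made up. A random-intercept model absorbs the within-topic correlation directly, rather than patching it up afterwards with cluster-robust standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical long-format eval results: one row per (model, question),
# with questions nested in topics that share a difficulty level.
n_topics, questions_per_topic = 50, 200
records = []
for topic in range(n_topics):
    topic_effect = rng.normal(0, 0.5)  # shared topic difficulty
    for _ in range(questions_per_topic):
        for model, base in [("A", 1.4), ("B", 1.3)]:
            p_correct = 1 / (1 + np.exp(-(base + topic_effect)))
            records.append({
                "model": model,
                "topic": topic,
                "correct": rng.binomial(1, p_correct),
            })
df = pd.DataFrame(records)

# Multilevel option: linear probability model with a random intercept per
# topic, so the topic-level variance is modelled explicitly.
ml_fit = smf.mixedlm("correct ~ C(model)", df, groups=df["topic"]).fit()
print(ml_fit.summary())

# Alternative the paper's approach suggests: plain OLS with standard errors
# clustered by topic, for comparison.
ols_fit = smf.ols("correct ~ C(model)", df).fit(
    cov_type="cluster", cov_kwds={"groups": df["topic"]}
)
print(ols_fit.summary())
```

Either route should give sensible uncertainty on the model contrast; the multilevel version just makes the clustering assumption explicit in the model rather than in the covariance estimator.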