zlacker

[parent] [thread] 2 comments
1. Unlist+(OP)[view] [source] 2024-11-30 02:36:08
All things considered, I'm in favor of Anthropic's suggestions, but I'm surprised they're not recommending more (nominally) advanced statistical methods. I wonder whether that's because the more advanced methods don't offer any real benefit here, or because they don't want to overwhelm the ML community.

For one, they could consider equivalence testing for comparing models instead of significance testing. With 10000 eval questions I'd be surprised if a significance test ever came back non-significant, and I don't see why they couldn't ask the competing models the same 10000 eval questions.
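
To make that concrete, here's a rough sketch of the kind of thing I mean (the counts, the 1-point margin, and the normal-approximation TOST are my own illustrative choices, not anything from their report):

    import numpy as np
    from scipy.stats import norm

    def tost_two_proportions(k1, n1, k2, n2, margin):
        """Two one-sided tests: is |p1 - p2| within the equivalence margin?"""
        p1, p2 = k1 / n1, k2 / n2
        diff = p1 - p2
        se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        p_lower = 1 - norm.cdf((diff + margin) / se)  # H0: diff <= -margin
        p_upper = norm.cdf((diff - margin) / se)      # H0: diff >= +margin
        return max(p_lower, p_upper)                  # reject both => equivalent within margin

    # 10000 questions per model, 86.1% vs 85.7% accuracy, 1-point margin
    print(tost_two_proportions(8610, 10000, 8570, 10000, margin=0.01))

Rejecting both one-sided tests lets you say "these two models differ by less than a point", which is usually the claim people actually care about.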

My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.
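
Something like a random intercept per group of related questions, e.g. (column names and the simulated grouping are mine; I have no idea how their eval questions actually cluster):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n_clusters, per_cluster = 200, 5
    cluster = np.repeat(np.arange(n_clusters), 2 * per_cluster)
    model = np.tile([0, 1], n_clusters * per_cluster)      # 0 = baseline, 1 = candidate
    difficulty = rng.normal(0, 0.5, n_clusters)[cluster]   # shared per-cluster difficulty
    score = 0.2 * model + difficulty + rng.normal(0, 1, cluster.size)
    df = pd.DataFrame({"score": score, "model": model, "cluster": cluster})

    # Multilevel model: random intercept for each cluster of related questions
    mlm = smf.mixedlm("score ~ model", df, groups=df["cluster"]).fit()
    print(mlm.summary())

    # For comparison, the clustered-standard-errors route mentioned above
    ols = smf.ols("score ~ model", df).fit(cov_type="cluster",
                                           cov_kwds={"groups": df["cluster"]})
    print(ols.bse)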

replies(1): >>philli+4R2
2. philli+4R2[view] [source] 2024-12-01 18:50:41
>>Unlist+(OP)
I wouldn't be surprised if the benefits of doing something more advanced just aren't worth the extra complexity.
replies(1): >>Unlist+0Xi
3. Unlist+0Xi[view] [source] [discussion] 2024-12-07 18:42:10
>>philli+4R2
Well, I think it's usually more complicated than that. The over-simplified version is that there's no free lunch.

If you use a robust sandwich estimator, you're robust against non-normality and the like, but you lower the efficiency of your estimator.
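
Concretely (simulated data; the point is just that the robust standard errors typically come out wider, i.e. you pay for the weaker assumptions):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.normal(size=500)
    y = 1.0 + 0.5 * x + rng.normal(scale=1 + np.abs(x))   # heteroskedastic noise
    X = sm.add_constant(x)

    classical = sm.OLS(y, X).fit()               # classical SEs assume constant error variance
    robust = sm.OLS(y, X).fit(cov_type="HC3")    # sandwich SEs drop that assumption
    print("classical:", classical.bse)
    print("robust:   ", robust.bse)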

If you use Bayes, the con is that you have a prior, and the pro is that you have a prior plus a lot of other things.
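
For example, a conjugate Beta-Binomial comparison of two models' accuracies; the Beta(1, 1) prior and the counts are made up, but once you've paid for the prior you can read off whatever posterior quantity you actually care about:

    import numpy as np

    rng = np.random.default_rng(0)
    a0, b0 = 1, 1                  # the prior you have to choose
    k1, n1 = 8610, 10000           # model A: correct / total
    k2, n2 = 8570, 10000           # model B

    # Beta-Binomial conjugacy: posterior draws for each model's accuracy
    p1 = rng.beta(a0 + k1, b0 + n1 - k1, 100_000)
    p2 = rng.beta(a0 + k2, b0 + n2 - k2, 100_000)
    diff = p1 - p2

    print("P(A better than B)    :", (diff > 0).mean())
    print("P(|difference| < 1pt) :", (np.abs(diff) < 0.01).mean())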

And strictly speaking, these are benefits on paper, based on theory. In practice, of course, the drawback of using a newer, more advanced technique is that there may be bugs lurking in the software implementation that invalidate your results.

In practice, we also generally forget to account for the psychology of the analyst: their biases, what they're willing to double-check, and what they're willing to take for granted. There's also the issue of Bayesian methods being somewhat confirmatory, to the point that the psychological experience of doing Bayesian statistics makes one so concerned with the data generating process and the statistical model that one might forget to 'really check their data'.
