https://en.wikipedia.org/wiki/Rosenhan_experiment
This one is more positive, but it only checks that different diagnosticians get the same answer:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5980511/
And if that were applied to the "Thud" experiment, you'd have poor diagnosis with a very high kappa (interrater agreement).
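
To make the agreement-versus-accuracy point concrete, here is a rough sketch of Cohen's kappa on invented toy data. The numbers are made up for illustration and come from neither link: 10 pseudopatients faking a symptom plus 10 controls, all 20 actually healthy. The helper names `cohen_kappa` and `accuracy` are my own, not from any library.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability of a match if each rater assigned labels
    # independently at their own observed rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def accuracy(labels, truth):
    """Fraction of cases where the rater's label matches the true state."""
    return sum(l == t for l, t in zip(labels, truth)) / len(truth)


# Hypothetical data: everyone is healthy, but both raters call the
# pseudopatients "ill". They disagree on a single case (index 9).
truth = ["healthy"] * 20
rater_a = ["ill"] * 10 + ["healthy"] * 10
rater_b = ["ill"] * 9 + ["healthy"] * 11

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
print(f"accuracy of rater A = {accuracy(rater_a, truth):.2f}")
print(f"accuracy of rater B = {accuracy(rater_b, truth):.2f}")
```

With these made-up labels kappa comes out at about 0.9 while each rater is right only about half the time, so the raters agree with each other far better than they agree with reality.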