https://arxiv.org/abs/2509.04664
According to that OpenAI paper, models hallucinate in part because they are optimized against benchmarks that reward guessing over admitting uncertainty. If you make a model that refuses to answer when unsure, you will not get SOTA performance on existing benchmarks and everyone will discount your work. If you create a new benchmark that penalizes guessing, everyone will think you are just designing benchmarks that advantage your own model.
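To make the incentive concrete: under plain accuracy grading, a guess with any nonzero chance of being right has positive expected score while an abstention scores zero, so guessing always wins; only a wrong-answer penalty flips that. A toy calculation (my own numbers and function, not anything from the paper):

    # Sketch of why binary-graded benchmarks reward guessing over abstaining.
    def expected_score(p_correct, wrong_penalty, abstain_score=0.0):
        """Expected score if the model answers, and whether answering beats abstaining."""
        answer = p_correct * 1.0 + (1 - p_correct) * wrong_penalty
        return answer, answer > abstain_score

    # Accuracy-style grading: wrong answers cost nothing, so even a 10%-confident
    # guess has expected score 0.10 > 0.0, and guessing (weakly) dominates "I don't know".
    print(expected_score(0.10, wrong_penalty=0.0))   # (0.10, True)

    # Grading that penalizes wrong answers (e.g. -1): the same 10%-confident guess
    # has expected score -0.80, so abstaining is optimal below 50% confidence.
    print(expected_score(0.10, wrong_penalty=-1.0))  # (-0.80, False)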
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
The real reason is that every benchmark I've seen shows Anthropic's models with lower hallucination rates.