Part of it was the difficulty of pinpointing the actual issue: how full the drive was vs. the throughput of writes.
A lot of it, unfortunately, was organizational politics: the system spanned two teams with different reporting lines that didn't cooperate well and had poor testing practices.
The hardest bugs in my experience are those where your only source of vital information is a third party who is straight-up lying to you.