zlacker

Say you have a bug that caused 100,000 HTTP requests to hang, and you kick the node and make them all fail at once. One second later, 100,000 clients suddenly retry simultaneously, causing a huge spike in load which makes most of their requests fail. They use exponential backoff, so two seconds after that, 99,000 clients retry, causing a huge spike in load that makes most of their requests fail. Four seconds after that, 98,000 clients retry...

If you introduce a bit of randomness into the retry timing (say, multiply by 1.8~2.2 instead of a straight doubling), that thundering herd will spread itself out and be much easier to recover from.