If you introduce a bit of randomness into the retry timing (say, multiply the delay by a random factor between 1.8 and 2.2 instead of a straight doubling), that thundering herd will spread itself out and be much easier to recover from.
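A minimal sketch of this jittered backoff, assuming a hypothetical `retry_with_jitter` helper (the function name, parameters, and factor range are illustrative, not from any particular library):

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base_delay=1.0):
    """Retry `operation`, growing the delay by a random factor in
    [1.8, 2.2] each attempt instead of a straight doubling, so that
    clients that started in sync drift apart over successive retries."""
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= random.uniform(1.8, 2.2)
```

Because each client draws its own multipliers, two clients that fail at the same moment quickly end up retrying at different times.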
There's a nasty form of this where the site is offline for a while and then all the clients rush their requests in as soon as it comes back online. Because every client's retry is synchronized on the site's recovery time, the coordinated wave of requests overloads the site and can knock it right back over.