zlacker

This is why you always use exponential backoff.

replies(4): >>fathyb+q3 >>global+A3 >>stan_k+s4 >>Waterl+W4

>>brigad+(OP)
And when you're at Twitter scale, sprinkle some jitter too.

replies(1): >>oblio+cf

>>brigad+(OP)
This is why you SHOULD always use exponential backoff. ;)

replies(1): >>mmastr+z4

>>brigad+(OP)
Self-DDoS with backoff, :chef's kiss:

>>global+A3
There may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

(thanks RFC 2119)

>>brigad+(OP)
I wonder if exponential backoff should be the default behaviour for request libraries/APIs.

Their default of “just go ham on that API” feels like the same footgun of “by default this Humongous Database is wide open.”
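
For what it's worth, a backoff-by-default wrapper doesn't need much code. Here's a minimal sketch, not any real library's API: `call_with_backoff`, `max_attempts`, `base` and the injectable `sleep` are all names I made up for illustration, and retrying on every exception is a simplifying assumption.

```python
import time


def call_with_backoff(fn, max_attempts=5, base=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponentially growing waits.

    Hypothetical helper: waits base, 2*base, 4*base, ... seconds between
    attempts and re-raises the last error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the failure to the caller.
            if attempt == max_attempts - 1:
                raise
            # Double the wait after each failed attempt.
            sleep(base * 2 ** attempt)
```

Passing `sleep` in as a parameter is only there so the waits can be checked without actually sleeping; a real client would probably also want to retry only on errors it knows are retryable.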

replies(1): >>DropIn+rt

>>fathyb+q3
What do you mean?

replies(3): >>jyxent+If >>wolfga+7g >>8organ+Nh

>>oblio+cf
Adding some randomization to the exponential backoff times to avoid the thundering herd problem: https://en.wikipedia.org/wiki/Thundering_herd_problem

>>oblio+cf
Say you have a bug that caused 100,000 HTTP requests to hang, and you kick the node and make them all fail at once. One second later, 100,000 clients suddenly retry simultaneously, causing a huge spike in load which makes most of their requests fail. They use exponential backoff, so two seconds after that, 99,000 clients retry, causing a huge spike in load that makes most of their requests fail. Four seconds after that, 98,000 clients retry...

If you introduce a bit of randomness into the retry timing (say, multiply by 1.8~2.2 instead of a straight doubling), that thundering herd will spread itself out and be much easier to recover from.
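
A rough sketch of that multiplier idea, assuming a 1-second first delay. The function name `jittered_delays` and its parameters are made up for illustration; only the 1.8 to 2.2 range comes from the comment above.

```python
import random


def jittered_delays(base=1.0, attempts=5, low=1.8, high=2.2, rng=None):
    """Return successive wait times where each delay is the previous one
    multiplied by a random factor in [low, high] instead of exactly 2."""
    rng = rng or random.Random()
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        # Randomized growth factor: clients drift apart a bit more each retry.
        delay *= rng.uniform(low, high)
    return delays
```

Each client calling this gets a slightly different schedule, so the retries spread out more with every round. The first delay is still identical for everyone here; randomizing that one as well would break up the very first spike too.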

>>oblio+cf
Jitter is a little randomness in how long clients wait between retries. It ensures that you don't have a "thundering herd" all retrying at the same time. Imagine if your API used exponential backoff of [1s, 2s, 4s, 8s, ...] and a large group of requests gets a retryable error at t=0. They will all retry at exactly t=1, t=2, etc. If the group is large enough, that repeated surge of requests can knock you offline.

There's a nasty form of this where the site is offline for a bit and then all the clients rush their requests in when it comes back online. The client requests are all synchronized on the site's recovery time and end up overloading the site with their coordinated retries.
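
To make the difference concrete, here's a small simulation sketch. It assumes one common variant, often called "full jitter", where each client waits a uniformly random time between 0 and the exponential delay; `wait_time`, `peak_load` and the 0.1s bucket size are illustrative names and values, not from any particular library.

```python
import random
from collections import Counter


def wait_time(attempt, base=1.0, jitter=True, rng=random):
    """Wait before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ..."""
    cap = base * 2 ** attempt
    # Full jitter: pick anywhere in [0, cap] instead of exactly cap.
    return rng.uniform(0, cap) if jitter else cap


def peak_load(num_clients, attempt, jitter, bucket=0.1, seed=0):
    """Largest number of clients retrying within the same `bucket` seconds."""
    rng = random.Random(seed)
    buckets = Counter(
        int(wait_time(attempt, jitter=jitter, rng=rng) / bucket)
        for _ in range(num_clients)
    )
    return max(buckets.values())
```

Comparing `peak_load(1000, 2, jitter=False)` with `peak_load(1000, 2, jitter=True)` shows the effect: without jitter every client lands in the same bucket, while with jitter the same retries are spread over the whole 4-second window.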

replies(4): >>henry2+9l >>nights+iy >>doomle+GD >>bezout+3q1

>>8organ+Nh
I enjoyed your comment a lot. Interesting to think that abstract structures like a load-balanced web server have an analogue to the fundamental frequency observed in physical structures.

>>Waterl+W4
I doubt it ever would become the standard unless everyone was using third party libraries that forced it in some way, most likely opaque by default which would cause plenty of devs headaches, right?

The easiest path will always be the default for the majority of devs, and a simple fixed "timer" retry is the easiest thing to implement in pretty much every case, so that's what gets used unless something else is literally forced on them.

>>8organ+Nh
Also useful in caching mechanisms.

>>8organ+Nh
Ah, so it’s like CSMA/CA in WLAN, TIL

>>8organ+Nh
TIL about jitter