zlacker

> The people at Twitter who understood the system

But this is Scaling-101 stuff. It's not some super complex or unique system going wrong. At least according to the article, it's a classic case of bad retry logic leading to a death spiral.

https://en.wikipedia.org/wiki/Thundering_herd_problem

replies(1): >>PaulDa+p3

>>jayd16+(OP)
This has absolutely nothing to do with the thundering herd problem.

replies(2): >>m00x+U3 >>jayd16+7a

>>PaulDa+p3
+1.

Ironic that someone saying it's scaling 101 follows up the comment with a completely wrong explanation.

>>PaulDa+p3
Explain why not, if you please. If unresponsiveness causes increased traffic, which causes further unresponsiveness, is that not referred to as a thundering herd problem? Is the stated mitigation of a backoff not fully relevant here?
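
For reference, the backoff mitigation I mean would look roughly like the sketch below. It is a minimal Python illustration and not anything taken from Twitter's code or the article. The names `backoff_delays` and `call_with_retries`, the 0.5 s base, the 30 s cap, and the budget of 5 retries are all assumptions made up for the example.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, retries=5, rng=random.random):
    """Yield one wait time per retry: capped exponential growth with full jitter."""
    for attempt in range(retries):
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
        yield rng() * ceiling


def call_with_retries(request, retries=5, sleep=time.sleep, **backoff):
    """Call request(); on failure, wait and retry at most `retries` times."""
    delays = backoff_delays(retries=retries, **backoff)
    while True:
        try:
            return request()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: give up instead of hammering the server
            sleep(delay)
```

The two properties that matter are the hard retry budget, so a failing request eventually stops generating traffic, and the randomized delay, so clients that failed together do not all come back at the same instant.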

replies(1): >>inepte+Qc

>>jayd16+7a
It's the difference between one customer asking a hundred cooks for a waffle and a hundred customers asking one cook for a waffle. The former is the thundering herd (a bunch of processes trying to do something that only needs to be done once, causing resource contention) and this is akin to the latter (with the "customers" being parallel requests from the frontend).

replies(2): >>jayd16+af >>sh34r+th

>>inepte+Qc
Hmm, I was thinking it still applies in the sense that the many, many duplicate retries are hitting many of Twitter's servers, causing unnecessary duplicated load when a single successful response would satisfy the client and reduce the traffic.

In my mind, it is much closer to needlessly asking every server for the same information because the requests are most likely load balanced, but I guess it's true that I don't know the load balancing strategy. Even so, is it not more likely than not that those retries are hitting multiple servers?

replies(1): >>inepte+fh

>>jayd16+af
Sure, maybe? We (or at least I) know little about the actual problem here, and metaphors only go so far. But to my mind, "too many things trying to handle a request" gets a cool name because it is a fairly narrow and unusual problem, whereas "too many requests" goes by many names (DoS, hammering, flood, etc.) because it's depressingly common.

>>inepte+Qc
The thundering herd problem is more like this: there are a hundred cooks, one griddle, and only one of them can make an acceptable waffle for the customer.

This specific problem we're discussing, of concurrent client retries effectively launching a self-imposed DDoS attack, isn't exactly the thundering herd problem. It's clients and servers instead of threads, for one thing. But it's a good enough analogy to another type of cascading failure in concurrent computing, IMO.
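
To make the feedback loop concrete, here is a toy Python model of it. It is only a sketch under invented assumptions: a server that answers at most `capacity` requests per tick, and clients that turn every unanswered request into `retries_per_failure` requests on the next tick. The function name and all the numbers are hypothetical, and nothing here is measured from Twitter.

```python
def simulate(arrivals, capacity, retries_per_failure):
    """Return the offered load per tick for a toy retry feedback loop.

    Each tick the server answers at most `capacity` requests. Every request
    it fails to answer comes back next tick as `retries_per_failure` requests.
    """
    offered_history = []
    failed = 0
    for new_requests in arrivals:
        offered = new_requests + retries_per_failure * failed
        failed = max(0, offered - capacity)
        offered_history.append(offered)
    return offered_history


# One spike of 150 requests, then a steady 80 per tick against a capacity of 100.
arrivals = [150, 80, 80, 80, 80]
print(simulate(arrivals, capacity=100, retries_per_failure=1))
print(simulate(arrivals, capacity=100, retries_per_failure=2))
```

With one retry per failure the spike drains away (`[150, 130, 110, 90, 80]`). With two retries per failure the same spike keeps growing (`[150, 180, 240, 360, 600]`), even though normal traffic is below capacity. That is the self-imposed flood in miniature.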