zlacker

[parent] [thread] 4 comments
1. sgammo+(OP)[view] [source] 2023-07-01 20:43:36
wow. how would one even fix this without deliberate downtime? you'd have to deploy a fix and hope the new frontend propagates through the CDNs fast enough to reduce pressure, right?
replies(4): >>minima+F >>whatev+R2 >>bornfr+c6 >>avl999+ha
2. minima+F[view] [source] 2023-07-01 20:47:54
>>sgammo+(OP)
At minimum, you revert the prod commit/deploy that caused the issue. But that would likely mean reverting the recent policies, which would make Elon look weak, so he'd never support it.
3. whatev+R2[view] [source] 2023-07-01 21:00:25
>>sgammo+(OP)
Yeah, a frontend retry DDoS is not a great situation to get into. I've tripped it in a test env before with a websocket app (erroneous retries caused certain clients to open the WS over and over and break the server).
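A minimal sketch of the reconnect discipline that avoids that loop, assuming a plain browser WebSocket client (the cap, delays, and function name are illustrative, not from the actual app):

    // Reconnect with capped, jittered exponential backoff instead of a tight loop.
    // All names and thresholds here are illustrative assumptions.
    function connectWithBackoff(url: string, maxAttempts = 8): void {
      let attempt = 0;

      const open = () => {
        const ws = new WebSocket(url);
        ws.onopen = () => { attempt = 0; };   // reset after a successful connect
        ws.onclose = () => {
          if (attempt >= maxAttempts) return; // give up instead of retrying forever
          // Exponential backoff capped at 30s, with jitter so clients don't reconnect in sync.
          const delay = Math.min(30_000, 500 * 2 ** attempt) * Math.random();
          attempt++;
          setTimeout(open, delay);
        };
      };

      open();
    }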
4. bornfr+c6[view] [source] 2023-07-01 21:20:46
>>sgammo+(OP)
You first remove rate limits, then implement and release exponential backoff on the frontend, then apply rate limits again (on a small segment of users first, then more). No biggie, you just need to be very careful. And the boss needs to chill for that whole time, which is unlikely to happen.
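Roughly this kind of thing on the client, as a sketch (the wrapper name, status codes, and delays are my assumptions, not Twitter's actual frontend code):

    // Fetch wrapper with exponential backoff + full jitter for throttled/transient errors.
    // Everything here (names, limits, which statuses retry) is illustrative.
    async function fetchWithBackoff(
      url: string,
      init: RequestInit = {},
      maxRetries = 5,
      baseDelayMs = 500,
    ): Promise<Response> {
      for (let attempt = 0; ; attempt++) {
        const res = await fetch(url, init);
        // Only retry on throttling / transient server errors, and only a bounded number of times.
        if ((res.status !== 429 && res.status !== 503) || attempt >= maxRetries) return res;
        // Full jitter spreads retries out instead of sending them in synchronized waves.
        const delay = Math.random() * baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }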
5. avl999+ha[view] [source] 2023-07-01 21:46:51
>>sgammo+(OP)
First thing I would try is seeing if the frontend has a different retry strategy for a different status code (say 503). If so, I'd change the status returned for throttling to that (503).

Barring that, turning off server-side throttling, or at least making it less aggressive, seems the most reasonable way to slow the retry storm.
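As a sketch of what "less aggressive" could look like on the server (an Express-style middleware with a generous window limit plus a Retry-After hint; the framework, limits, and names are my assumptions, not what Twitter actually runs):

    // Illustrative throttling middleware: high limit, and a Retry-After hint on 503
    // so well-behaved clients wait instead of hammering retries.
    import express, { RequestHandler } from "express";

    const WINDOW_MS = 60_000;
    const LIMIT = 600; // requests per window per client (illustrative number)
    const hits = new Map<string, { count: number; windowStart: number }>();

    const throttle: RequestHandler = (req, res, next) => {
      const key = req.ip ?? "unknown";
      const now = Date.now();
      const entry = hits.get(key);
      if (!entry || now - entry.windowStart > WINDOW_MS) {
        hits.set(key, { count: 1, windowStart: now });
        return next();
      }
      entry.count += 1;
      if (entry.count <= LIMIT) return next();
      // Tell clients when to come back rather than returning an immediate hard failure.
      const retryAfterSec = Math.ceil((entry.windowStart + WINDOW_MS - now) / 1000);
      res.set("Retry-After", String(retryAfterSec));
      res.status(503).send("Temporarily throttled, retry later");
    };

    const app = express();
    app.use(throttle);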
