What happens when you need to retry the messages from a few minutes in the past because there was a transient failure in a downstream dependency?

replies(1): >>really+Ss

>>pbourk+(OP)
S3 has gone down twice in 5 years. Since we're transferring files, you just push everything in the next window. Retrying is trivial on the agent side and is already accounted for in the consumer.

replies(1): >>pbourk+2W

>>really+Ss
I wasn’t talking about the reliability of S3, but of your own systems.

Say the outage results in a few million messages that need to be retried. Some subset of those few million will never succeed (aka they are “poison pills”). At the same time, new messages are arriving.

In your system, how do you maintain QoS for incoming messages and allow the few million retries to resolve, while also preventing the poison pills from blocking the queue? How do you implement exponential backoff, which is the standard approach for this?

SQS gives you some simple yet powerful primitives, such as the visibility timeout setting, to address this scenario in a straightforward manner.
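
To make that concrete, here is a rough sketch of exponential backoff built on the visibility timeout. It is only an illustration under some assumptions: `sqs` is a boto3-style SQS client, the message was received with the `ApproximateReceiveCount` attribute requested, and `backoff_seconds`, `handle`, `process`, and the 30s/900s values are hypothetical names and numbers I picked for the example.

```python
MAX_VISIBILITY_TIMEOUT = 43200  # SQS upper bound for a visibility timeout: 12 hours


def backoff_seconds(receive_count, base=30, cap=900):
    """Delay before the next delivery attempt; doubles with each receive (1-based)."""
    exponent = max(receive_count - 1, 0)
    delay = min(cap, base * (2 ** exponent))
    return min(delay, MAX_VISIBILITY_TIMEOUT)


def handle(sqs, queue_url, message, process):
    """Process one received message.

    On success the message is deleted. On failure it stays in the queue but is
    hidden for a growing interval, so retries do not crowd out new messages.
    Returns True if processing succeeded, False otherwise.
    """
    try:
        process(message["Body"])
    except Exception:
        # ApproximateReceiveCount arrives as a string when it has been requested.
        attributes = message.get("Attributes", {})
        attempts = int(attributes.get("ApproximateReceiveCount", "1"))
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
            VisibilityTimeout=backoff_seconds(attempts),
        )
        return False
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```

The poison pills are usually dealt with separately: a redrive policy with `maxReceiveCount` on the queue moves a message to a dead-letter queue after that many failed receives, so it stops coming back.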