On SQS - zlacker

>>mpweih+(OP)
A long time ago, as a new-ish developer, I was building a system that needed to take inputs, then run "pass/fail/wait and try again later" until timeout or completion. This wasn't mission-critical stuff, mind you, so a lost message would annoy someone but not cause any actual harm.

As I was figuring out how to set up a datastore, query it for running workflows and all that jazz, I happened upon an interesting SQS feature: Post with Delay.

And so, the system has no database. Instead, when new work arrives, it posts the details of the work to be done to SQS. All hosts in the fleet are polling SQS for messages. When they receive one, they do the checks, and if the process isn't complete they repost the message with a 5-minute delay. In 5 minutes, a host in the fleet will receive the message and try again. The process continues as long as it needs to.
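
A minimal sketch of that loop, not the actual system: it assumes the work is carried as JSON with a hypothetical "deadline" field, that `check` is a hypothetical callable returning "pass", "fail", or "wait", and that `sqs` is a boto3-style SQS client (anything with a compatible `send_message` method works).

```python
import json
import time

RETRY_DELAY_SECONDS = 300  # the 5-minute delay described above


def handle_message(sqs, queue_url, body, check, now=time.time):
    """Process one received message; repost it with a delay if not finished.

    sqs:   boto3-style client exposing send_message(QueueUrl, MessageBody, DelaySeconds)
    check: hypothetical callable returning "pass", "fail", or "wait"
    """
    work = json.loads(body)

    # Give up once the (assumed) deadline carried in the message has passed.
    if now() >= work["deadline"]:
        return "timeout"

    result = check(work)
    if result == "wait":
        # No database: the message itself is the state, so send it back
        # to the queue and let some host pick it up after the delay.
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps(work),
            DelaySeconds=RETRY_DELAY_SECONDS,
        )
    return result
```

SQS caps `DelaySeconds` at 900 (15 minutes), so a 5-minute retry fits within a single delayed send.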

Looking back, part of me now is horrified at this design. But: that system now has thousands of users and continues to scale really well. Data loss is very rare. Costs are low. No datastore to manage. SQS is just really darned neat because it can do things like that.

>>mabbo+kK
Why are you horrified at a design that works well, scales well, is resilient enough for its use case, and is low cost? The whole point of an engineering design process is to find designs that meet these types of requirements. Honestly, this sounds like the perfect solution for what you're trying to accomplish.

>>bkanbe+ZT
Because looking at it now, something feels deeply wrong about it, haha. Honestly, if I'd used a database it probably would have opened up a few more options for future work.

I can't do any analytics about how long things typically take, who my biggest users are, etc. I mean, I could, but I'd have to add a datastore for that.

Adding new details to the parameters of the system requires very careful work to make all changes backwards and forwards compatible so that mid-deployment we don't have messages being pushed that old hosts can't process or new hosts seeing old messages they don't understand. That's good practice generally, but it's super mission critical to get right this way.
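
One way to get that compatibility, sketched under assumptions: messages are JSON objects, the field names and defaults below are hypothetical, and every host reposts the full decoded object. New hosts fill in defaults for fields that old messages lack, and old hosts carry unknown fields through untouched.

```python
import json

# Hypothetical fields added after launch, with the defaults a host
# should assume when an older message does not carry them.
KNOWN_DEFAULTS = {"attempts": 0, "priority": "normal"}


def decode(body):
    """Parse a message, tolerating both missing and unknown fields."""
    raw = json.loads(body)
    # Defaults first, then the message: missing fields get defaults,
    # and fields this host has never heard of are kept as-is.
    return {**KNOWN_DEFAULTS, **raw}


def encode(work):
    """Serialize the whole object so unknown fields survive a repost."""
    return json.dumps(work)
```

The important part is that `encode` writes back everything `decode` read, so an old host reposting a newer message mid-deployment does not strip the fields it doesn't understand.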

Also, a dropped message is invisible. SQS has redrive, sure, and that helps, but if there were a bug, an edge case, where the system stopped processing something and quietly failed, that processing would just stop and we'd never know. If the entries were in a datastore, we'd see "Hey, this one didn't finish and I haven't worked on it lately, what gives?".
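
For illustration, the kind of check a datastore would make possible. This is a sketch with an in-memory SQLite database; the `workflows` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3


def find_stalled(conn, now, max_idle_seconds):
    """Return ids of workflows still marked running but untouched for too long."""
    rows = conn.execute(
        "SELECT id FROM workflows "
        "WHERE status = 'running' AND last_touched < ? "
        "ORDER BY id",
        (now - max_idle_seconds,),
    )
    return [row[0] for row in rows]


# Hypothetical schema: one row per workflow, updated on every attempt.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflows (id TEXT PRIMARY KEY, status TEXT, last_touched REAL)"
)
conn.executemany(
    "INSERT INTO workflows VALUES (?, ?, ?)",
    [
        ("a", "running", 900.0),  # touched recently
        ("b", "running", 100.0),  # running, but nobody has touched it in a while
        ("c", "done", 50.0),      # old, but finished, so not a problem
    ],
)

stalled = find_stalled(conn, now=1000.0, max_idle_seconds=600)
```

With the queue-only design there is no equivalent query, because a message that disappears leaves nothing behind to look at.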