zlacker

From the operations side, I feel like SQS is a very sharp pair of scissors. In the hands of a good tailor, they'll make amazing things. On the other hand, most people use them to cut things they probably shouldn't, or just flat out sprint around with them on a wet pool deck with their shoes untied.

A non-comprehensive list of ways I've seen my developers shoot themselves in the foot:

* Giant try-catch block around the message handling code to requeue messages that threw an exception. They neglected to add any accounting, so some messages would just never process. No one noticed until they saw the queue size never dropped below a certain amount during debugging.

* Queue behavior is highly dependent on configuration. Bad queue configurations result in dropped messages. Queueing systems provide few features to detect and alert on these failures (it's really not their job), but building a system to track the integrity of the business process across queues is deemed too onerous.

* The built-in observability is generally incomplete. I haven't seen a lot of great instrumentation libraries for SQS like there are for HTTP, meaning that observability is pushed on to the developer. They typically ignore that requirement because PMs rarely care until they realize we're unable to respond to incidents effectively.

* Most people vastly overestimate their scale. The number of applications I've seen built on SQS "because scale" that end up taking less than 100 QPS globally is significant. Anecdotally, I would say the majority of queue-based apps I have seen could have solved their scaling issues within HTTP.

* Many people want to treat queued messages like time-delayed HTTP requests. They are not; the semantics and design are totally different. I have seen people marshal requests to Protobuf, use that as the body of a message, have another service read and process the request, and then write another message to a queue that the first app reads back. It's basically gRPC over queues, except that it solves none of the problems gRPC does and creates a lot of new ones. Just one example: how do you canary when you can't guarantee that the version of the app that sends the request will get the response to that request?
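
To make the first bullet concrete, here is a rough sketch of what "accounting" around a requeue could look like. It is not SQS code: the queue is an in-memory `deque`, and `Message`, `Stats`, `drain`, and `MAX_ATTEMPTS` are all names made up for illustration. The idea is only that every failure is counted, and a message that keeps failing gets parked somewhere visible instead of circulating forever. (SQS itself offers dead-letter queues with a receive-count limit for roughly this purpose.)

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # hypothetical limit; pick whatever suits the workload


@dataclass
class Message:
    body: str
    attempts: int = 0  # number of times a handler has tried this message


@dataclass
class Stats:
    processed: int = 0
    retried: int = 0
    dead_lettered: int = 0


def drain(queue, dead_letters, handler, stats):
    """Process every message, requeueing a failure at most MAX_ATTEMPTS - 1 times."""
    while queue:
        msg = queue.popleft()
        msg.attempts += 1
        try:
            handler(msg.body)
            stats.processed += 1
        except Exception:
            if msg.attempts >= MAX_ATTEMPTS:
                # Park the message where someone can see it and alert on it.
                dead_letters.append(msg)
                stats.dead_lettered += 1
            else:
                # Requeue, but keep the attempt count attached to the message.
                queue.append(msg)
                stats.retried += 1


def example_handler(body):
    """Stand-in handler that always fails on one specific payload."""
    if body == "poison":
        raise ValueError("cannot process this message")


if __name__ == "__main__":
    work = deque(Message(b) for b in ["ok", "poison", "also ok"])
    parked = []
    totals = Stats()
    drain(work, parked, example_handler, totals)
    print(totals, [m.body for m in parked])
```

With counters like these exported as metrics, a queue that never drops below a certain size shows up as a growing `retried` or `dead_lettered` number rather than something noticed by accident during debugging.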

I think SQS is an amazing tool in the hands of people that know when to use it, and how to use it. But my experience has been that most people don't, and the ecosystem to make it available to people who aren't experts just doesn't exist yet.

replies(1): >>cle+X1

>>currys+(OP)
I agree with most of this. If you have non-trivial message handling logic in a production system, you probably shouldn't use SQS directly to drive your work. Your SQS handling logic should be simple and reliable. In most cases, if the handling logic is complex, long-running, or needs operational visibility (logging, monitoring, etc.), I'd write the message handler itself to just kick off a workflow via Step Functions or some other workflow system. You'll pay for that in initial development costs, because it is more complicated (you need to write Lambda handlers, wire them up with CloudFormation, etc.), but the tradeoff is that it gives you a central place to look at each unit of work, instead of having your artifacts scattered around various logs (if they exist at all).
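
As a rough illustration of "the handler just kicks off a workflow", here is a minimal sketch. It assumes the message is a dict with `MessageId` and `Body` keys (the shape boto3's `receive_message` returns) and that the body is JSON. `handle_message` and the `start_workflow` callable are hypothetical names; in a real system `start_workflow` might wrap the Step Functions `start_execution` call, but that wiring is left out here.

```python
import json
from typing import Callable

# Hypothetical signature: (execution_name, json_input) -> None.
StartWorkflow = Callable[[str, str], None]


def handle_message(message: dict, start_workflow: StartWorkflow) -> str:
    """Validate one message and hand it to a workflow system; do no real work here."""
    message_id = message["MessageId"]
    # Fail fast on malformed input so bad messages surface at the queue boundary.
    payload = json.loads(message["Body"])
    # Reusing the message ID as the execution name ties the unit of work
    # back to the message that triggered it.
    start_workflow(message_id, json.dumps(payload))
    return message_id


if __name__ == "__main__":
    started = []
    handle_message(
        {"MessageId": "m-1", "Body": '{"order": 42}'},
        lambda name, data: started.append((name, data)),
    )
    print(started)
```

The handler stays small enough to reason about, and everything complex or long-running lives in the workflow, where each unit of work can be inspected in one place.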

The takeaway for me is: distributed systems are hard. If you have distributed workers, you have entered into a vastly more complex realm. SQS gives you some tools to work successfully in that environment, but it doesn't (and can't) get rid of that complexity. Most of the problems I've seen relate to engineers not understanding the fundamental complexity of coordinating distributed work. Your choice of tech stack for your queues isn't going to make a big difference if you don't understand what you're fundamentally dealing with.