A non-comprehensive list of ways I've seen my developers shoot themselves in the foot:
* Giant try-catch block around the message handling code to requeue messages that threw an exception. They neglected to add any accounting, so some messages would just be requeued forever and never get processed. No one noticed until, during debugging, they saw that the queue size never dropped below a certain amount.
* Queue behavior is highly dependent on configuration. Bad queue configurations result in dropped messages. Queueing systems provide few features to detect and alert on these failures (it's really not their job), but building a system to track the integrity of the business process across queues is deemed too onerous.
* The built-in observability is generally incomplete. I haven't seen a lot of great instrumentation libraries for SQS like there are for HTTP, meaning that observability is pushed onto the developer. They typically ignore that requirement because PMs rarely care until they realize we're unable to respond to incidents effectively.
* Most people vastly overestimate their scale. The number of applications I've seen built on SQS "because scale" that end up taking less than 100 QPS globally is significant. Anecdotally, I would say the majority of queue-based apps I have seen could have solved their scaling issues within HTTP.
* Many people want to treat queued messages like time-delayed HTTP requests. They are not; the semantics and design are totally different. I have seen people marshal requests to Protobuf, use that as the body of a message, have another service read and process the request, and then write another message to a queue that the first app reads back. It's basically gRPC over queues, except that it solves none of the problems gRPC does and creates a lot of new ones. Just one example: how do you canary when you can't guarantee that the version of the app that sends the request will get the response to that request?
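To make the first bullet concrete, the missing accounting can be quite small. Here is a rough sketch in Python that uses an in-memory deque as a stand-in for the real queue. The names `Message`, `drain`, and `MAX_ATTEMPTS` are made up for illustration and are not from any library. On SQS itself you would more likely configure a redrive policy with `maxReceiveCount` and a dead-letter queue than hand-roll this, but the idea is the same: count attempts, cap them, and park the failures somewhere you can see and alert on.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # hypothetical cap; tune for your workload


@dataclass
class Message:
    body: str
    attempts: int = 0  # how many times the handler has been tried on this message


def drain(queue, dead_letters, handler, max_attempts=MAX_ATTEMPTS):
    """Process messages until the queue is empty and return counters for metrics."""
    stats = {"processed": 0, "requeued": 0, "dead_lettered": 0}
    while queue:
        msg = queue.popleft()
        msg.attempts += 1
        try:
            handler(msg.body)
            stats["processed"] += 1
        except Exception:
            if msg.attempts >= max_attempts:
                # Stop retrying: park the message where someone can inspect it.
                dead_letters.append(msg)
                stats["dead_lettered"] += 1
            else:
                # Retry later, but the attempt count travels with the message.
                queue.append(msg)
                stats["requeued"] += 1
    return stats


def handler(body):
    # Stand-in for real business logic; "poison" simulates a message that always fails.
    if body == "poison":
        raise ValueError("cannot process")


queue = deque([Message("ok"), Message("poison"), Message("also ok")])
dead = []
stats = drain(queue, dead, handler)
```

Without the `attempts` check, the "poison" message would cycle forever, which is exactly the queue that never drains. With it, the counters in `stats` give you something to put on a dashboard, and a non-empty `dead` list is something you can alert on.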
I think SQS is an amazing tool in the hands of people that know when to use it, and how to use it. But my experience has been that most people don't, and the ecosystem to make it available to people who aren't experts just doesn't exist yet.
The takeaway for me is: distributed systems are hard. If you have distributed workers, you have entered into a vastly more complex realm. SQS gives you some tools to work successfully in that environment, but it doesn't (and can't) get rid of that complexity. Most of the problems I've seen relate to engineers not understanding the fundamental complexity of coordinating distributed work. Your choice of tech stack for your queues isn't going to make a big difference if you don't understand what you're fundamentally dealing with.