On SQS - zlacker

>>mpweih+(OP)
I've worked with SQS at volumes in thousands of messages per second with varied (non-tiny) payload sizes.

SQS is a very simple service, which makes it fairly reliable, though part of the reason for the reliability is that the API's guarantees are weak. And it can be economical, but I've had to build a lot of non-trivial logic in order to interact with SQS robustly, performantly, and efficiently, especially around using the {Send,Receive,Delete}MessageBatch operations to reduce costs.

With the caveat that I think my use case has been quite different from what's discussed in this article, here are some of the problems I've encountered:

- Message sizes are limited, but in a convoluted way: SendMessageBatch has a 256KiB limit on the request size. Message values have a limited character set allowed, so you need to base64-encode any binary data. This also means that there's not exactly a max message size; you can batch up to 10 messages per SendMessageBatch but not in excess of 256KiB for the whole request.

- If you want to send more than 256KiBx3/4-(some padding) or around 180KiB of data for any single message, you need to put that data somewhere else and pass a pointer to it in the actual SQS message.

- SQS does routinely have temporary (edit: partial) failures that generally last for a few hours at a time. ReceiveMessageBatch may return no messages (or less than the max of 10) even if the queue has millions of messages waiting to be delivered; SQS knows it has them somewhere but it can't find them when you ask. And DeleteMessageBatch may fail for some of the messages passed while succeeding for others; it will sometimes fail repeatedly to delete those messages for an extended period.

- The SDKs provided by AWS (for either Java or Go) don't help you handle any of these things well; they just provide a window into the SQS API, and leave it to the user to figure all the details out.