zlacker

On SQS

submitted by mpweih+(OP) on 2019-05-27 06:28:44 | 343 points 225 comments
[view article] [source]

NOTE: showing posts with links only
12. etaioi+29[view] [source] 2019-05-27 08:34:40
>>mpweih+(OP)
I really wish SQS had reliably lower latency, like Redis, and also supported priority levels. (Also like Redis, now, with sorted sets and the https://redis.io/commands/bzpopmax command.)

Has anyone measured the performance of Redis on large sorted sets, say millions of items? Hoping that it's still in single-digit milliseconds at that size... And can sustain say 1000 QPS...
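
For anyone curious, the sorted-set-as-priority-queue pattern looks roughly like this (a minimal sketch with redis-py; the key name "jobs" and the payloads are just placeholders):

    import redis

    r = redis.Redis()

    # Producer: the score doubles as the priority.
    r.zadd("jobs", {"send-welcome-email": 5, "rebuild-index": 1})

    # Consumer: BZPOPMAX blocks until the highest-priority item is available.
    key, member, score = r.bzpopmax("jobs", timeout=0)
    print(member, score)  # b'send-welcome-email' 5.0

Sorted set writes and pops are O(log N), so millions of members shouldn't wreck latency on their own, but that's worth benchmarking rather than taking on faith.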

◧◩◪
16. andrew+xa[view] [source] [discussion] 2019-05-27 08:56:29
>>varela+oa
Only since December 2018 and you need to pay extra...

https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-sq...

22. redact+Gb[view] [source] 2019-05-27 09:13:35
>>mpweih+(OP)
Author of https://node-ts.github.io/bus/ here. SQS is definitely one of my favourite message queues. The ability to have an HA managed solution without having to worry about persistence, scaling or connections is huge.

Most of the complaints are inherent to message-based systems in general. At-least-once receives, out-of-order receives, pretty standard fare that can be handled by applying well-established patterns.

My only request would be to please increase the limits on message visibility timeouts! Often I want to delay-send a message for receipt in 30 days. SQS forces me to cook up some weird delete-and-resend recipe, or make this a responsibility of a data store. It'd be really nice to do away with batch/cron jobs and deal more with delayed queue events.
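
The recipe ends up looking roughly like the sketch below (boto3; the queue URL, the "due_at" field and handle_job are placeholders for illustration). You keep bouncing the message back onto the queue with the maximum 15-minute DelaySeconds until its due time finally arrives:

    import json, time
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/delayed-jobs"

    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        remaining = body["due_at"] - time.time()
        if remaining > 0:
            # Not due yet: resend with the largest delay SQS allows (900s)...
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=msg["Body"],
                DelaySeconds=min(int(remaining), 900),
            )
        else:
            handle_job(body)  # finally due: do the real work
        # ...and delete the copy we just received either way.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])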

◧◩◪
27. redact+Nc[view] [source] [discussion] 2019-05-27 09:29:01
>>plasma+Ub
You're absolutely right, in fact I have a whole package that is just that https://node-ts.github.io/bus/packages/bus-workflow/.

The problem is this. Let's say that I want to trigger a step in a "free trial" saga that sends an email to the customer 10 days after they sign up nudging them to get a paid account. If I can delay send this message for 10 days then it's easy.

However, because SQS has a much shorter visibility timeout, I have to find a much more roundabout way of triggering that action.

◧◩◪
31. emmela+Nd[view] [source] [discussion] 2019-05-27 09:42:18
>>archgo+H7
FWIW there is a queue based on maildir which has implementations in Perl, Python, C and Java and probably more.

The Perl implementation was the original AFAIK.

http://search.cpan.org/dist/Directory-Queue/

◧◩◪◨
37. tybit+gh[view] [source] [discussion] 2019-05-27 10:28:10
>>cies+Ff
If your message consumer isn’t idempotent, then no MQ can help you. Exactly-once delivery is impossible; the closest you can get is at-least-once delivery combined with an idempotent consumer.

https://bravenewgeek.com/tag/amazon-sqs/
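
In practice that just means deduplicating on a message id before doing the work, e.g. (a rough sketch; sqlite3 stands in for whatever store you actually use, and do_work is a placeholder):

    import sqlite3

    db = sqlite3.connect("processed.db")
    db.execute("CREATE TABLE IF NOT EXISTS processed (msg_id TEXT PRIMARY KEY)")

    def handle(msg_id, body):
        # The primary key makes a second delivery of the same message a no-op.
        cur = db.execute(
            "INSERT OR IGNORE INTO processed (msg_id) VALUES (?)", (msg_id,)
        )
        if cur.rowcount == 0:
            return  # duplicate delivery: safe to ack and move on
        do_work(body)  # the actual side effect
        db.commit()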

◧◩
42. hexene+yj[view] [source] [discussion] 2019-05-27 10:55:46
>>cyberf+2b
Possibly this: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQS...
◧◩
46. LunaSe+nl[view] [source] [discussion] 2019-05-27 11:18:53
>>andrew+7a
Do you have an example of such a queue somewhere?

I think I have a rough idea about how it works because I implemented something similar about four years ago in PostgreSQL but kept getting locking issues I couldn't get out of:

- https://stackoverflow.com/questions/33467813/concurrent-craw...

- https://stackoverflow.com/questions/29807033/postgresql-conc...

Also, what kind of queue size / concurrency on the polling side are you able to sustain on your current hardware?

◧◩◪
52. soroko+9m[view] [source] [discussion] 2019-05-27 11:28:30
>>LunaSe+nl
Here https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock... is a good discussion.
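
The core of the pattern is a single claim query using FOR UPDATE SKIP LOCKED, so competing workers never block on each other's rows. Roughly (a sketch with psycopg2; the "jobs" table and connection string are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def claim_one_job():
        # Runs in its own transaction; commits (or rolls back) on exit.
        with conn, conn.cursor() as cur:
            cur.execute("""
                DELETE FROM jobs
                WHERE id = (
                    SELECT id FROM jobs
                    ORDER BY id
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                )
                RETURNING id, payload
            """)
            return cur.fetchone()  # None when the queue is empty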
◧◩
88. blr246+OF[view] [source] [discussion] 2019-05-27 14:33:22
>>hexene+ci
Kinesis is not necessarily well-suited to fan-out. It is very well suited to fan-in (single consumer, multiple producers).

Each shard allows at most 5 GetRecords operations per second. If you want to fan out to many consumers, you will hit that limit quickly and have to accept a significant latency/throughput tradeoff to make it work: with, say, 10 consumers polling one shard, each consumer can only poll about once every 2 seconds.

For API limits, see: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_...

◧◩
91. owenma+vG[view] [source] [discussion] 2019-05-27 14:40:07
>>Kiro+Rm
https://ferd.ca/queues-don-t-fix-overload.html explains it better than pretty much anything I’ve read on the topic.
◧◩
106. gizzlo+dJ[view] [source] [discussion] 2019-05-27 15:07:21
>>polski+BF
Guess it depends on the definition of "queue". Potentials:

  - https://nsq.io/
  - https://nats.io/
◧◩
122. appwiz+MO[view] [source] [discussion] 2019-05-27 15:57:56
>>oceanb+Ql
AWS AppSync (https://aws.amazon.com/appsync/) is a better fit for the chat room use case because of server pushed events over a persistent connection (WebSockets). Launch the Chat sample in the console to try it out.
◧◩◪◨
124. archgo+nQ[view] [source] [discussion] 2019-05-27 16:07:54
>>emmela+Nd
Interesting; looks like Directory::Queue uses directories, rather than file locks (man 2 flock), to lock the queue messages. This might actually work, since mkdir returns an error if you attempt to create a directory that already exists. The implementation seems to handle most of the obvious failure cases, or at least it tries to.

https://metacpan.org/release/Directory-Queue/source/lib/Dire...

So how does one lock a message in s3? Does s3 have a "createIfDoesNotExistOrError"? I'm still having difficulty understanding how the proposed system avoids race conditions.
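
To make the mkdir idea from the first paragraph concrete, the general trick is just this (a toy illustration, not Directory::Queue's actual code; paths are made up):

    import os

    def try_lock(msg_dir):
        try:
            # mkdir is atomic: it fails if the directory already exists,
            # so only one consumer can claim the message.
            os.mkdir(os.path.join(msg_dir, "locked"))
            return True
        except FileExistsError:
            return False  # another consumer got there first

    def unlock(msg_dir):
        os.rmdir(os.path.join(msg_dir, "locked"))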

◧◩◪◨
134. chucks+lV[view] [source] [discussion] 2019-05-27 16:46:30
>>tjanks+RL
Intentionally so. It's not a deficiency or a footgun, it's a design decision to be aware of. Redis is an in-memory database first.

You can configure Redis for durability. The docs[1] page for persistence has a good examination of the pros and cons.

[1]: https://redis.io/topics/persistence
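
For example, turning on the append-only file trades some throughput for durability (a sketch via redis-py; you can equally set these in redis.conf):

    import redis

    r = redis.Redis()
    r.config_set("appendonly", "yes")        # enable the AOF
    r.config_set("appendfsync", "everysec")  # fsync roughly once per second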

◧◩◪◨
143. kwindl+401[view] [source] [discussion] 2019-05-27 17:20:55
>>andrew+JI
Thanks for posting that code!

Definitely similar experience here. We handle ~10 million messages a day in a pubsub system quite similar in spirit to the above, running on AWS Aurora MySQL.

Our system isn't a queue. We track a little bit of short-lived state for groups of clients, and do low-latency, in-order message delivery between clients. But a lot of the architecture concerns are the same as with your queue implementation.

We switched over to our own pubsub code, implemented the simplest way we could think of, on top of vanilla SQL, after running for several months on a well-regarded SaaS NoSQL provider. After it became clear that both reliability and scaling were issues, we built several prototypes on top of other infrastructure offerings that looked promising.

We didn't want to run any infrastructure ourselves, and didn't want to write this "low-level" message delivery code. But, in the end, we felt that we could achieve better system observability, benchmarking, and modeling, with much less work, using SQL to solve our problems.

For us, the arguments are pretty much Dan McKinley's from the Choose Boring Technology paper.[0]

It's definitely been the right decision. We've had very few issues with this part of our codebase. Far, far fewer than we had before, when we were trying to trace down failures in code we didn't write ourselves on hardware that we had no visibility into at all. This has turned out to be a counter-data point to my learned aversion to writing any code if somebody else has already written and debugged code that I can use.

One caveat is that I've built three or four pubsub-ish systems over the course of my career, and built lots and lots of stuff on top of SQL databases. If I had 20 years of experience using specific NoSQL systems to solve similar problems, those would probably qualify as "boring" technology, to me, and SQL would probably seem exotic and full of weird corner cases. :-)

[0] - https://mcfunley.com/choose-boring-technology

◧◩◪◨
145. mmarti+J01[view] [source] [discussion] 2019-05-27 17:26:28
>>haolez+LT
The learning curve I've experienced with Elixir, after working previously with managed services, is in handling the above-mentioned tasks while managing state in the BEAM cluster. Patching and scaling are straightforward if you can restart instances and assume they can pick up what was interrupted before, but hot-reloading or managing state between nodes in a rolling update will give you overhead as you get set up.

What it works really great for, if you don't want to make the up-front investment in managing a stateful cluster, is multi-step or fan-out processing. BEAM/OTP really shines when it's helpful to have individual processing steps coordinated but isolated, and where it's not an issue if a job gets cancelled and rerun (interrupted by a node restart or OOM).

This is a great resource: https://www.erlang-in-anger.com/

◧◩◪
169. floatb+Ps1[view] [source] [discussion] 2019-05-27 21:50:03
>>LunaSe+nl
There are a few libraries implementing this:

https://github.com/mbuhot/ecto_job for Elixir

https://github.com/timgit/pg-boss for Node.js

◧◩◪◨⬒
190. emmela+xK1[view] [source] [discussion] 2019-05-28 01:59:00
>>archgo+nQ
maildir is specified at https://cr.yp.to/proto/maildir.html and billions of messages use it each year. So that's pretty safe.

I can't vouch for the queueing code but I believe it's quite robust too.

◧◩◪◨⬒
199. polski+742[view] [source] [discussion] 2019-05-28 06:55:15
>>maniga+XD1
Is this what you meant?

https://github.com/nats-io/nats-streaming-server/issues/168

◧◩◪◨⬒⬓
220. maniga+6j3[view] [source] [discussion] 2019-05-28 17:53:05
>>polski+SU1
NATS Streaming isn't just a persistence layer on top of NATS. It's an entirely different system that basically acts as a client to NATS and records the messages it sees. Basically, think of how you would design a persistent queue on top of the ephemeral NATS pub/sub, and that's what NATS Streaming is.

Here's a good post (and series) about distributed logs and NATS design issues: https://bravenewgeek.com/building-a-distributed-log-from-scr...

◧◩◪◨⬒⬓
221. maniga+7j3[view] [source] [discussion] 2019-05-28 17:53:17
>>tuxych+0N1
See the other comment but this is a good post/series: https://bravenewgeek.com/building-a-distributed-log-from-scr...