Has anyone measured the performance of Redis on large sorted sets, say millions of items? Hoping that it's still in single digit milliseconds at that size... And can sustain say 1000QPS...
https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-sq...
Most of the complaints are native to message based systems in general. At least once message receives, out of order receives, pretty standard faire that can be handled by applying well established patterns.
My only request would be to please increase the limits of message visibility timeouts! Often I want to delay send a message for receipt in 30 days. SQS forces me to cook some weird delete and resend recipe, or make this a responsibility of a data store. It's be really nice to do away with batch/Cron jobs and deal more with delayed queue events.
The problem is this. Let's say that I want to trigger a step in a "free trial" saga that sends an email to the customer 10 days after they sign up nudging them to get a paid account. If I can delay send this message for 10 days then it's easy.
However because SQS has a much shorter visibility timeout, I have to find a much more roundabout way of triggering that action.
The Perl implementation was the original AFAIK.
I think I have a rough idea about how it works because I implemented something similar about four years ago in PostgreSQL but kept getting locking issues I couldn't get out of:
- https://stackoverflow.com/questions/33467813/concurrent-craw...
- https://stackoverflow.com/questions/29807033/postgresql-conc...
Also, what kind of queue size / concurrency on the polling side are you able to sustain for your current hardware?
Each shard allows at most 5 GetRecords operations per second. If you want to fan out to many consumers, you will reach those limits quickly and have to implement a significant latency/throughput tradeoff to make it work.
For API limits, see: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_...
- https://nsq.io/
- https://nats.io/https://metacpan.org/release/Directory-Queue/source/lib/Dire...
So how does one lock a message in s3? Does s3 have a "createIfDoesNotExistOrError"? I'm still having difficulty understanding how the proposed system avoids race conditions.
You can configure Redis for durability. The docs[1] page for persistence has a good examination of the pros and cons.
Definitely similar experience here. We handle ~10 million messages a day in a pubsub system quite similar in spirit to the above, running on AWS Aurora MySQL.
Our system isn't a queue. We track a little bit of short-lived state for groups of clients, and do low-latency, in-order message delivery between clients. But a lot of the architecture concerns are the same as with your queue implementation.
We switched over to our own pubsub code, implemented the simplest way we could think of, on top of vanilla SQL, after running for several months on a well-regarded SaaS NoSQL provider. After it became clear that both reliability and scaling were issues, we built several prototypes on top of other infrastructure offerings that looked promising.
We didn't want to run any infrastructure ourselves, and didn't want to write this "low-level" message delivery code. But, in the end, we felt that we could achieve better system observability, benchmarking, and modeling, with much less work, using SQL to solve our problems.
For us, the arguments are pretty much Dan McKinley's from the Choose Boring Technology paper.[0]
It's definitely been the right decision. We've had very few issues with this part of our codebase. Far, far fewer than we had before, when we were trying to trace down failures in code we didn't write ourselves on hardware that we had no visibility into at all. This has turned out to be a counter-data point to my learned aversion to writing any code if somebody else has already written and debugged code that I can use.
One caveat is that I've built three or four pubsub-ish systems over the course of my career, and built lots and lots of stuff on top of SQL databases. If I had 20 years of experience using specific NoSQL systems to solve similar problems, those would probably qualify as "boring" technology, to me, and SQL would probably seem exotic and full of weird corner cases. :-)
What it works really great for if you don't want to do the up front investment in managing a stateful cluster, is doing multi-step or fan-out processing. BEAM/OTP really shines when it's helpful to have individual processing steps coordinated but isolated, but where if a job needs to cancel and rerun (interrupted by a node restart or OOM), it's not an issue.
This is great resource https://www.erlang-in-anger.com/
https://github.com/mbuhot/ecto_job for Elixir
https://github.com/timgit/pg-boss for Node.js
I can't vouch for the queueing code but I believe it's quite robust too.
Here's a good post (and series) about distributed logs and NATS design issues: https://bravenewgeek.com/building-a-distributed-log-from-scr...