> Oh but what about ORDERED queues? The only way to get ordered application of writes is to perform them one after the other.
This is another WTF. Talking about ordered queues is like talking about databases, because it's structured data. If you can feed data from concurrent, unordered sources into a system where access can be ordered, you have access to sorted data. You deal with out-of-order data either at insertion, in a window during processing, or in the consumers. "Write in order" is not a requirement, but an option. Talking about technical subjects on Twitter always results in some mind-numbingly idiotic statements for the sake of 140 characters.
It would seem to me that naively, S3 charges $5 per million POST requests, so it's over 12x worse than SQS's $0.40 per million.
Interesting. Can you expand on this? How do you ensure that only one worker takes a message from s3? Or do you only use this setup when you have only one worker?
Just because you can, doesn’t mean you should.
Has anyone measured the performance of Redis on large sorted sets, say millions of items? Hoping that it's still in single digit milliseconds at that size... And can sustain say 1000QPS...
I used to use SQS but Postgres gives me everything I want. I can also do priority queueing and sorting.
I gave up on SQS when it couldn't be accessed from a VPC. AWS might have fixed that now.
All the other queueing mechanisms I investigated were dramatically more complex and heavyweight than Postgres SKIP LOCKED.
https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-sq...
If you have a visibility timeout short enough that the message appears back on the queue before you delete it... Sure.
Most of the complaints are native to message-based systems in general. At-least-once delivery, out-of-order receives: pretty standard fare that can be handled by applying well-established patterns.
My only request would be to please increase the limits of message visibility timeouts! Often I want to delay-send a message for receipt in 30 days. SQS forces me to cook up some weird delete-and-resend recipe, or make this the responsibility of a data store. It'd be really nice to do away with batch/cron jobs and deal more with delayed queue events.
Occasionally we used to have all workers tied up on a single customer's long-running tasks. We mitigated this with a throttler we wrote that can defer a job if too many resources are in use by that customer, but it's not ideal.
I’d love a priority based, customer throttled (eg max concurrent tasks) queue.
We can prioritize by low/medium/high using separate queues, and could make a set of queues per customer; but that is starting to explode how many queues we have and feels unmanageable.
You can imagine building a saga system on top of a queue system.
Actually, most of the prioritization could be implemented through an additional DB. With SQS, in most cases you need a persistent reflection of the job anyway, to track its status, processing times, and results. You can put only a few of the highest-priority items on the queue, enough to guarantee that workers are busy for the next 10-30 minutes.
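A rough sketch of that feeder idea, assuming a jobs table with priority and status columns, psycopg2, and boto3 for SQS (the table, columns, and queue URL are made up for illustration):

    import json
    import boto3
    import psycopg2

    QUEUE_URL = 'https://sqs.eu-central-1.amazonaws.com/123456789012/jobs'  # hypothetical

    sqs = boto3.client('sqs')
    conn = psycopg2.connect(database='jobs', user='jobsuser')

    def feed_queue(batch_size=10):
        with conn, conn.cursor() as cur:
            # Claim the highest-priority pending jobs and mark them enqueued,
            # so the next feeder run doesn't pick them up again.
            cur.execute("""
                UPDATE jobs SET status = 'enqueued'
                WHERE id IN (
                    SELECT id FROM jobs
                    WHERE status = 'pending'
                    ORDER BY priority DESC, created ASC
                    FOR UPDATE SKIP LOCKED
                    LIMIT %s)
                RETURNING id;""", (batch_size,))
            for (job_id,) in cur.fetchall():
                # If the commit fails after this send, the job can be
                # enqueued twice -- consumers should be idempotent.
                sqs.send_message(QueueUrl=QUEUE_URL,
                                 MessageBody=json.dumps({'job_id': job_id}))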
The problem is this. Let's say that I want to trigger a step in a "free trial" saga that sends an email to the customer 10 days after they sign up nudging them to get a paid account. If I can delay send this message for 10 days then it's easy.
However because SQS has a much shorter visibility timeout, I have to find a much more roundabout way of triggering that action.
But yes, a quirk.
Was hoping to avoid having "side car" infrastructure for this, but I don't think I can escape it :)
The Perl implementation was the original AFAIK.
Can somebody expand on this? I know about the 14-day limitation, but this makes it sound like you can store messages for a long time and still recover them somehow?
Most of them are designed and intended as a persistent data store. Data loss for a queue system is typically something you do not want.
Tbh purpose-built queues are taken way too eagerly by programmers who later end up needing the flexibility offered by a more general data store.
I think I have a rough idea about how it works because I implemented something similar about four years ago in PostgreSQL but kept getting locking issues I couldn't get out of:
- https://stackoverflow.com/questions/33467813/concurrent-craw...
- https://stackoverflow.com/questions/29807033/postgresql-conc...
Also, what kind of queue size / concurrency on the polling side are you able to sustain for your current hardware?
What does this actually mean in practical terms?
Can you explain this? Don't many applications deliver once and only once via locking? It's obviously easier as an application developer to say "I will only get this once" and accept losing messages than dealing with idempotence particularly in distributed services.
I am more interested in diagnosing successful SQS deliveries after the fact, to see what the payloads were in case there was a downstream problem.
It seems that SQS deliveries that don't get a 200 response from our service go to the DLQ, but those that get a successful 200 disappear into the ether.
Also, as another respondent replied - there is no real deliverability guarantee, although there are certain ways to handle that within SQS.
You can replace Redis with pretty much any message queue though, like RabbitMQ, which has a better consumption story. The main advantages of using any one of these would be the throughput you can achieve, and that it decouples your database at that point, which a lot of people prefer.
What do y'all do?
I don't see a downside to this approach. Perhaps some increased latency?
As for acking, I see two common methods: using an additional boolean column, something like is_processed, where consumers skip truthy rows; or, after the work is done, simply deleting the entry or moving it elsewhere (e.g. for archival/auditing).
Of course one could delay the commit until all processing is completed, but then reasoning about the queue throughput becomes tricky.
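A minimal sketch of the boolean-column variant, assuming psycopg2 and an illustrative message_queue table with an is_processed flag:

    import psycopg2

    conn = psycopg2.connect(database='jobs', user='jobsuser')

    def claim_one():
        with conn, conn.cursor() as cur:
            # Mark one unclaimed row as processed; SKIP LOCKED means
            # concurrent consumers pass over rows another worker holds.
            cur.execute("""
                UPDATE message_queue
                SET is_processed = TRUE
                WHERE id = (
                    SELECT id FROM message_queue
                    WHERE is_processed = FALSE
                    ORDER BY created ASC
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1)
                RETURNING *;""")
            return cur.fetchone()  # None when the queue is empty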
While dealing with at-most-once delivery is easier in isolation as an application developer on the consumer side, dealing with lost messages is in practice MUCH MUCH harder on the producer side than idempotent handling where required on the consumer side. You end up building elaborate mechanisms for locking and retries and receipt validation which can all fail.
Just think about emails, there are a lot of situations where you can't be 100% certain whether the other side received the message or not. For something low-priority it may be better to ignore partial failures, but if it's important it may be better to send a second message to guarantee delivery. If you're modeled on never sending the same message twice AND the messages matter, you're in trouble.
If you ack before processing, and then you crash, those messages are lost (assuming you can't recover from the crash and you are not using something like a two-phase commit).
If you ack after processing, you may fail after the messages have been processed but before you've been able to ack them. This leads to duplicates, in which case you better hope your work units are idempotent. If they are not, you can always keep a separate table of message IDs that have been processed, and check against it.
Either way, it's hard, complex and there are thousands of intermediate failure cases you have to think about. And for each possible solution (2pc, separate table of message IDs for idempotency, etc) you bring more complexity and problems to the table.
To be clear, it is not that the SKIP LOCKED solution is invalid, it is just that there are scenarios where it is not sufficient.
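For the separate table of processed message IDs, a minimal sketch of the consumer-side check, assuming Postgres with psycopg2 and an illustrative processed_messages table (message_id as its primary key):

    import psycopg2

    conn = psycopg2.connect(database='jobs', user='jobsuser')

    def do_some_work(payload):
        pass  # placeholder for the real handler

    def handle_once(message_id, payload):
        with conn, conn.cursor() as cur:
            # INSERT ... ON CONFLICT DO NOTHING acts as the dedupe check:
            # zero rows inserted means we've already processed this ID.
            cur.execute("""
                INSERT INTO processed_messages (message_id)
                VALUES (%s) ON CONFLICT (message_id) DO NOTHING;""",
                (message_id,))
            if cur.rowcount == 0:
                return  # duplicate delivery, skip
            # Any DB side effects done here commit atomically with the
            # dedupe row; side effects outside the DB still need care.
            do_some_work(payload)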
Does anyone think that sending message to SQS is slow?
Edit: With this update, I was able to process almost 3x the requests with the same resources, and it lowered my bills quite a lot.
For example, my SQS bill for last month:
Amazon Simple Queue Service EUC1-Requests-Tier1, $0.40 per 1,000,000 Amazon SQS requests per month thereafter: 290,659,096 requests = $116.26
it went to 0, and EC2 cost went down as well, because ELB spun up fewer instances since I could handle more requests with the same resources.
This was my experience with SQS. I just wanted to share it.
1. By combining services, 1 less service to manage in your stack (e.g. do your demo/local/qa envs all connect to SQS?)
2. Postgres preserves your data if it goes down
3. You already have the tools on each machine and everybody knows the querying language to examine the stack
4. All your existing DB tools (e.g. backup solutions) automatically now cover your queue too, for free.
5. Performance is a non-issue for any company doing < 10m queue items a day.
I don't think SQS is primarily for low-latency messaging; rather, it's a managed, highly available MQ with very little hassle.
    import psycopg2
    import psycopg2.extras
    import random

    db_params = {
        'database': 'jobs',
        'user': 'jobsuser',
        'password': 'superSecret',
        'host': '127.0.0.1',
        'port': '5432',
    }

    conn = psycopg2.connect(**db_params)
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

    def do_some_work(job_data):
        # Simulate work that fails about half the time.
        if random.choice([True, False]):
            print('do_some_work FAILED')
            raise Exception
        else:
            print('do_some_work SUCCESS')

    def process_job():
        # Claim (and remove) one queue item. SKIP LOCKED lets concurrent
        # workers pass over rows already locked by another transaction.
        sql = """DELETE FROM message_queue
        WHERE id = (
          SELECT id
          FROM message_queue
          WHERE status = 'new'
          ORDER BY created ASC
          FOR UPDATE SKIP LOCKED
          LIMIT 1
        )
        RETURNING *;
        """
        cur.execute(sql)
        queue_item = cur.fetchone()
        if queue_item is None:
            print('queue is empty, nothing to process')
            return
        print('message_queue says to process job id: ', queue_item['target_id'])
        sql = """SELECT * FROM jobs WHERE id = %s AND status = 'new_waiting' AND attempts <= 3 FOR UPDATE;"""
        cur.execute(sql, (queue_item['target_id'],))
        job_data = cur.fetchone()
        if job_data:
            try:
                do_some_work(job_data)
                sql = """UPDATE jobs SET status = 'complete' WHERE id = %s;"""
                cur.execute(sql, (queue_item['target_id'],))
            except Exception:
                sql = """UPDATE jobs SET status = 'failed', attempts = attempts + 1 WHERE id = %s;"""
                # if we want the job to run again, insert a new item to the message queue with this job id
                cur.execute(sql, (queue_item['target_id'],))
        else:
            print('no job found, did not get job id: ', queue_item['target_id'])
        conn.commit()

    process_job()

    cur.close()
    conn.close()

I know some people use Akka with the Persistence module, but I would welcome other alternatives.
Each shard allows at most 5 GetRecords operations per second. If you want to fan out to many consumers, you will reach those limits quickly and have to implement a significant latency/throughput tradeoff to make it work.
For API limits, see: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_...
SQS is a very simple service, which makes it fairly reliable, though part of the reason for the reliability is that the API's guarantees are weak. And it can be economical, but I've had to build a lot of non-trivial logic in order to interact with SQS robustly, performantly, and efficiently, especially around using the {Send,Receive,Delete}MessageBatch operations to reduce costs.
With the caveat that I think my use case has been quite different from what's discussed in this article, here are some of the problems I've encountered:
- Message sizes are limited, but in a convoluted way: SendMessageBatch has a 256 KiB limit on the request size. Message bodies only allow a limited character set, so you need to base64-encode any binary data. This also means that there's not exactly a max message size; you can batch up to 10 messages per SendMessageBatch, but not in excess of 256 KiB for the whole request.
- If you want to send more than 256 KiB × 3/4 minus some padding (around 180 KiB) of data for any single message, you need to put that data somewhere else and pass a pointer to it in the actual SQS message.
- SQS does routinely have temporary (edit: partial) failures that generally last for a few hours at a time. ReceiveMessageBatch may return no messages (or less than the max of 10) even if the queue has millions of messages waiting to be delivered; SQS knows it has them somewhere but it can't find them when you ask. And DeleteMessageBatch may fail for some of the messages passed while succeeding for others; it will sometimes fail repeatedly to delete those messages for an extended period.
- The SDKs provided by AWS (for either Java or Go) don't help you handle any of these things well; they just provide a window into the SQS API, and leave it to the user to figure all the details out.
The SKIP LOCKED bit means that once the queue item has been locked FOR UPDATE by one worker, other queries skip it instead of blocking on it, so each worker holds an exclusive lock on its own queue item.
It's pretty robust and works fine for servicing multiple workers.
From what I remember from numbers from New Relic.
- The average SQS Put latency is: 10-20 ms.
- RabbitMQ Publish Message in the same DC should not be more than the round trip time of a DC assuming a persistent connection. So, about 0.5 ms
Because it is hacky from the perspective of a distributed system architecture. It's coupling 2 components that probably ought not be coupled because it's perceived as "convenient" to do so. The idea that your system's control and data planes are tightly coupled is a dangerous one if your system grows quickly. It forces a lot of complexity into areas where you'd rather not deal with it, later manifesting as technical debt as folks who need to solve the problem have to go hunting through the codebase to find where all the control and throttling decisions are being made and what they touch.
> 1. By combining services, 1 less service to manage in your stack (e.g. do your demo/local/qa envs all connect to SQS?)
"Oh this SQS is such a pain to manage", is not something anyone who has used SQS is going to say. It's much less work than a database. It has other benefits too, like being easier for devs by being able to trivially replicate staging and production environments is a major selling point of cloud services in the first place. And unlike Postgres, you can do it from anywhere.
The decision to use the same postgres you use for data as a queue also concentrates failure in your app and increases the blast radius of performance issues concentrated around one set of hardware interfaces. A bad day for your disk can now manifest in many more ways.
> 2. Postgres preserves your data if it goes down
Not if you're using a SKIP LOCKED queue (in the sense that it has exactly the same failure semantics as SQS, and you're more likely to lose your postgres instance than SQS with a whole big SRE team is to go down). Can you name a data loss scenario for SQS in the past 8 years? I can think of one that didn't hit every customer, and I'm not at AWS. Personally, I haven't lost data to it despite a fair amount of use since the time us-east-1 flooded. Certainly "reliability" isn't SQS's primary problem.
> 3. You already have the tools on each machine and everybody knows the querying language to examine the stack
SQS has a web console. You don't need to know a querying language. If you're querying off a queue, you have ceased to have an unqualified queue.
> 4. All your existing DB tools (e.g. backup solutions) automatically now cover your queue too, for free.
Given most folks who are using AWS are strongly encouraged and empowered to use RDS, this seems like a wash. But also: SHOULD you back up your queues? This seems to me like an important architectural question. If your system goes down hard, should/can it resume where it left off when restarting?
I'd argue the answer is almost always "no." But YMMV.
> 5. Performance is a non-issue for any company doing < 10m queue items a day.
You say, and then your marketing team says, "We're gonna be on Ellen next week, get ready!"
The reason why is that it's probably a bad idea in the first place to mix your control and data interfaces into an inextricable knot. Since you're not going to be able to institute rate limiting at a sub-millisecond latency in a distributed system anyways, why isn't your control plane separate and instrumented?
Once you have a separate control plane, you can introduce back-pressure in many different ways and do so with a better understanding of the dissemination of those throttling values throughout your system.
So what I see are engineers who are actually ignoring the architectural decision with at least as many implications as "queue vs diffuse interface," namely "control and data planes, or just one monolithic system."
FIFO (First-In-First-Out) queues are designed to enhance messaging between applications when the order of operations and events is critical, or where duplicates can't be tolerated
I was on-call that day for an AWS service. There wasn't much I could do but sit muted on the conference call and watch some TV, waiting for the outage to be over.
What does "hacky" even mean? If it means using side effects for a primary purpose then no, SKIP LOCKED is not a side effect.
I researched a lot of alternative queues to SQS and tried several of them, but all of them were complex, heavyweight, with questionable library support, and more trouble than they were worth.
The good thing about using a database as a queue is that you can easily customise queue behaviour to implement things like ordering, priority, and whatever else you want, and it's all as easy as SQL.
As you say, using Postgres as a queue cut out a lot of the complexity associated with using a standalone queueing system.
I think MySQL/Oracle/SQL server might also support SKIP LOCKED.
If you're on the JVM then ActiveMQ (and other Java queues) will usually run embedded and have options to use a database for persistence.
>>>Not if you're using a SKIP LOCKED queue,
Can you provide a reference for this? I'm not aware that using SKIP LOCKED data is treated any differently to other Postgres data in terms of durability.
- https://nsq.io/
- https://nats.io/
The rest of the sentence is a challenge, "show me a data loss bug in SQS that extends beyond the failure cases you've already implicitly agreed to using postgres itself."
Reliable and durable? Emphatically not Redis, and at that point you probably want to just start looking at SQS.
(I've had good luck shipping products as docker-compose files, these days. Even to fairly backwards clients.)
As I was figuring out how to set up a datastore, query it for running workflows, and all that jazz, I happened upon an interesting SQS feature: Post with Delay.
And so, the system has no database. Instead, when new work arrives it posts the details of the work to be done to SQS. All hosts in the fleet are polling SQS for messages. When they receive one, they do the checks and if the process isn't complete they repost the message again with a 5-minute delay. In 5 minutes, a host in the fleet will receive the message and try again. The process continues as long as it needs to.
Looking back, part of me now is horrified at this design. But: that system now has thousands of users and continues to scale really well. Data loss is very rare. Costs are low. No datastore to manage. SQS is just really darned neat because it can do things like that.
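A minimal sketch of that repost-with-delay loop, assuming boto3 (the queue URL and message shape are made up for illustration):

    import json
    import boto3

    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/work'  # hypothetical

    sqs = boto3.client('sqs')

    def is_complete(work):
        return False  # placeholder for the real status check

    def poll_once():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            work = json.loads(msg['Body'])
            if not is_complete(work):
                # Not done yet: repost the same details with a 5-minute
                # delay, so some host in the fleet re-checks later.
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=msg['Body'],
                                 DelaySeconds=300)
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg['ReceiptHandle'])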
Then any new lambdas or other services that want to subscribe to messages will have another queue, and another, etc.
I haven't had a case where I had service groups coming up and down, I'm struggling to think of a use case.
For example, an AWS Lambda triggered from SQS will lead to thousands of executions, each lambda pulling a new message from SQS.
But another consumer group, maybe a group of load balanced EC2 instances, will have a separate queue.
In general, I don't know of cases where you want a single message duplicated across a variable number of consumer groups - services are not ephemeral things, even if their underlying processes are. You don't build a service, deploy it, and then tear it down the next day and throw away the code.
"nsqd is the daemon that receives, queues, and delivers messages to clients."
Systems "without queues" have just eschewed one queue for N queues, where N is the number of clients buffering actions for retries at any given moment.
https://metacpan.org/release/Directory-Queue/source/lib/Dire...
So how does one lock a message in s3? Does s3 have a "createIfDoesNotExistOrError"? I'm still having difficulty understanding how the proposed system avoids race conditions.
An example:
> Convert something to an async operation and your system will always return a success response. But there's no guarantee that the request will actually ever be processed successfully.
Great! I don't want service A to be coupled to service B's ability to work. I want A to send off a message and leave it to B to succeed or fail. This separation of state (service A and B can't even talk to each other directly) is part of what makes queues so powerful - it's also the foundation of the actor model, which is known for its powerful resiliency and scalability properties.
The author's suggestion of using synchronous communication with backpressure and sync failures is my last ditch approach. I have to set up circuit breakers just to make something like this anything less than a total disaster with full system failure due to a single service outage.
Like the author, the "good use cases for queues" is very nearly 100% for me. I believe you should reach for queues first, and it's worth remodeling a system to be queue based if you can help it.
Sometimes modeling as synchronous control is easiest, but I'm happy that I can avoid that in almost every case.
Technically, it's not even really a failure of SQS because the guarantees SQS makes are so weak that those partial failures are really "operating normally."
It's funny reading this after using Erlang/Elixir over the last few years. The default is always async with the assumption it will fail - as async processes failing is a core part of the OTP application architecture.
It's not something to be feared but a key part of how your application data-flow works.
Maybe I'm wrong and it's not so much work in the end. Hoping for some feedback.
A non-comprehensive list of ways I've seen my developers shoot themselves in the foot:
* Giant try-catch block around the message handling code to requeue messages that threw an exception. They neglected to add any accounting, so some messages would just never process. No one noticed until they saw the queue size never dropped below a certain amount during debugging.
* Queue behavior is highly dependent on configuration. Bad queue configurations result in dropped messages. Queueing systems provide few features to detect and alert on these failures (it's really not their job), but building a system to track the integrity of the business process across queues is deemed too onerous.
* The built-in observability is generally not enough to be complete. I haven't seen a lot of great instrumentation libraries for SQS like there are for HTTP, meaning that observability is pushed on to the developer. They typically ignore that requirement because PMs rarely care until they realize we're unable to respond to incidents effectively.
* Most people vastly overestimate their scale. The number of applications I've seen built on SQS "because scale" that end up taking less than 100 QPS globally is significant. Anecdotally, I would say the majority of queue-based apps I have seen could have solved their scaling issues within HTTP.
* Many people want to treat queued messages like time-delayed HTTP requests. They are not, the semantics and design are totally different. I have seen people marshal requests to Protobuf, use it as the body of a message, and had another service read and process the request, and write another message to a queue that the first app reads back. It's basically gRPC over queues. Except that it solves none of the problems gRPC does, and creates a lot of problems. Just an example, how do you canary when you can't guarantee that the version of the app that sends the request will get the response to that request?
I think SQS is an amazing tool in the hands of people that know when to use it, and how to use it. But my experience has been that most people don't, and the ecosystem to make it available to people who aren't experts just doesn't exist yet.
Personally, I would never use a relational database as a queue, but I'm already very familiar with SQS's semantics.
Especially for temporary batch processing jobs, where I really don't want to be modifying my production tables, SQS queues rock.
You can configure Redis for durability. The docs[1] page for persistence has a good examination of the pros and cons.
The takeaway for me is: distributed systems are hard. If you have distributed workers, you have entered into a vastly more complex realm. SQS gives you some tools to work successfully in that environment, but it doesn't (and can't) get rid of that complexity. Most of the problems I've seen relate to engineers not understanding the fundamental complexity of coordinating distributed work. Your choice of tech stack for your queues isn't going to make a big difference if you don't understand what you're fundamentally dealing with.
What are our options for keeping SQS but somehow sending large payloads? Only thing I can think of is throwing them into another datastore and using the SQS just as a pointer to, say, a key in a Redis instance.
(Kafka is probably off the table for this, but I could be convinced. I'd like to hear other solutions first, though.)
It's something to be aware of and to have backup plans for, but we've been using Redis as our primary datastores for over a year with only one or two instance failures which were quickly resolved within minutes by failing over to replicas, with no data loss.
Definitely similar experience here. We handle ~10 million messages a day in a pubsub system quite similar in spirit to the above, running on AWS Aurora MySQL.
Our system isn't a queue. We track a little bit of short-lived state for groups of clients, and do low-latency, in-order message delivery between clients. But a lot of the architecture concerns are the same as with your queue implementation.
We switched over to our own pubsub code, implemented the simplest way we could think of, on top of vanilla SQL, after running for several months on a well-regarded SaaS NoSQL provider. After it became clear that both reliability and scaling were issues, we built several prototypes on top of other infrastructure offerings that looked promising.
We didn't want to run any infrastructure ourselves, and didn't want to write this "low-level" message delivery code. But, in the end, we felt that we could achieve better system observability, benchmarking, and modeling, with much less work, using SQL to solve our problems.
For us, the arguments are pretty much Dan McKinley's from the Choose Boring Technology paper.[0]
It's definitely been the right decision. We've had very few issues with this part of our codebase. Far, far fewer than we had before, when we were trying to trace down failures in code we didn't write ourselves on hardware that we had no visibility into at all. This has turned out to be a counter-data point to my learned aversion to writing any code if somebody else has already written and debugged code that I can use.
One caveat is that I've built three or four pubsub-ish systems over the course of my career, and built lots and lots of stuff on top of SQL databases. If I had 20 years of experience using specific NoSQL systems to solve similar problems, those would probably qualify as "boring" technology, to me, and SQL would probably seem exotic and full of weird corner cases. :-)
Based on the structure of the message (UUIDv4) you could probably roll your own implementation in any language.
What it works really great for if you don't want to do the up front investment in managing a stateful cluster, is doing multi-step or fan-out processing. BEAM/OTP really shines when it's helpful to have individual processing steps coordinated but isolated, but where if a job needs to cancel and rerun (interrupted by a node restart or OOM), it's not an issue.
This is great resource https://www.erlang-in-anger.com/
Same as SQS.
Say the outage results in a few million messages that need to be retried. Some subset of those few million will never succeed (aka they are “poisoned pills”). At the same time, new messages are arriving.
In your system, how do you maintain QoS for incoming messages as well as allow for the resolution of the few million retries while also preventing the poisoned pills from blocking the queue? How do you implement exponential backoff, which is the standard approach for this?
SQS gives you some simple yet powerful primitives such as the visibility timeout setting to address this scenario in a straightforward manner.
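For instance, a sketch of per-message exponential backoff built on the receive count and ChangeMessageVisibility, using boto3 (the constants are illustrative):

    import boto3

    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/work'  # hypothetical

    sqs = boto3.client('sqs')

    def process(body):
        pass  # placeholder for the real handler

    def receive_with_backoff():
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20,
            AttributeNames=['ApproximateReceiveCount'])
        for msg in resp.get('Messages', []):
            attempts = int(msg['Attributes']['ApproximateReceiveCount'])
            try:
                process(msg['Body'])
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg['ReceiptHandle'])
            except Exception:
                # Failed again: hide the message for 2^attempts minutes
                # (capped) so poisoned pills don't hog the consumers.
                delay = min(60 * 2 ** attempts, 43200)  # SQS max is 12 hours
                sqs.change_message_visibility(
                    QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'],
                    VisibilityTimeout=delay)

Combined with a redrive policy (maxReceiveCount plus a DLQ), the poisoned pills eventually drain off the main queue instead of blocking it.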
Google Cloud really outshines AWS here with its serverless Pub/Sub: it's trivial to fan out, it's low latency, and it has similar delivery semantics (I think), with better, easier APIs IMHO. It's a really impressive service.
Also, back pressure isn't difficult to implement. Simply read the estimated size of the queue every N minutes and pause sending until it goes down to a more manageable level. The obvious downside is that it's client-side.
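A minimal sketch of that, assuming boto3 (the queue URL and threshold are made up):

    import time
    import boto3

    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/work'  # hypothetical
    MAX_BACKLOG = 100_000  # arbitrary threshold

    sqs = boto3.client('sqs')

    def wait_for_headroom():
        # Poll the (approximate) queue depth and pause the producer
        # until the backlog drains below the threshold.
        while True:
            attrs = sqs.get_queue_attributes(
                QueueUrl=QUEUE_URL,
                AttributeNames=['ApproximateNumberOfMessages'])
            backlog = int(attrs['Attributes']['ApproximateNumberOfMessages'])
            if backlog < MAX_BACKLOG:
                return
            time.sleep(60)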
I was under the impression that's the industry standard - you drop the payload in some redis-like storage and pass keys in messages.
We do it "by ourselves", not using the provided lib, because that way it works both for SQS and SNS. The provided lib only supports SQS.
Also our messages aren't typically very big, so we do this only if the payload size demands it.
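A sketch of that claim-check pattern with S3 as the side store, assuming boto3 (the bucket name, queue URL, and size cutoff are all illustrative):

    import json
    import uuid
    import boto3

    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/work'  # hypothetical
    BUCKET = 'my-large-payloads'  # hypothetical
    CUTOFF = 180 * 1024  # stay safely under the 256 KiB request limit

    sqs = boto3.client('sqs')
    s3 = boto3.client('s3')

    def send(payload: str):
        if len(payload.encode()) <= CUTOFF:
            sqs.send_message(QueueUrl=QUEUE_URL,
                             MessageBody=json.dumps({'inline': payload}))
        else:
            # Park the body in S3 and send only a pointer through SQS;
            # the consumer fetches and deletes the object.
            key = str(uuid.uuid4())
            s3.put_object(Bucket=BUCKET, Key=key, Body=payload.encode())
            sqs.send_message(QueueUrl=QUEUE_URL,
                             MessageBody=json.dumps({'s3_key': key}))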
If you are using the queue as a log of events (i.e. user actions), you get an atomic guarantee that the db data is updated and the event describing this update has been recorded.
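That atomicity is just one transaction around both writes. A minimal sketch, assuming illustrative orders and message_queue tables:

    import json
    import psycopg2

    conn = psycopg2.connect(database='jobs', user='jobsuser')

    def update_and_record(order_id):
        # Both writes commit or roll back together: the state change and
        # the event describing it can never disagree.
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE orders SET status = 'paid' WHERE id = %s;",
                        (order_id,))
            cur.execute("INSERT INTO message_queue (status, payload) "
                        "VALUES ('new', %s);",
                        (json.dumps({'event': 'OrderPaid',
                                     'order_id': order_id}),))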
I guess if you're at the point where your engineering time to implement this + all of the features on top of it that you might need from SQS and future maintenance of this custom solution is cheaper than the cost of using SQS, and you have no other outstanding work that your engineering team should be doing instead, this is a viable cost optimization strategy.
But that's a whole lot of ifs, and with customers I've mostly worked with, they're far better served just using SQS.
Depending on the workload this could be not a big deal or very expensive. Treating a queue as a database, particularly queues that can't participate in XA transactions, can get you in trouble quick.
https://github.com/mbuhot/ecto_job for Elixir
https://github.com/timgit/pg-boss for Node.js
We just recently started gzipping our payloads, which buys us even more time.
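Gzip plus base64 is only a few lines in Python; base64 keeps the result within SQS's allowed character set (a sketch; compression ratios obviously vary with the data):

    import base64
    import gzip
    import json

    def pack(obj) -> str:
        # JSON -> gzip -> base64, so the result is safe as a message body.
        raw = json.dumps(obj).encode()
        return base64.b64encode(gzip.compress(raw)).decode()

    def unpack(body: str):
        return json.loads(gzip.decompress(base64.b64decode(body)))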
Producer system puts 500 messages on the queue. Consumer system can't see anything for 90 minutes. Then mysteriously the messages show up.
The status page stays green, not even a note about elevated error rates.
The zero cost stuff is always going to be a big draw with cloud deployments and the various demands from the company. Although a lot of this stuff like messaging and clustering/failover is within the application/Beam VM itself rather than something scaled or managed externally to the software. But that level of server and infrastructure stuff is out of my league of understanding.
If you want a reliable system along those lines, then you need to use SKIP LOCKED to SELECT one row to lock, then process it, and then DELETE the row. If your process dies, the lock will be released. You still have a new flavor of the same problem: you might process a message twice, because the process might die between completing processing and deleting the row. You could add complexity: first use SKIP LOCKED to SELECT one row, UPDATE it to mark it in-progress, and LOCK the row; then, if the process dies, another can go check whether the job was performed (and clean up the garbage) or not (pick up and perform the job) -- a two-phase commit, essentially.
Factor out PG, and you'll see that the problem is similar no matter the implementation.
If you don't want to keep the transaction open, then you can just go back to updating a column containing the message status, which avoids keeping a transaction open but might need a background process to check for stalled-out consumers.
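A sketch of that status-column variant, plus the background reaper it implies (status, locked_at, and created are illustrative columns on message_queue):

    import psycopg2

    conn = psycopg2.connect(database='jobs', user='jobsuser')

    def claim_one():
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE message_queue
                SET status = 'in_progress', locked_at = now()
                WHERE id = (
                    SELECT id FROM message_queue
                    WHERE status = 'new'
                    ORDER BY created ASC
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1)
                RETURNING id;""")
            return cur.fetchone()

    def reap_stalled(max_age='15 minutes'):
        # Background job: anything 'in_progress' for too long is assumed
        # to belong to a dead consumer and is made visible again.
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE message_queue
                SET status = 'new', locked_at = NULL
                WHERE status = 'in_progress'
                  AND locked_at < now() - %s::interval;""", (max_age,))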
It is a very, very slippery slope, and it's seductive to put everything into Postgres at low scale. But I've seen more places sunk by technical debt (because they can't move fast enough to reach parity with market demands, can't scale operationally within budget, or can't address competitors' features) than ones that don't get enough traction before technical debt becomes a problem. But perhaps my experience is perversely biased and I'm just driven to over-engineer stuff because I've been routinely handed the reins of quite literally business non-viable software systems that were over-sold repeatedly, while Silicon Valley sounds like it has the inverse problem of over-engineering for non-viable businesses.
That's hard to argue with.
The example code shows how it should be done - a simple table dedicated to messages which supplies the id of the work/job to be carried out.
Not much can be done architecturally to address developers messing that up.
So a duplicate message should be processed as normal anyway, e.g. by deduplication within a reasonable window, and/or by having idempotent operations.
I can't vouch for the queueing code but I believe it's quite robust too.
It's important to make this distinction because there are commonly-used systems that offer a best-effort style persistence that usually works fine, but explicitly warn developers not to trust it, and developers rarely understand that.
We badly need to get better at distinguishing between true, production-level data integrity and "probably fine".
My personal experience is that abuse of queuing/messaging systems along this axis is rampant. Engineering leaders must keep a close eye on how these types of mechanisms are utilized to ensure things don't go off the rails.
I've seen far too many serious data loss events that boil down to "we lost our AMQP queue". It's critical that developers understand the limitations of the systems that run their code rather than just jumping aboard that "SQL is for old people" hype train.
What other options would you recommend that can provide at-least-once delivery and are lightweight enough not to require ZooKeeper etc.?
But their only method of throttling is to scale up and down based on failures. And it has been very unpredictable for me.
Even though my webhook started failing and timing out on requests, Pub/Sub just kept hammering my servers until it brought them completely to their knees. Logs on Google's end showed 1,500 failed attempts per second and 0.2 successes per second. It hammered at this rate for half an hour.
Seems like their Push option really needs some work.
Although 100-300ms seems pretty good for total round-trip latency to most message queues. Another thing to make sure of is that whatever HTTP client you’re using to interact with AWS is using pipelining. It’s off by default for the JS libraries for example.
Looking solely from a code perspective, how would this be done prior to Postgres 9.5 (before SKIP LOCKED)? Of the four examples I saw (either MySQL or Postgres), all of them choked constantly and had a pathological case of a job failing to execute expediently during peak usage, which made on-call a nightmare. So what do we do about the places that can't upgrade their Postgres databases because it's too complicated to decouple messaging and business data now? Based upon my observations, the answer is "it's never done", or the company goes under from the inability to enact changes caused by the paralysis.
The reality is that simple jobs are almost never enough outside toy examples. Soon, someone wants to be notified about job status changes - then comes an event table with a foreign key constraint on the job table's primary key. Also add in job digraphs with sub-tasks and workflows (AKA weak transactions) - these are non-trivial to do well. Depending upon your SQL variant and version combined with access patterns, row level locking and indexing can be an O(1) lookup or an O(n^2) nightmare that causes tons of contention depending (once again) upon the manner in which jobs are modified.
Instead of thinking about "what are my transaction boundaries and what do my asynchronous jobs look like?", which are much more invariant and important to the business, people try to map their existing solution onto as many problems as possible. Answer those questions first, and it should be more clear whether you would be fine with an RDBMS table, Airflow, SQS, Redis, RabbitMQ, JMS, etc. Operations-wise more components is certainly a headache, but I've had more headaches in production due to inappropriate technology for the problem domain than "this is just too many parts!"
OrderCreated(order_id=23)
UserLoggedIn(user_id=342)
etc.
I guess you are sending the whole user object over the event bus? Isn't that an anti-pattern?
It's possible to implement SKIP LOCKED in userland in PostgreSQL (using NOWAIT, which has been in PG far longer), although it's obviously a bit slower.
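Roughly, the userland version tries candidate rows one at a time with FOR UPDATE NOWAIT and treats a lock failure as "skip". A sketch against the same illustrative message_queue table, assuming psycopg2 >= 2.8 for psycopg2.errors:

    import psycopg2
    import psycopg2.errors

    conn = psycopg2.connect(database='jobs', user='jobsuser')
    cur = conn.cursor()

    def claim_one_pre_9_5(batch=10):
        # The caller does the work and commits while the row lock is
        # held, just like in the SKIP LOCKED version.
        cur.execute("""SELECT id FROM message_queue
                       WHERE status = 'new'
                       ORDER BY created ASC LIMIT %s;""", (batch,))
        for (row_id,) in cur.fetchall():
            cur.execute("SAVEPOINT try_row;")
            try:
                # NOWAIT errors out immediately instead of blocking when
                # another worker already holds this row's lock.
                cur.execute("""SELECT * FROM message_queue
                               WHERE id = %s AND status = 'new'
                               FOR UPDATE NOWAIT;""", (row_id,))
                row = cur.fetchone()
                if row:
                    return row
            except psycopg2.errors.LockNotAvailable:
                # Roll back only this attempt so the transaction stays usable.
                cur.execute("ROLLBACK TO SAVEPOINT try_row;")
        return None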
log2(4) = 2, log2(16) = 4, log2(256) = 8, log2(65,536) = 16, log2(4,294,967,296) = 32
Assuming other things remain unaffected, theoretically that means the operation on a set of 4 billion elements should be about 4 times slower than on a set of 256 elements.
Any middle tiering of the data before it reaches the consumer is still "the queue". You don't need to know the internals of SQS any more than a consumer needs to know the black-box elements of how you collate the messages within your ad-hoc queue.
Here's a good post (and series) about distributed logs and NATS design issues: https://bravenewgeek.com/building-a-distributed-log-from-scr...
I can't do any analytics about how long things typically take, who my biggest users are, etc. I mean, I could, but I'd have to add a datastore for that.
Adding new details to the parameters of the system requires very careful work to make all changes backwards and forwards compatible so that mid-deployment we don't have messages being pushed that old hosts can't process or new hosts seeing old messages they don't understand. That's good practice generally, but it's super mission critical to get right this way.
Also, a dropped message is invisible. SQS has redrive, sure, and that helps, but if there were a bug, an edge case where the system stopped processing something and quietly failed, that processing would just stop and we'd never know. If the entries were in a datastore, we'd see "Hey, this one didn't finish and I haven't worked on it lately, what gives?"
The very handy thing about the setup described, is that your data tables are part of the same MVCC world-state as your message queue. So you do all the work for the job, in the context of the same MVCC transaction that is holding the job locked; and anything that causes the job to fail, will fail the entire transaction, and thus rollback any changes that the job's operation made to the data.
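In other words, the transaction shape looks something like this (a sketch; do_some_work stands in for the actual job body):

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(database='jobs', user='jobsuser')

    def do_some_work(cur, item):
        pass  # the job's own data changes, made through the same cursor

    def run_one_job():
        # One MVCC transaction: claiming the job (DELETE ... SKIP LOCKED),
        # the job's data changes, and the commit all happen together. Any
        # exception rolls everything back, and the queue row reappears.
        with conn, conn.cursor(
                cursor_factory=psycopg2.extras.DictCursor) as cur:
            cur.execute("""DELETE FROM message_queue
                           WHERE id = (
                               SELECT id FROM message_queue
                               ORDER BY created ASC
                               FOR UPDATE SKIP LOCKED
                               LIMIT 1)
                           RETURNING *;""")
            item = cur.fetchone()
            if item:
                do_some_work(cur, item)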