zlacker

[parent] [thread] 20 comments
1. zug_zu+(OP)[view] [source] 2019-05-27 14:21:47
I LOVE this idea. I usually hear other Sr. engineers denigrate it as "hacky," but I think they aren't really looking at the big picture.

1. By combining services, 1 less service to manage in your stack (e.g. do your demo/local/qa envs all connect to Sqs?)

2. Postgres preserves your data if it goes down

3. You already have the tools on each machine and everybody knows the querying language to examine the stack

4. All your existing DB tools (e.g. backup solutions) automatically now cover your queue too, for free.

5. Performance is a non-issue for any company doing < 10m queue items a day.

replies(5): >>KirinD+P2 >>andrew+24 >>cle+Mf >>snisar+VB >>devonk+EZ
2. KirinD+P2[view] [source] 2019-05-27 14:50:57
>>zug_zu+(OP)
> I LOVE this idea. I usually hear other Sr. engineers denigrate it as "hacky," but I think they aren't really looking at the big picture.

Because it is hacky from the perspective of a distributed system architecture. It's coupling 2 components that probably ought not be coupled because it's perceived as "convenient" to do so. Tightly coupling your system's control and data planes is dangerous if your system grows quickly. It forces a lot of complexity into areas where you'd rather not deal with it, later manifesting as technical debt when the folks who need to solve the problem have to go hunting through the codebase to find where all the control and throttling decisions are being made and what they touch.

> 1. By combining services, 1 less service to manage in your stack (e.g. do your demo/local/qa envs all connect to Sqs?)

"Oh this SQS is such a pain to manage", is not something anyone who has used SQS is going to say. It's much less work than a database. It has other benefits too, like being easier for devs by being able to trivially replicate staging and production environments is a major selling point of cloud services in the first place. And unlike Postgres, you can do it from anywhere.

The decision to use the same Postgres you use for data as a queue also concentrates failure in your app and increases the blast radius of performance issues around one set of hardware interfaces. A bad day for your disk can now manifest in many more ways.

> 2. Postgres preserves your data if it goes down

Not if you're using a SKIP LOCKED queue (in the sense that it has exactly the same failure semantics as SQS, and you're more likely to lose your Postgres instance than SQS, with its whole big SRE team behind it, is to go down). Can you name a data loss scenario for SQS in the past 8 years? I can think of one that didn't hit every customer, and I'm not at AWS. Personally, I haven't lost data to it despite a fair amount of use since the time us-east-1 flooded. Certainly "reliability" isn't SQS's primary problem.

> 3. You already have the tools on each machine and everybody knows the querying language to examine the stack

SQS has a web console. You don't need to know a querying language. If you're querying off a queue, you have ceased to have an unqualified queue.

> 4. All your existing DB tools (e.g. backup solutions) automatically now cover your queue too, for free.

Given most folks who are using AWS are strongly encouraged and empowered to use RDS, this seems like a wash. But also: SHOULD you back up your queues? This seems to me like an important architectural question. If your system goes down hard, should/can it resume where it left off when restarting?

I'd argue the answer is almost always "no." But YMMV.

> 5. Performance is a non-issue for any company doing < 10m queue items a day.

You say, and then your marketing team says, "We're gonna be on Ellen next week, get ready!"

replies(1): >>andrew+o4
3. andrew+24[view] [source] 2019-05-27 15:01:52
>>zug_zu+(OP)
I don't think it's hacky - it's using documented Postgres functionality in the way it's intended. Engineers tend to react that way to anything unfamiliar, until they decide it's a good idea, and then they evangelise it.

What does "hacky" even mean? If it means using side effects for a primary purpose then no, SKIP LOCKED is not a side effect.

I researched a lot of alternative queues to SQS and tried several of them, but all of them were complex and heavyweight, with questionable library support, and more trouble than they were worth.

The good thing about using a database as a queue is that you get to easily customise queue behaviour to implement things like ordering and priority and whatever other stuff you want, and it's all as easy as SQL.

As you say, using Postgres as a queue cut out a lot of complexity associated with using a standalone queueing system.

I think MySQL/Oracle/SQL Server might also support SKIP LOCKED.
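A minimal sketch of the pattern (table and column names here are illustrative, not from my actual implementation):

    CREATE TABLE jobs (
      id         bigserial PRIMARY KEY,
      priority   int         NOT NULL DEFAULT 0,
      payload    jsonb       NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
    );

    -- Each worker claims the oldest, highest-priority job; rows another
    -- worker has already locked are skipped instead of blocked on.
    DELETE FROM jobs
    WHERE id = (
      SELECT id
      FROM jobs
      ORDER BY priority DESC, created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;

Run inside a transaction, a crashed worker's DELETE simply rolls back and the job becomes visible to the other workers again.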

replies(1): >>kwindl+nl
4. andrew+o4[view] [source] [discussion] 2019-05-27 15:05:54
>>KirinD+P2
>>>> 2. Postgres preserves your data if it goes down

>>>Not if you're using a SKIP LOCKED queue,

Can you provide a reference for this? I'm not aware that using SKIP LOCKED data is treated any differently to other Postgres data in terms of durability.

replies(1): >>KirinD+H4
5. KirinD+H4[view] [source] [discussion] 2019-05-27 15:08:55
>>andrew+o4
It's exactly the same case as SQS: if your disk interface dies mid-write of a message, you lose it, just as you lose it if your interface to SQS, or SQS itself, dies mid-write.

The rest of the sentence is a challenge: "show me a data loss bug in SQS that extends beyond the failure cases you've already implicitly agreed to by using Postgres itself."

replies(2): >>gmueck+Qg >>crypto+aV
6. cle+Mf[view] [source] 2019-05-27 16:40:14
>>zug_zu+(OP)
As always, it depends on your use cases. The data management and API footprint story is way simpler with SQS than with a relational database, and you don't have to worry about all the scaling complexity that comes with relational databases. Plus you get things like AZ resiliency for free. Staging SQS queues for demo/local/qa/dev envs is trivial if you are using CloudFormation.

Personally, I would never use a relational database as a queue, but I'm already very familiar with SQS's semantics.

Especially for temporary batch processing jobs, where I really don't want to be modifying my production tables, SQS queues rock.

7. gmueck+Qg[view] [source] [discussion] 2019-05-27 16:48:04
>>KirinD+H4
What possibilities are there for a write to fail that the database server cannot react to? I am under the impression that a write error like this would be reported as a failure to run the statement and the application is responsible for handling that.
replies(1): >>KirinD+dn
8. kwindl+nl[view] [source] [discussion] 2019-05-27 17:20:55
>>andrew+24
Thanks for posting that code!

Definitely similar experience here. We handle ~10 million messages a day in a pubsub system quite similar in spirit to the above, running on AWS Aurora MySQL.

Our system isn't a queue. We track a little bit of short-lived state for groups of clients, and do low-latency, in-order message delivery between clients. But a lot of the architecture concerns are the same as with your queue implementation.

We switched over to our own pubsub code, implemented the simplest way we could think of, on top of vanilla SQL, after running for several months on a well-regarded SaaS NoSQL provider. After it became clear that both reliability and scaling were issues, we built several prototypes on top of other infrastructure offerings that looked promising.

We didn't want to run any infrastructure ourselves, and didn't want to write this "low-level" message delivery code. But, in the end, we felt that we could achieve better system observability, benchmarking, and modeling, with much less work, using SQL to solve our problems.

For us, the arguments are pretty much Dan McKinley's from the Choose Boring Technology paper.[0]

It's definitely been the right decision. We've had very few issues with this part of our codebase. Far, far fewer than we had before, when we were trying to trace down failures in code we didn't write ourselves on hardware that we had no visibility into at all. This has turned out to be a data point against my learned aversion to writing any code when somebody else has already written and debugged code that I can use.

One caveat is that I've built three or four pubsub-ish systems over the course of my career, and built lots and lots of stuff on top of SQL databases. If I had 20 years of experience using specific NoSQL systems to solve similar problems, those would probably qualify as "boring" technology, to me, and SQL would probably seem exotic and full of weird corner cases. :-)

[0] - https://mcfunley.com/choose-boring-technology

9. KirinD+dn[view] [source] [discussion] 2019-05-27 17:35:04
>>gmueck+Qg
The box turning off, the disk interface failing, and/or the link to the database failing mid-instruction.

Same as SQS.

replies(3): >>xyzzy_+aA >>z3t4+dD >>mcquee+0I
10. xyzzy_+aA[view] [source] [discussion] 2019-05-27 19:22:29
>>KirinD+dn
All of those would result in a write failure from the application's perspective, which is fine, and must be accounted for regardless (e.g. retry, two phase commit, log an error, whatever).
11. snisar+VB[view] [source] 2019-05-27 19:38:34
>>zug_zu+(OP)
There is another advantage.

If you are using the queue as a log of events (e.g. user actions), you get an atomic guarantee that the db data is updated and the event describing this update has been recorded.
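A sketch of what I mean (table names are made up for illustration):

    BEGIN;

    -- The business change and the event recording it commit or roll
    -- back together; with a separate broker you'd need two-phase commit
    -- or an outbox table to get the same guarantee.
    UPDATE accounts
    SET balance = balance - 100
    WHERE id = 42;

    INSERT INTO events (kind, payload)
    VALUES ('balance_debited', '{"account_id": 42, "amount": 100}');

    COMMIT;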

replies(1): >>maniga+tY
12. z3t4+dD[view] [source] [discussion] 2019-05-27 19:53:24
>>KirinD+dn
Also, the FS might report the data as written when it's actually sitting in a write cache, and it will be lost if the plug is pulled.
replies(1): >>gmueck+MM
13. mcquee+0I[view] [source] [discussion] 2019-05-27 20:44:33
>>KirinD+dn
But you have to explicitly delete the message from SQS, right? You'd only delete after confirming you processed the message, right? So if you die mid-instruction in processing a message, the message just re-appears in the SQS queue after the visibility timeout.
replies(1): >>KirinD+o01
14. gmueck+MM[view] [source] [discussion] 2019-05-27 21:31:53
>>z3t4+dD
Not an issue if you follow the recommendations in the PostgreSQL documentation. The way to configure write caches is described there.
15. crypto+aV[view] [source] [discussion] 2019-05-27 23:11:44
>>KirinD+H4
SKIP LOCKED is a read-side feature. PG is a problem for u/andrewstuart because they have to manage it themselves (unless they're getting a PG service from a cloudy provider, say) and that's harder than it is for Amazon to manage SQS.
16. maniga+tY[view] [source] [discussion] 2019-05-28 00:02:34
>>snisar+VB
Also being able to run a query over the contents of the entire queue and the message bodies (for example, a JSON blob).
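For example (assuming a jsonb payload column; names here are made up):

    -- Ad-hoc inspection of the backlog, grouped by a field inside the
    -- message body - try doing this against an opaque SQS queue.
    SELECT payload->>'customer_id' AS customer, count(*) AS pending
    FROM jobs
    GROUP BY 1
    ORDER BY pending DESC;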
17. devonk+EZ[view] [source] 2019-05-28 00:22:22
>>zug_zu+(OP)
I'm not opposed to a separate table or DB used exclusively for messaging, separate from business-level data, but I don't think I've ever heard of anyone doing that in practice. It's almost always a bunch of hacky joins someone wants to do with the messages, and the messaging gets directly coupled with the schema (foreign key constraints between transaction or event IDs and business entities is the usual case). Letting one or two things be used for purposes they weren't necessarily designed for is alright, but coupling the same place that holds your OLTP data together with your messaging system usually signals a culture where the same DB will be used in another month as a time series DB, then as an OLAP system, and then there's no time to properly decouple anything anymore or actually understand your needs, and nobody wants to touch the very, very brittle DB that runs the entire business on its back.

It is a very, very slippery slope, and it's seductive to put everything into Postgres at low scale, but I've seen more places sunk by technical debt (because they can't move fast enough to reach parity with market demands, can't scale operationally within the budget, or can't address competitors' features) than ones that never get enough traction for technical debt to become a problem. But perhaps my experience is perversely biased and I'm just driven to over-engineer stuff because I've been routinely handed the reins of quite literally business-non-viable software systems that were over-sold repeatedly, while Silicon Valley sounds like it has the inverse problem of over-engineering for non-viable businesses.

replies(1): >>andrew+k01
18. andrew+k01[view] [source] [discussion] 2019-05-28 00:34:08
>>devonk+EZ
I think you're saying it's not a good approach because developers will do it wrong?

That's hard to argue with.

The example code shows how it should be done - a simple table dedicated to messages which supplies the id of the work/job to be carried out.

Not much can be done architecturally to address developers messing that up.

replies(1): >>devonk+iV1
19. KirinD+o01[view] [source] [discussion] 2019-05-28 00:34:39
>>mcquee+0I
That's a feature, and one you have to replicate in your Postgres queue.
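Replicating it usually means something like a lease column (a sketch; table and column names are made up, and it assumes the queue table has a claimed_until timestamptz column):

    -- Claim a job for five minutes. If the worker dies, the lease lapses
    -- and the job becomes claimable again, much like an SQS message
    -- reappearing after its visibility timeout.
    UPDATE jobs
    SET claimed_until = now() + interval '5 minutes'
    WHERE id = (
      SELECT id
      FROM jobs
      WHERE claimed_until IS NULL OR claimed_until < now()
      ORDER BY created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;

    -- On success the worker deletes the row; on a crash it just lets the
    -- lease expire.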
20. devonk+iV1[view] [source] [discussion] 2019-05-28 13:30:53
>>andrew+k01
While the pattern is simple programmatically, it is insufficient operationally and sets a precedent of coupling your data plane and your messaging plane.

Looking solely from a code perspective, how would this be done prior to Postgres 9.5 (before SKIP LOCKED)? Of the four examples I saw (either MySQL or Postgres), all of them choked constantly and had a pathological case of a job failing to execute expediently during peak usage, which made on-call a nightmare. So what do we do about the places that can't upgrade their Postgres databases because it's too complicated to decouple messaging and business data now? Based upon my observations, the answer is "it's never done", or the company goes under from the inability to enact changes caused by the paralysis.

The reality is that simple jobs are almost never enough outside toy examples. Soon, someone wants to be notified about job status changes - then comes an event table with a foreign key constraint on the job table's primary key. Also add in job digraphs with sub-tasks and workflows (AKA weak transactions) - these are non-trivial to do well. Depending upon your SQL variant and version, combined with access patterns, row-level locking and indexing can be an O(1) lookup or an O(n^2) nightmare causing tons of contention, depending (once again) upon the manner in which jobs are modified.
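To make the indexing point concrete with one example (hypothetical names, just a sketch): if completed jobs stay in the table, the "find the next runnable job" lookup degrades as history accumulates unless the claimable rows have their own partial index:

    -- Keeps the dequeue scan cheap even when the table is dominated by
    -- already-processed rows.
    CREATE INDEX jobs_ready_idx
      ON jobs (priority DESC, created_at)
      WHERE processed_at IS NULL;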

Instead of trying to map an existing solution onto as many problems as possible, think about "what are my transaction boundaries and what do my asynchronous jobs look like?" - questions that are much more invariant and much more important to the business. Then it should be more clear whether you would be fine with an RDBMS table, Airflow, SQS, Redis, RabbitMQ, JMS, etc. Operations-wise, more components is certainly a headache, but I've had more production headaches due to technology inappropriate for the problem domain than due to "this is just too many parts!"

replies(1): >>anaraz+Hh2
21. anaraz+Hh2[view] [source] [discussion] 2019-05-28 15:37:05
>>devonk+iV1
> Looking solely from a code perspective, how would this be done prior to Postgres 9.5 (before SKIP LOCKED)?

It's possible to implement SKIP LOCKED in userland in PostgreSQL (using NOWAIT, which has been in PG far longer), although it's obviously a bit slower.
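Roughly (a sketch with hypothetical table/column names): walk a batch of candidate rows and try to lock each one with NOWAIT, skipping the ones that raise a lock error:

    CREATE OR REPLACE FUNCTION dequeue_one() RETURNS bigint AS $$
    DECLARE
      candidate bigint;
    BEGIN
      FOR candidate IN
        SELECT id FROM jobs WHERE processed_at IS NULL ORDER BY id LIMIT 10
      LOOP
        BEGIN
          -- Try to take the row lock without waiting.
          PERFORM 1 FROM jobs WHERE id = candidate FOR UPDATE NOWAIT;
          RETURN candidate;  -- locked it; caller processes this job
        EXCEPTION WHEN lock_not_available THEN
          NULL;  -- another worker holds it; emulate SKIP LOCKED by moving on
        END;
      END LOOP;
      RETURN NULL;  -- nothing claimable in this batch
    END;
    $$ LANGUAGE plpgsql;

The row lock taken inside the function is held by the caller's transaction, so the caller does its work and commits (or rolls back) as usual.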
