zlacker

> Multiple times, I've pushed an SQL-based queue a couple orders of magnitude past the scale where others say SQL craps out and a distributed solution is an absolute requirement.

What about availability, though? The distributed solution is also useful to avoid downtime in case of single node failure.

Is there an off-the-shelf solution that lets me do that with Postgres? I know the newest version (16) just added active-active replication, but I wouldn’t know how to use that to achieve resilience.

replies(3): >>adamck+zB >>dekker+qD >>bunder+bHc

>>runeks+(OP)
But if you're using Postgres as your queuing system because you're already using it as your core database technology for your app, you've got the same issue. If your single Postgres instance is down then your app is, too, and won't be enqueuing more jobs.

And unless your jobs are trivial then it's highly likely that they interact with your app in some way so it doesn't really matter if your workers are distributed and up, they're not able to complete their work because your app is down because of a single-node Postgres.

replies(1): >>acaloi+Gj3

>>runeks+(OP)
What the article writes about scalability also applies to availability. Does the queue need 99.999% or 99.9999% uptime? Or is the Service Level Objective actually 99.99%, 99.9% or even 99.5%?

With 99.99% you can have 4 minutes of downtime a month. If failover to a hot standby takes a minute then that shouldn't be a problem to achieve a 99.99% uptime SLO.

>>adamck+zB
What you're pointing out is an architectural constraint that's unrelated to how and where one queues jobs.

If background jobs need to be available while some other core application is down, that needs to be designed for, and that design can be achieved with any queue technology. Simply separate the queue system stack from the core application system stack.

> But if you're using Postgres as your queuing system because you're already using it as your core database technology

Note your own use of "database technology" and not "database server". It's common to have separate application and queue database servers when such an architectural constraint is present. Of course, this sacrifices the benefit of transactional guarantees when the application and background jobs run on the same server.

Like I said in the post, technology (and architectural) choices are tradeoffs all the way down :)

>>runeks+(OP)
I don't know PostgreSQL as well. I mostly use MSSQL in production, and it's had good replication for ages, so I just wasn't really that worried about single-node failure for the critical stuff.

And, frankly, even for the less-critical stuff that was only running on a single node, I still dealt with fewer availability problems back in the day than I do now that everything's gone distributed. I think that a thing that's been forgotten over the years is that a lot of this stuff that distributed systems do to be reliable was more about digging oneself out of the hole that was created by running on lots of cheap hardware instead of using a single server with redundancy built-in. I acknowledge that, past a certain scale, that's the only option that makes sense. But if you're not operating at that scale then there's a good chance it's all just verschlimmbessern.