Choose Postgres queue technology

>>bo0tzz+(OP)
I've built three distributed job systems at this point. A handy rule of thumb which I have promoted for years is "build for 10x your current scale."

If you need to handle 70 requests/second, design for 700. If you need to handle 20 servers running batch jobs, design for 200 servers. If you're in a startup that grows 100% per year, you'll be at 8x scale in 3 years. So you'll have time to rewrite as you grow!

Out of three job systems I built, the first one tried to avoid SQL for "scalability." Then we hit a bunch of edge cases that needed transactional integrity, and we were in a world of pain.

My two more recent distributed job systems use PostgreSQL as a coordinator. They're literally built around SELECT FOR UPDATE SKIP LOCKED. One of them routinely controls 350 workers and the other does elaborate prioritization for thousands of jobs. Both of them will continue to run just fine until they're earning millions of dollars a year—for example, the one controlling 350 workers should scale to about 2,000 CPUs with a little work.

Hyperscale technologies are fairly cheap right up until you discover you need transactions. At that point, faking transactional semantics on top of an eventually consistent data store becomes an engineering nightmare.

So sit down, and do the math. If your company was earning $100 million/year, how big would your distributed system need to be? Can you easily get a PostgreSQL instance big enough to handle that load? (Or could you just shard per client?) If so, strongly consider using PostgreSQL. It makes a hundred things trivial.

>>ekidd+et1
Similar experience here. Multiple times, I've pushed an SQL-based queue a couple orders of magnitude past the scale where others say SQL craps out and a distributed solution is an absolute requirement. And the SQL solution is typically simpler, requires fewer compute resources, and easier to support in production.

But, to make it work, you've got to know the database well enough to know that things like SELECT FOR UPDATE SKIP LOCKED exist in the first place. Which is a kind of knowledge that's getting quite rare these days, because more and more engineers grow up never having known a world where they aren't walled off from their DBMS's true capabilities by a heavyweight ORM.

>>bunder+la2
> Multiple times, I've pushed an SQL-based queue a couple orders of magnitude past the scale where others say SQL craps out and a distributed solution is an absolute requirement.

What about availability, though? The distributed solution is also useful to avoid downtime in case of single node failure.

Is there an off-the-shelf solution that lets me do that with Postgres? I know the newest version (16) just added active-active replication, but I wouldn’t know how to use that to achieve resilience.

>>runeks+mN4
I don't know PostgreSQL as well. I mostly use MSSQL in production, and it's had good replication for ages, so I just wasn't really that worried about single-node failure for the critical stuff.

And, frankly, even for the less-critical stuff that was only running on a single node, I still dealt with fewer availability problems back in the day than I do now that everything's gone distributed. I think that a thing that's been forgotten over the years is that a lot of this stuff that distributed systems do to be reliable was more about digging oneself out of the hole that was created by running on lots of cheap hardware instead of using a single server with redundancy built-in. I acknowledge that, past a certain scale, that's the only option that makes sense. But if you're not operating at that scale then there's a good chance it's all just verschlimmbessern.

zlacker