zlacker

[parent] [thread] 15 comments
1. ekidd+(OP)[view] [source] 2023-09-25 12:15:19
I've built three distributed job systems at this point. A handy rule of thumb which I have promoted for years is "build for 10x your current scale."

If you need to handle 70 requests/second, design for 700. If you need to handle 20 servers running batch jobs, design for 200 servers. If you're in a startup that grows 100% per year, you'll be at 8x scale in 3 years. So you'll have time to rewrite as you grow!

Of the three job systems I've built, the first tried to avoid SQL for "scalability." Then we hit a bunch of edge cases that needed transactional integrity, and we were in a world of pain.

My two more recent distributed job systems use PostgreSQL as a coordinator. They're literally built around SELECT FOR UPDATE SKIP LOCKED. One of them routinely controls 350 workers and the other does elaborate prioritization for thousands of jobs. Both of them will continue to run just fine until they're earning millions of dollars a year—for example, the one controlling 350 workers should scale to about 2,000 CPUs with a little work.
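
(For anyone who hasn't run into the pattern, the core claim query looks roughly like this; it's a sketch with made-up table and column names, not my actual schema.)

    -- A minimal job table, just for illustration.
    CREATE TABLE jobs (
        id         bigserial PRIMARY KEY,
        state      text NOT NULL DEFAULT 'pending',
        payload    jsonb,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- Each worker grabs one pending job. SKIP LOCKED means workers
    -- never block each other waiting on the same row.
    UPDATE jobs
       SET state = 'running'
     WHERE id = (SELECT id FROM jobs
                  WHERE state = 'pending'
                  ORDER BY created_at
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED)
    RETURNING id, payload;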

Hyperscale technologies are fairly cheap right up until you discover you need transactions. At that point, faking transactional semantics on top of an eventually consistent data store becomes an engineering nightmare.

So sit down, and do the math. If your company was earning $100 million/year, how big would your distributed system need to be? Can you easily get a PostgreSQL instance big enough to handle that load? (Or could you just shard per client?) If so, strongly consider using PostgreSQL. It makes a hundred things trivial.

replies(4): >>nerpde+yc >>bunder+7H >>merb+zj1 >>aaroni+9n5
2. nerpde+yc[view] [source] 2023-09-25 13:17:04
>>ekidd+(OP)
Design (and test) for 10x your current scale; build for what you need now. The system also has to handle peak loads, and if you don't know what those are, build in a safety margin or a way to shed or defer work when you need to.

Everything is a tradeoff. Optimize the things that need optimizing; figuring out what those are is the hallmark of a good engineer.

replies(1): >>insani+Kr
3. insani+Kr[view] [source] [discussion] 2023-09-25 14:21:14
>>nerpde+yc
I've found the same. You should understand what your 10x / 100x growth solution would look like (assuming that's relevant; obviously, if you have no intent of hitting that scale, don't bother). Build your system to handle 1-1.5x, maybe 10x, your current scale, and make sure you're not blocking the 10-100x solutions by doing so.
4. bunder+7H[view] [source] 2023-09-25 15:13:08
>>ekidd+(OP)
Similar experience here. Multiple times, I've pushed an SQL-based queue a couple of orders of magnitude past the scale where others say SQL craps out and a distributed solution is an absolute requirement. And the SQL solution is typically simpler, requires fewer compute resources, and is easier to support in production.

But, to make it work, you've got to know the database well enough to know that things like SELECT FOR UPDATE SKIP LOCKED exist in the first place. Which is a kind of knowledge that's getting quite rare these days, because more and more engineers grow up never having known a world where they aren't walled off from their DBMS's true capabilities by a heavyweight ORM.

replies(1): >>runeks+8k3
5. merb+zj1[view] [source] 2023-09-25 17:31:05
>>ekidd+(OP)
you can even build an advisory lock queue in Postgres, which is way slower but has some benefits.
replies(1): >>pas+vk1
◧◩
6. pas+vk1[view] [source] [discussion] 2023-09-25 17:34:20
>>merb+zj1
slower to build or slower to run?
replies(2): >>merb+gl1 >>0x457+KOp
7. merb+gl1[view] [source] [discussion] 2023-09-25 17:37:44
>>pas+vk1
slower to run. But when you keep the Postgres connection open, you know the job is still running, while with FOR UPDATE SKIP LOCKED you would basically need a status and a job_timeout.

pg_try_advisory_lock/pg_advisory_unlock can hold a lock across transactions, while FOR UPDATE SKIP LOCKED can't, so with the latter you'd either need to keep a transaction open or use status+job_timeout (and in Postgres you should not use long transactions).
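
Roughly the shape of the idea, with a made-up jobs table (que itself is more careful about races than this):

    -- Pick a candidate job.
    SELECT id FROM jobs WHERE state = 'pending' ORDER BY created_at LIMIT 1;

    -- Try to take a session-level advisory lock on that id (say it was 42).
    -- Returns true if we now own the job, false if another worker does.
    SELECT pg_try_advisory_lock(42);

    -- ... run the job on this same connection. The lock survives across
    -- transactions, and if the worker dies the connection closes and the
    -- lock is released automatically, so no job_timeout is needed ...

    SELECT pg_advisory_unlock(42);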

Basically, we use C#, but we looked into https://github.com/que-rb/que, which uses advisory locks. Since our jobs take anywhere from 1 minute to 2 hours, advisory locks were a no-brainer. They're just not the best thing if you have thousands of fast jobs per second, but for a more moderate queue, where you have something like 10,000 jobs per minute/10 minutes/30 minutes and they take 1 minute to 2 hours, it's fine.

We also do not delete jobs; we do not care about storage, since the job table basically does not take up much. And we have a lot of time to catch up at night, since we are only in Europe.

replies(1): >>ekidd+lr1
8. ekidd+lr1[view] [source] [discussion] 2023-09-25 18:03:20
>>merb+gl1
Here's my current favorite recipe for building complex job systems on PostgreSQL. I'm not thinking about "send an email"-type jobs, but bigger jobs that do complex tasks.

The usual trick I use is to have a `jobs.state` field containing "pending", "running", "done", or "error" (or whatever that job system needs). I only hold SELECT FOR UPDATE SKIP LOCKED long enough to:

1. Transition from "pending" to "running". Or a second time, to transition from "running" to either "done" or "error".

2. Store the current worker ID (often the Kubernetes pod name).

Then I can build a watcher that wakes up every 5 minutes, looks for "running" jobs with no corresponding Kubernetes pod, and marks them as "error". I try to never hold a lock for more than a second or two, and to never lock more than one job at once. This gets me 80% of the way there.
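
In rough SQL (just a sketch, with made-up column names like worker_id, started_at and finished_at), the two lock-holding moments and the watcher sweep look something like this:

    -- 1. Claim: a short transaction, pending -> running, recording who took it.
    UPDATE jobs
       SET state = 'running',
           worker_id = 'worker-pod-abc123',   -- e.g. the Kubernetes pod name
           started_at = now()
     WHERE id = (SELECT id FROM jobs
                  WHERE state = 'pending'
                  ORDER BY created_at
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED)
    RETURNING id;

    -- 2. Finish: another short transaction, running -> done (or 'error').
    --    $1 = the job id returned by the claim above.
    UPDATE jobs
       SET state = 'done', finished_at = now()
     WHERE id = $1 AND state = 'running';

    -- Watcher sweep, every 5 minutes: any "running" job whose pod is no
    -- longer alive gets flagged. $1 = array of live pod names.
    UPDATE jobs
       SET state = 'error'
     WHERE state = 'running'
       AND NOT (worker_id = ANY($1::text[]));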

The reason I don't hold a transaction open for the entire job is that every transaction requires a PostgreSQL connection, and connections are surprisingly expensive. In fact, you may want to run connections through pgbouncer or a stateless REST API to avoid holding open hundreds or thousands of connections. Everything except PostgreSQL itself should ideally be stateless and restartable.

You might also have a retry system, or jobs that recursively queue up child jobs, or jobs that depend on other jobs, or more elaborate state machines. You might have timeouts. Most of these things are solvable with some mix of transactions, some SQL, CREATE INDEX, or CREATE VIEW. A database gives you so many things for free, if you're just a little careful about it.
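
For example, a partial index (hypothetical names again) keeps the "grab the next pending job" query fast even when the table holds millions of finished rows:

    -- Only pending rows are indexed, so the claim query stays cheap no
    -- matter how many done/error rows pile up.
    CREATE INDEX jobs_pending_idx ON jobs (created_at) WHERE state = 'pending';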

And since Grafana supports SQL, you can easily build really nice dashboards for support and ops by just querying the tables used by the job system.
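
A panel can be as simple as something like this (again, a sketch against the hypothetical table above):

    -- Queue depth per state, plus the timestamp of the oldest job in each state.
    SELECT state, count(*) AS jobs, min(created_at) AS oldest
      FROM jobs
     GROUP BY state;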

There are other ways to do it! But I'm fond of this general strategy for coarse-granularity jobs.

replies(1): >>runeks+Xk3
9. runeks+8k3[view] [source] [discussion] 2023-09-26 06:39:22
>>bunder+7H
> Multiple times, I've pushed an SQL-based queue a couple orders of magnitude past the scale where others say SQL craps out and a distributed solution is an absolute requirement.

What about availability, though? The distributed solution is also useful to avoid downtime in case of single node failure.

Is there an off-the-shelf solution that lets me do that with Postgres? I know the newest version (16) just added active-active replication, but I wouldn’t know how to use that to achieve resilience.

replies(3): >>adamck+HV3 >>dekker+yX3 >>bunder+j1g
10. runeks+Xk3[view] [source] [discussion] 2023-09-26 06:45:37
>>ekidd+lr1
Isn’t this just >>29599132 with the addition of a custom worker ID (pod name)?
11. adamck+HV3[view] [source] [discussion] 2023-09-26 11:16:23
>>runeks+8k3
But if you're using Postgres as your queuing system because you're already using it as your core database technology for your app, you've got the same issue. If your single Postgres instance is down then your app is, too, and won't be enqueuing more jobs.

And unless your jobs are trivial, it's highly likely that they interact with your app in some way, so it doesn't really matter if your workers are distributed and up: they can't complete their work, because your app is down due to a single-node Postgres.

replies(1): >>acaloi+OD6
12. dekker+yX3[view] [source] [discussion] 2023-09-26 11:30:13
>>runeks+8k3
What the article says about scalability also applies to availability. Does the queue need 99.999% or 99.9999% uptime? Or is the Service Level Objective actually 99.99%, 99.9%, or even 99.5%?

With 99.99% you can have about 4 minutes of downtime a month (0.01% of a roughly 43,200-minute month is about 4.3 minutes). If failover to a hot standby takes a minute, then achieving a 99.99% uptime SLO shouldn't be a problem.

13. aaroni+9n5[view] [source] 2023-09-26 17:52:15
>>ekidd+(OP)
"faking transactional semantics on top of an eventually consistent data store becomes an engineering nightmare"

PREACH

If this is news to you, you just justified a week's worth of browsing HN on the clock reading this statement. Scribble it on a scrap of paper and keep it taped to your ATM card.

14. acaloi+OD6[view] [source] [discussion] 2023-09-27 00:33:41
>>adamck+HV3
What you're pointing out is an architectural constraint that's unrelated to how and where one queues jobs.

If background jobs need to be available while some other core application is down, that needs to be designed for, and that design can be achieved with any queue technology. Simply separate the queue system stack from the core application system stack.

> But if you're using Postgres as your queuing system because you're already using it as your core database technology

Note your own use of "database technology" and not "database server". It's common to have separate application and queue database servers when such an architectural constraint is present. Of course, separating them sacrifices the transactional guarantees you get when the application and background jobs share the same server.

Like I said in the post, technology (and architectural) choices are tradeoffs all the way down :)

15. bunder+j1g[view] [source] [discussion] 2023-09-29 13:40:03
>>runeks+8k3
I don't know PostgreSQL as well. I mostly use MSSQL in production, and it's had good replication for ages, so I just wasn't really that worried about single-node failure for the critical stuff.

And, frankly, even for the less-critical stuff that was only running on a single node, I still dealt with fewer availability problems back in the day than I do now that everything's gone distributed. I think one thing that's been forgotten over the years is that a lot of what distributed systems do to be reliable was really about digging oneself out of the hole created by running on lots of cheap hardware instead of a single server with redundancy built in. I acknowledge that, past a certain scale, that's the only option that makes sense. But if you're not operating at that scale, there's a good chance it's all just verschlimmbessern (making things worse by trying to improve them).

16. 0x457+KOp[view] [source] [discussion] 2023-10-02 17:43:22
>>pas+vk1
IMO the main issue with it: advisory locks in Postgres require an open connection to be held the entire time the lock is taken. Combine that with a thread-per-connection model...