zlacker

Postgres is a great pub/sub and job server (2019)

submitted by anonu+(OP) on 2021-12-17 22:27:35 | 468 points 200 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only

3. sbaild+d9[view] [source] 2021-12-17 23:43:50
>>anonu+(OP)
Oban[1] from the Elixir ecosystem leans on Postgres for job scheduling

[1] https://github.com/sorentwo/oban

4. amarsh+R9[view] [source] 2021-12-17 23:48:25
>>anonu+(OP)
It’s also possible to use advisory locks to implement a job queue in Postgres; see e.g. Que[1]. Note that there are a fair number of corner cases, so if you're trying to implement something like this, it's wise to study Que, as well as some (slightly older) elaboration on the pitfalls[2].

We implemented a similar design to Que for a specific use case in our application that has a known low volume of jobs and for a variety of reasons benefits from this design over other solutions.

[1]: https://github.com/que-rb/que [2]: https://brandur.org/postgres-queues
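
To give a flavour of the approach, here is a rough sketch using advisory locks (the schema is invented for illustration and this is not Que's actual implementation; the corner cases mentioned above are real, e.g. the planner may take locks on rows it never returns):

  -- Grab one runnable job that no other worker currently holds:
  SELECT id, payload
  FROM jobs
  WHERE run_at <= now()
    AND pg_try_advisory_lock(id)  -- non-blocking; jobs locked by other workers are skipped
  LIMIT 1;

  -- After doing the work, remove the job and release the session-level lock:
  DELETE FROM jobs WHERE id = $1;
  SELECT pg_advisory_unlock($1);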

5. daenz+X9[view] [source] 2021-12-17 23:49:44
>>anonu+(OP)
Strong disagree on using a database as a message queue. This article[0] covers many of the reasons why. Summary: it adds application complexity and doesn't scale well as you add workers.

0. https://www.cloudamqp.com/blog/why-is-a-database-not-the-rig...

EDIT>> I am not suggesting people build their own rabbitmq infrastructure. Use a cloud service. The article is informational only.

7. vlmuto+5b[view] [source] 2021-12-17 23:57:08
>>anonu+(OP)
I had thought about using postgres as a job queue before, but I couldn't figure out in my head how to make sure two processes didn't both take the same job. The "FOR UPDATE" and "SKIP LOCKED" clauses in the article were the keys to making this work. Essentially, as far as I can tell, "SELECT FOR UPDATE" locks the rows as they're selected (the locks are apparently visible outside the transaction), and "SKIP LOCKED" makes the select skip over rows that other transactions have locked. Cool stuff.
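
For the curious, a minimal sketch of that dequeue (table and column names are invented; the selected row stays locked until the transaction ends):

  BEGIN;

  -- Two workers running this concurrently never get the same row: the
  -- second one skips the locked row and takes the next one instead.
  SELECT id, payload
  FROM jobs
  WHERE status = 'queued'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED;

  -- ...do the work, then mark the job done (or just DELETE the row):
  UPDATE jobs SET status = 'done' WHERE id = $1;

  COMMIT;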

The below article goes over some downsides to using postgres for a job queue: namely that the data model isn't optimized to find and send new jobs to large numbers of subscribers. Actual messaging systems like Kafka/NATS are better at this.

https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...

Also, there are things that the dedicated messaging systems give you that postgres won't, such as transparent error handling and retry logic. If your worker grabs a job from postgres and fails (in some kind of transient way such that the task should be retried), you'll have to implement the retry logic yourself. A streaming platform like Kafka or NATS will notice that you haven't acknowledged the message and deliver it to someone else.

This is something you could pretty easily implement yourself with something like a process that just scans over the "processing" jobs and looks for timeouts, resetting them as it goes. But there are probably a few of these little papercuts that Kafka-like (…Kafkaesque?) systems would handle for you.
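
A sketch of such a sweeper, assuming the worker marks a row 'processing' and commits before doing the work (the columns and the 10-minute timeout are arbitrary choices for illustration):

  -- Re-queue jobs whose workers appear to have stalled or died:
  UPDATE jobs
  SET status = 'queued',
      attempts = attempts + 1
  WHERE status = 'processing'
    AND started_at < now() - interval '10 minutes';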

So, I guess if you already have postgres set up in a reliable way, and there's no one around whose job it is to set up new production systems like Kafka, and you don't already have something like Kafka, and you only need basic job queue requirements or you're okay with implementing whatever you need on top of postgres yourself, and your incoming job rate is fewer than maybe a thousand per second, and you have fewer than maybe tens of consumers… postgres is probably a decent job queue.

The above paragraph seems snide but it probably does describe many people's environments.

◧◩
17. renewi+9e[view] [source] [discussion] 2021-12-18 00:18:52
>>daenz+X9
In the end, engineering is just about whether the thing works with minimal maintenance. And ultimately, I've had great experiences using a DBaaQ (database-as-a-queue) for low-volume data movement. It works as a persistent queue with easy-to-do retries, etc.

For high throughput (we had ad tech servers with 1E7 hits/s) we used a home-built low-latency queue that supported both real-time and persisted data. But for low-throughput stuff, the DBaaQ worked fine.

And ultimately, maybe it was a lack of imagination on our part since Segment was successful with a mid-throughput DBaaQ https://segment.com/blog/introducing-centrifuge/

18. colinc+be[view] [source] 2021-12-18 00:19:11
>>anonu+(OP)
Author here! A few updates since this was published two years ago:

- The service mentioned (now called https://webapp.io ) eventually made it into YC (S20) and still uses Postgres as its pub/sub implementation, doing hundreds of thousands of messages per day. The Postgres instance now runs on 32 cores and 128 GB of memory and has scaled well.

- We bolstered Postgres's NOTIFY with Redis pub/sub for high-traffic code paths, but it's been nice having ACID guarantees as the default for less popular paths (e.g., webhook handling).

- This pattern has only ever caused one operational incident: a transaction held a lock, which caused the notification queue to start growing and eventually (silently) stop sending messages. Starting Postgres with statement_timeout=(a few days) was enough to solve it (see the sketch below).
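
For anyone who wants to guard against the same failure mode, a sketch of the relevant knobs (the timeout value below is just a placeholder):

  -- Fraction of the NOTIFY queue currently in use (0 = empty, 1 = full):
  SELECT pg_notification_queue_usage();

  -- One way to apply a statement timeout cluster-wide without a restart:
  ALTER SYSTEM SET statement_timeout = '2d';
  SELECT pg_reload_conf();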

Previous discussion: https://news.ycombinator.com/item?id=21484215

Happy to answer any questions!

◧◩
21. anothe+se[view] [source] [discussion] 2021-12-18 00:20:55
>>petilo+tb
I've been on stage at KafkaConf demo-ing my Kafka SRE chops, and I would avoid Kafka until I am sure it is necessary.

'easy to learn and use' is a downright lie.

edit: link to this same topic being discussed a few weeks ago: https://news.ycombinator.com/item?id=28903614#28904103

24. anothe+Ve[view] [source] 2021-12-18 00:24:27
>>anonu+(OP)
Related previous discussion: https://news.ycombinator.com/item?id=28903614#28904103
◧◩
29. seanp2+ef[view] [source] [discussion] 2021-12-18 00:27:38
>>vlmuto+5b
This. The article reads like "one weird trick that message queue companies HATE": as far as I understand, it uses some SQL semantics in a very specific way to cobble together what other software is designed to do out of the box. It seems fine for a toy system, but I wouldn't stake the success of a real company on this approach.

One could also use DNS TXT records as an RDBMS, with some interesting fault-tolerance and distribution schemes. That doesn't mean it's a good idea or the best way to solve a problem.

If you haven't seen them already, the Jepsen analyses are really worth a read: https://aphyr.com/posts/293-jepsen-kafka https://aphyr.com/posts/282-jepsen-postgres https://aphyr.com/tags/jepsen

◧◩
30. simonw+jf[view] [source] [discussion] 2021-12-18 00:28:23
>>daenz+X9
I'm increasingly of the opinion that relational databases are absolutely the right way to build queue systems for most projects.

One of the biggest advantages comes when you start thinking about them in terms of transactions. Transactional guarantees are really useful here: a message is guaranteed to be written to the queue if the transaction commits successfully, and guaranteed NOT to be written otherwise.

https://brandur.org/job-drain describes a great pattern for achieving that using PostgreSQL transactions.
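
A minimal sketch of what that guarantee buys you, with invented table names (as I understand it, the linked post layers a staged-jobs table and a separate drainer process on top of essentially this idea):

  BEGIN;

  -- the actual business write:
  INSERT INTO invoices (customer_id, total_cents) VALUES (42, 9900);

  -- enqueue the follow-up work in the same transaction:
  INSERT INTO jobs (kind, payload)
  VALUES ('send_invoice_email', '{"customer_id": 42}');

  COMMIT;  -- either both rows become visible to workers, or neither does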

33. nicobu+4g[view] [source] 2021-12-18 00:33:12
>>anonu+(OP)
If anyone is looking to use Postgres as a job server with Node.js clients, I can highly recommend the pg-boss library (https://github.com/timgit/pg-boss). It looks like it just added pub/sub (yesterday), so I guess you could use it for that too.

38. andrew+1h[view] [source] 2021-12-18 00:40:19
>>anonu+(OP)
I wrote StarQueue https://www.starqueue.org which uses Postgres as a back end.

I actually wrote StarQueue for MySQL, Microsoft SQL Server, and also Postgres - they all work just fine as a message queue back end, because they all support SKIP LOCKED.

I started out including Oracle as a supported database but found that I loathed Oracle's complexity.

40. static+gh[view] [source] 2021-12-18 00:42:21
>>anonu+(OP)
So SKIP LOCKED is a pretty well-worn optimization at this point. People have been doing this for a while.

What kind of TPS are people seeing on queues based on psql?

edit: https://gist.github.com/chanks/7585810

10k/s here, but that was on a postgres from years ago - there have been like 4 or 5 major versions since then I think.

That's a good amount, and I'm betting you can push it higher. Further, a queue is trivially sharded, since messages are entirely isolated from one another.

That said, Kafka can do hundreds of thousands if not millions of messages per second, so clearly there's room for optimizations.
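
On the "trivially sharded" point, one low-effort reading of that is plain hash partitioning of the queue table (the layout below is invented; sharding across separate instances would instead mean routing on the same hash in the application):

  CREATE SEQUENCE jobs_id_seq;

  CREATE TABLE jobs (
    id      bigint NOT NULL DEFAULT nextval('jobs_id_seq'),
    payload jsonb,
    status  text DEFAULT 'queued'
  ) PARTITION BY HASH (id);

  CREATE TABLE jobs_0 PARTITION OF jobs FOR VALUES WITH (MODULUS 4, REMAINDER 0);
  CREATE TABLE jobs_1 PARTITION OF jobs FOR VALUES WITH (MODULUS 4, REMAINDER 1);
  CREATE TABLE jobs_2 PARTITION OF jobs FOR VALUES WITH (MODULUS 4, REMAINDER 2);
  CREATE TABLE jobs_3 PARTITION OF jobs FOR VALUES WITH (MODULUS 4, REMAINDER 3);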

◧◩
73. bjt+Cs[view] [source] [discussion] 2021-12-18 02:21:10
>>amarsh+R9
There's something even nicer than advisory locks, as of a few years ago. https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...
◧◩
74. preeta+Qt[view] [source] [discussion] 2021-12-18 02:32:24
>>montro+Ec
> This is fantastic for relatively low volume queues with intensive work to be done by a small number of workers. For these types of use cases, I'll take this approach over RabbitMQ or Kafka all day long.

That’s what our workload is like for our SaaS code analysis platform. We create a few tasks (~10 max) for every customer submission (usually triggered by a code push). We replaced Kafka with a PostgreSQL table a couple of years ago.

We made the schema, functions, and Grafana dashboard open source [0]. I think it’s slightly out-of-date but mostly the same as what we have now in production, and has been running perfectly.

[0] https://github.com/ShiftLeftSecurity/sql-task-queue

◧◩◪
91. amarsh+nB[view] [source] [discussion] 2021-12-18 03:51:55
>>bjt+Cs
Some comments from brandur on SKIP LOCKED in https://github.com/brandur/sorg/pull/263. I haven’t looked into it too much since what we have works without any issues; but good to know, thanks!
◧◩
96. gurjee+bD[view] [source] [discussion] 2021-12-18 04:12:12
>>osigur+Fh
Supabase's Realtime [1] is one of the solutions that can help with that. It doesn't exactly let you LISTEN at scale, but it does allow applications to be notified of changes in the database.

> Listen to changes in a PostgreSQL Database and broadcasts them over WebSockets

[1]: https://github.com/supabase/realtime

Disclosure: I'm a Supabase employee.

◧◩◪
97. gurjee+CD[view] [source] [discussion] 2021-12-18 04:15:12
>>javajo+7D
You (the application) can certainly go on to issue other database commands. See the docs [1].

> With the libpq library, the application issues LISTEN as an ordinary SQL command, and then must periodically call the function PQnotifies to find out whether any notification events have been received.

[1]: https://www.postgresql.org/docs/current/sql-listen.html
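
A tiny illustration of that flow at the SQL level (channel name and payload are invented):

  -- connection A:
  LISTEN job_events;
  -- ...connection A keeps issuing ordinary queries on this same session...

  -- connection B:
  NOTIFY job_events, 'job 123 finished';

  -- connection A is handed the notification between its own commands, the next
  -- time its client polls for it (e.g. via libpq's PQnotifies, as quoted above).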

108. nodeso+cH[view] [source] 2021-12-18 05:00:54
>>anonu+(OP)
Anybody using graphile-worker[1] in production/heavy load? It looks awesome, and I coded up some simple prototype tasks (email, sms, etc.), but I question how it truly scales. They claim horizontal scaling is trivial.

> graphile-worker is horizontally scalable. Each instance has a customisable worker pool, this pool defaults to size 1 (only one job at a time on this worker) but depending on the nature of your tasks (i.e. assuming they're not compute-heavy) you will likely want to set this higher to benefit from Node.js' concurrency.

[1] https://github.com/graphile/worker

134. kissgy+mY[view] [source] 2021-12-18 08:33:07
>>anonu+(OP)
If you are interested in how to implement this in Python, check out this Gist: https://gist.github.com/kissgyorgy/beccba1291de962702ea9c237...

It's really simple. I used this snippet for a real-time application which can react to any kind of change in the database instantly.

◧◩◪◨⬒⬓⬔⧯▣▦
146. fanf2+y71[view] [source] [discussion] 2021-12-18 10:35:00
>>daenz+BP
I guess the clever lock semantics are SKIP LOCKED, which is designed to support efficient queues. The cleverness is inside PostgreSQL rather than in the application, other than the cleverness of knowing about this feature. https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...
◧◩◪◨
155. cpursl+5j1[view] [source] [discussion] 2021-12-18 12:58:56
>>seanwi+nP
Considering that it's Elixir/Erlang, presumably millions: https://phoenixframework.org/blog/the-road-to-2-million-webs...
◧◩◪
156. cpursl+Ij1[view] [source] [discussion] 2021-12-18 13:06:51
>>gurjee+bD
I like Supabase's approach over pub/sub. One of the big advantages is that they listen to the Postgres WAL, which overcomes the 8000-byte payload limitation[1] of the NOTIFY approach.

And Elixir is especially well suited for this type of workload. I actually extracted much of the Supabase Realtime library so that I could work with the data directly in Elixir[2].

[1]: https://github.com/supabase/realtime#why-not-just-use-postgr...

[2]: https://github.com/cpursley/walex

◧◩◪
164. Dauros+5q1[view] [source] [discussion] 2021-12-18 14:12:46
>>anonu+ve
AFAIK the only one is Dramatiq [1] with dramatiq-pg [2]: a 3rd party message broker using Postgres LISTEN/NOTIFY.

[1]: https://dramatiq.io/

[2]: https://gitlab.com/dalibo/dramatiq-pg/

◧◩◪◨⬒⬓
174. Clumsy+NF1[view] [source] [discussion] 2021-12-18 16:11:42
>>brodou+iy1
The IoT hubs are embedded systems, built with a minimal memory footprint and overhead; 512 MB of RAM is typical, sometimes less. Here is an example: https://www.gl-inet.com/products/gl-s1300/

That means you can't have Docker and different versions of Java, Node and .NET all running in parallel.

You run a single process, and SQLite is a library, so the SQL operations and the database itself are built in. Your 'budget' is something like 100 MB of RAM, because other stuff has to run too.

All the time-series databases I know of are large, memory-hungry hippos, built for distributed compute/Kubernetes. It's just a very different use case. If one were built with minimalism in mind, then it could be used.

◧◩◪
178. oftenw+iL1[view] [source] [discussion] 2021-12-18 16:44:13
>>dirkt+0J
https://impossibl.github.io/pgjdbc-ng/docs/0.8.9/user-guide/...

or

https://github.com/pgjdbc/r2dbc-postgresql#listennotify

181. gregor+pS1[view] [source] 2021-12-18 17:25:18
>>anonu+(OP)
I know there used to be some O(n^2) logic associated with notify. Anyone know if that ever got fixed? Cursory search only turns up https://postgrespro.com/list/thread-id/2407396

  The benchmark shows that the time to send notifications grows 
  quadratically with the number of notifications per transaction without 
  the patch while it grows linearly with the patch applied and 
  notification collapsing disabled.
[go to top]