https://www.pgcasts.com/episodes/the-skip-locked-feature-in-...
It’s not “web scale” but it easily extends to several thousand background jobs in my experience
Python has Celery, but maybe the author is looking for more choice between brokers. https://docs.celeryq.dev/en/stable/index.html
https://github.com/bensheldon/good_job
Had it in production for about a quarter and it’s worked well.
1. The main downside to using PostgreSQL as a pub/sub bus with LISTEN/NOTIFY is that LISTEN is a session-level feature, making it incompatible with statement-level connection pooling (a sketch of the dedicated-session pattern this forces follows the links below).
2. If you are going to do this use advisory locks [0]. Other forms of explicit locking put more pressure on the database while advisory locks are deliberately very lightweight.
My favorite example implementation is que [1] which is ported to several languages.
[0] https://www.postgresql.org/docs/current/explicit-locking.htm...
[1] https://github.com/que-rb/que
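As a rough illustration of point 1, here is a minimal psycopg2 sketch of the dedicated-session pattern LISTEN requires (the channel name new_job and the connection string are made up for the example); because the connection must stay open and be polled directly, it can't go through a statement-level pooler:

import select

import psycopg2
import psycopg2.extensions

# The listening connection has to be a real, long-lived session.
conn = psycopg2.connect('dbname=jobs user=jobsuser')  # placeholder DSN
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute('LISTEN new_job;')

while True:
    # Wait for the socket to become readable, then drain any notifications.
    if select.select([conn], [], [], 5) == ([], [], []):
        continue  # timed out, poll again
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print('got notification:', notify.channel, notify.payload)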
I built a complete implementation in Python designed to work the same as SQS but simpler:
https://github.com/starqueue/starqueue
Alternatively if you just want to quickly hack something into your application, here is a complete solution in one Python function with retries (ask ChatGPT to tell you what the table structure is):
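For reference, here is a guess at the minimal schema the snippet below assumes; the column names are inferred from the queries, while the types and defaults are assumptions rather than anything the original author specified:

# Hypothetical DDL inferred from the queries below; run it once with the same
# cursor (cur.execute(SCHEMA_SQL); conn.commit()) or via psql.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS jobs (
    id       bigserial PRIMARY KEY,
    status   text    NOT NULL DEFAULT 'new_waiting',
    attempts integer NOT NULL DEFAULT 0,
    payload  jsonb
);
CREATE TABLE IF NOT EXISTS message_queue (
    id        bigserial PRIMARY KEY,
    target_id bigint REFERENCES jobs (id),
    status    text NOT NULL DEFAULT 'new',
    created   timestamptz NOT NULL DEFAULT now()
);
"""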
import random

import psycopg2
import psycopg2.extras

db_params = {
    'database': 'jobs',
    'user': 'jobsuser',
    'password': 'superSecret',
    'host': '127.0.0.1',
    'port': '5432',
}

conn = psycopg2.connect(**db_params)
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

def do_some_work(job_data):
    # Simulate work that randomly fails so the retry path gets exercised.
    if random.choice([True, False]):
        print('do_some_work FAILED')
        raise Exception
    else:
        print('do_some_work SUCCESS')

def process_job():
    # Claim and delete the oldest queue item, skipping rows other workers hold.
    sql = """DELETE FROM message_queue
    WHERE id = (
        SELECT id
        FROM message_queue
        WHERE status = 'new'
        ORDER BY created ASC
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING *;
    """
    cur.execute(sql)
    queue_item = cur.fetchone()
    if queue_item is None:
        print('no queue items to process')
        conn.commit()
        return
    print('message_queue says to process job id: ', queue_item['target_id'])
    sql = """SELECT * FROM jobs WHERE id = %s AND status = 'new_waiting' AND attempts <= 3 FOR UPDATE;"""
    cur.execute(sql, (queue_item['target_id'],))
    job_data = cur.fetchone()
    if job_data:
        try:
            do_some_work(job_data)
            sql = """UPDATE jobs SET status = 'complete' WHERE id = %s;"""
            cur.execute(sql, (queue_item['target_id'],))
        except Exception:
            sql = """UPDATE jobs SET status = 'failed', attempts = attempts + 1 WHERE id = %s;"""
            # if we want the job to run again, insert a new item into the message queue with this job id
            cur.execute(sql, (queue_item['target_id'],))
    else:
        print('no job found, did not get job id: ', queue_item['target_id'])
    conn.commit()

process_job()
cur.close()
conn.close()

are you queuing jobs, or are you queuing messages?
that's a fuzzy distinction, so somewhat equivalently, what's the expected time it takes for a worker to process a given queue item?
at one end, an item on the queue may take several seconds to a minute or longer to process. at the other end, an item might take only a few milliseconds to process. in that latter case, it's often useful to do micro-batching, where a single worker pulls 100 or 1000 items off the queue at once, and processes them as a batch (such as by writing them to a separate datastore)
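as a sketch of what that micro-batching looks like on postgres (assuming a psycopg2 cursor and an items table with id/payload/created columns, all of which are placeholders for the example):

def claim_batch(cur, batch_size=1000):
    # claim up to batch_size items in one statement; SKIP LOCKED lets
    # concurrent workers grab disjoint batches without blocking each other
    cur.execute("""
        DELETE FROM items
        WHERE id IN (
            SELECT id FROM items
            ORDER BY created ASC
            FOR UPDATE SKIP LOCKED
            LIMIT %s
        )
        RETURNING id, payload;
    """, (batch_size,))
    return cur.fetchall()

# the whole batch can then be processed in one go, e.g. a single bulk write
# to a separate datastore, followed by conn.commit()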
the "larger" the items are (in terms of wall-clock processing time, not necessarily in terms of size in bytes of the serialized item payload) the more effective the database-as-a-queue solution is, in my experience.
as queue items get smaller / shorter to process, and start to feel more like "messages" rather than discrete "jobs", that's when I tend to reach for a queue system over a database.
for example, there's a RabbitMQ blog post [0] on cluster sizing where their recommendations start at 1000 messages/second. that same message volume on a database-as-a-queue would require, generally speaking, 3000 write transactions per second (if we assume one transaction to enqueue the message, one for a worker to claim it, and one for a worker to mark it as complete / delete it).
can Postgres and other relational databases be scaled & tuned to handle that write volume? yes, absolutely. however, how much write volume are you expecting from your queue workload, compared to the write volume from its "normal database" workload? [1]
I think that ends up being a useful heuristic when deciding whether or not to use a database-as-a-queue - will you have a relational database with a "side gig" of acting like a queue, or will you have a relational database that in terms of data volume is primarily acting like a queue, with "normal database" work relegated to "side gig" status?
0: https://blog.rabbitmq.com/posts/2020/06/cluster-sizing-and-o...
1: there's also a Postgres-specific consideration here where a lot of very short-lived "queue item" database rows can put excessive pressure on the autovacuum system.
QC (and equivs) use the same db, and same connection, so same transaction. Saves quite a bit of cruft.
My guess is that many people are implementing queuing mechanisms just for sending email.
You can see how this works in Arnie SMTP buffer server, a super simple queue just for emails, no database at all, just the file system.
delete from task
where task_id in
  ( select task_id
    from task
    order by random() -- use tablesample for better performance
    for update
    skip locked
    limit 1
  )
returning task_id, task_type, params::jsonb as params
[1] https://taylor.town/pg-task
Transactionally Staged Job Drains in Postgres - https://brandur.org/job-drain
It's about the challenge of matching up transactions with queues - where you want a queue to be populated reliably if a transaction completes, and also reliably NOT be populated if it doesn't.
Brandur's pattern is to have an outgoing queue in a database table that gets updated as part of that transaction, and can then be separately drained to whatever queue system you like.
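A rough sketch of the shape of that pattern (not Brandur's actual code; the staged_jobs table, its columns, and the publish callback are placeholders):

import psycopg2
import psycopg2.extras

def activate_user(conn, user_id):
    # The staged job is written in the same transaction as the business change,
    # so it exists if and only if the change committed.
    with conn, conn.cursor() as cur:
        cur.execute("UPDATE users SET activated = true WHERE id = %s", (user_id,))
        cur.execute(
            "INSERT INTO staged_jobs (job_name, job_args) VALUES (%s, %s)",
            ("send_activation_email", psycopg2.extras.Json({"user_id": user_id})),
        )

def drain_staged_jobs(conn, publish):
    # Runs in a loop elsewhere: move committed staged rows to the real queue.
    # If publish() fails, the transaction rolls back and the rows are retried.
    with conn, conn.cursor() as cur:
        cur.execute("""
            DELETE FROM staged_jobs
            WHERE id IN (
                SELECT id FROM staged_jobs
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 100
            )
            RETURNING job_name, job_args;
        """)
        for job_name, job_args in cur.fetchall():
            publish(job_name, job_args)  # e.g. Sidekiq / RabbitMQ / SQS client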
Oban is able to run over 1m jobs a minute, and the ultimate bottleneck is throttling in application code to prevent thrashing the database: https://getoban.pro/articles/one-million-jobs-a-minute-with-...
Having transactions is quite handy.
https://wiki.postgresql.org/wiki/SkyTools
I did a few talks on this at Sydpy as I used it at work quite a bit. It's handy when you already have postgresql running well and supported.
This said, I'd use a dedicated queue these days. Anything but RabbitMQ.
Recently I had to stop using it because after a while all NOTIFY/LISTENS would stop working, and only a database restart would fix the issue https://dba.stackexchange.com/questions/325104/error-could-n...
In the beginning you can do a naive UPDATE ... SET, which locks way too much. You can make the locking more efficient by doing dequeues as UPDATEs with SELECT ... FOR UPDATE SKIP LOCKED subqueries, but eventually your dequeue queries will throttle each other's locks and your queue will grind to a halt. You can try to disable enqueues at that point to give your DB more breathing room, but then you have data loss on the dropped enqueues, and it'll mostly be your dequeues locking each other out anyway.
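For reference, the "more efficient" locking being described looks roughly like this (a sketch; the jobs table, its status values, and the claimed_at column are placeholders):

import psycopg2

conn = psycopg2.connect('dbname=jobs')  # placeholder DSN
cur = conn.cursor()

# claim one job without blocking on rows other workers have already locked
cur.execute("""
    UPDATE jobs
    SET status = 'running', claimed_at = now()
    WHERE id = (
        SELECT id FROM jobs
        WHERE status = 'queued'
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload;
""")
job = cur.fetchone()
conn.commit()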
You can try very quickly to shard out your task tables to avoid locking and that may work but it's brittle to roll out across multiple workers and can result in data loss. You can of course drop a random subset of tasks but this will cause data loss. Any of these options is not only highly stressful in a production scenario but also very hard to recover from without a ground-up rearchitecture.
Is this kind of a nightmare production scenario really worth choosing Boring Technology? Maybe if you have a handful of customers and are confident you'll be working at tens of tasks per second forever. Having been in the hot seat for one of these I will always choose a real queue technology over a database when possible.
It's more like a few thousand per second, and enqueues win, not dequeues like you say... on very small hardware without tuning. If you're at tens of tasks per second, you have a whole lot of breathing room: don't build for 100x current requirements.
https://chbussler.medium.com/implementing-queues-in-postgres...
> eventually your dequeue queries will throttle each other's locks
This doesn't really make sense to me. To me, the main problem seems to be that you end up with having a lot of snapshots around.
This link is simply raw enqueue/dequeue performance. Factor in workers that perform work or execute remote calls and the numbers change. Also, I find when your jobs have high variance in times, performance degrades significantly.
> This doesn't really make sense to me. To me, the main problem seems to be that you end up with having a lot of snapshots around.
The dequeuer needs to know which tasks to "claim", so this requires some form of locking. Eventually this becomes a bottleneck.
> don't build for 100x current requirements
What happens if you get 100x traffic? Popularity spikes can do it, so can attacks. Is the answer to just accept data loss in those situations? Queue systems are super simple to use. I'm counting "NOTIFY/LISTEN" on Postgres as a queue, because it is a queue from the bottom up.
Oban's been great, especially if you pay for Web UI and Pro for the extra features [3]
The main issue we've noticed, though, is that because of its simple lock-based fetching mechanism, jobs aren't distributed evenly across your workers: the `SELECT...LIMIT X` is greedy [2]
If you have long-running and/or resource-intensive jobs, this can be problematic. Let's say you have 3 workers with a local limit of 10 per node. If there are only 10 jobs in the queue, the first node to fetch available jobs will grab and lock all 10, with the other 2 nodes sitting idle.
[1] https://github.com/sorentwo/oban [2] https://github.com/sorentwo/oban/blob/main/lib/oban/engines/... [3] https://getoban.pro/#feature-comparison
The nice thing about "boring" tech like Postgres is that it has great documentation. So just peruse https://www.postgresql.org/docs/current/sql-notify.html . No need for google-fu.
UPDATE ... SET status = 'locked' ... RETURNING message_id
Or you can just use an IMMEDIATE transaction, SELECT the next message ID to retrieve, and UPDATE the row.
On top of that, if you want to be extra safe, you can do:
UPDATE Queue SET status = 'locked' WHERE status = 'ready' AND message_id = '....'
To make sure that the message you are trying to retrieve hasn't already been locked by another worker.
[0]: https://github.com/litements/litequeue/
[1]: https://github.com/litements/litequeue/blob/3fece7aa9e9a31e4...
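A minimal sketch of that claim-then-verify pattern using Python's sqlite3 (the queue table and column names are placeholders, not litequeue's actual schema):

import sqlite3

conn = sqlite3.connect('queue.db', isolation_level=None)  # autocommit; we issue BEGIN ourselves

def claim_next():
    # BEGIN IMMEDIATE takes the write lock up front, so no other worker can
    # claim a message between our SELECT and our UPDATE.
    conn.execute('BEGIN IMMEDIATE')
    try:
        row = conn.execute(
            "SELECT message_id FROM queue WHERE status = 'ready' ORDER BY created LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute('COMMIT')
            return None
        cur = conn.execute(
            "UPDATE queue SET status = 'locked' WHERE status = 'ready' AND message_id = ?",
            (row[0],),
        )
        if cur.rowcount == 0:
            # the "extra safe" check: another worker locked it first
            conn.execute('COMMIT')
            return None
        conn.execute('COMMIT')
        return row[0]
    except Exception:
        conn.execute('ROLLBACK')
        raise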
https://www.postgresql.org/docs/current/transaction-iso.html...
Especially given the emphasis on YAGNI, you don’t need a UUID primary key, with all the problems it brings for B+trees (the thing an RDBMS is built on), nor do you need the collision resistance of SHA256 - the odds of you creating a dupe job hash with MD5 are vanishingly small.
As to the actual topic, it’s fine IFF you carefully monitor for accumulating dead tuples, and adjust auto-vacuum for that table as necessary. While not something you’d run into at the start, at a modest scale you may start to see issues. May. You may also opt to switch to Redis or something else before that point anyway.
EDIT: if you choose ULID, UUIDv7, or some other k-sortable key, the problem isn’t nearly as bad, but you still don’t need it in this situation. Save yourself 8 bytes per key.
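For what "adjust auto-vacuum for that table" can look like in practice, a hedged example (the table name and the numbers are placeholders; tune them against your observed dead-tuple churn):

import psycopg2

conn = psycopg2.connect('dbname=jobs')  # placeholder DSN
with conn, conn.cursor() as cur:
    # Trigger autovacuum after a fixed number of dead tuples instead of a
    # fraction of the table, and let it run without cost-based throttling.
    cur.execute("""
        ALTER TABLE jobs SET (
            autovacuum_vacuum_scale_factor = 0,
            autovacuum_vacuum_threshold    = 1000,
            autovacuum_vacuum_cost_delay   = 0
        );
    """)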
I’ve experimented with making this easier via libraries that provide high-level APIs for using Postgres as a queue and manage the schemas, listen/notify, etc for you: https://github.com/adriangb/pgjobq
I don't think that there's anything wrong with using a database as a queue; however, I think there probably could have been better ways to get the idea across than just dismissing an honest opinion as BS. I don't necessarily agree with all of what was said there, but at the same time I can see why those arguments would be reasonable: >>20022572
For example:
> Because it is hacky from the perspective of a distributed system architecture. It's coupling 2 components that probably ought not be coupled because it's perceived as "convenient" to do so. The idea that your system's control and data planes are tightly coupled is a dangerous one if your system grows quickly.
To me, this makes perfect sense, if you're using the same database instance for the typical RDBMS use case AND also for the queue. Then again, that could be avoided by having separate database instances/clusters and treating those as separate services: prod-app-database and prod-queue-database.
That said, using something like RabbitMQ or another specialized queue solution might also have the additional benefit of bunches of tutorials and libraries, as well as other resources available, which is pretty much the case whenever you have a well known and a more niche technology, even when the latter might be in some ways better! After all, there is a reason why many would use Sidekiq, resque, rq, Hangfire, asynq and other libraries that were mentioned and already have lots of content around them.
Though whether the inherent complexity of the system or the complexity of your code that's needed to integrate with it is more important, is probably highly situational.
Good thing I didn't listen to your advice... my DIY background task queue saved my website when Celery couldn't scale. Why are you against rolling your own task queue besides it seeming complicated?
https://wakatime.com/blog/56-building-a-background-task-queu...
https://github.com/wakatime/wakaq
We currently process ~20 million tasks per day, and I don't have to worry about running VACUUM on my queue ;)
TTL typically deletes expired items within a few days. Depending on the size and activity level of a table, the actual delete operation of an expired item can vary. Because TTL is meant to be a background process, the nature of the capacity used to expire and delete items via TTL is variable (but free of charge). [0]
Because of that limitation, I would not use that approach. Instead I would do Scheduled Lambdas to check for items every 15 minutes in a Serverless Aurora and then add them to SQS with delays.
I've had my eye on this problem for a few years and keep thinking that a simple SaaS that does one-shot scheduled actions would probably be a worthy side project. Not enough to build a company around, but maintenance would be low and there's probably some pricing that would attract enough customers to be sustainable.
[0] https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
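A sketch of the SQS half of that, assuming boto3 (the queue URL and item shape are placeholders); per-message delays cap at 900 seconds, which is why a 15-minute scan interval lines up:

import json

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/scheduled-actions'  # placeholder

def enqueue_due_item(item, seconds_until_due):
    # Items found by the periodic scan are pushed with a per-message delay;
    # anything due further out than 15 minutes waits for a later scan.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(item),
        DelaySeconds=min(max(int(seconds_until_due), 0), 900),
    )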
so pg_try_advisory_lock/pg_advisory_unlock can hold a lock across transactions while FOR UPDATE SKIP LOCKED can't; with SKIP LOCKED you would either need to keep a transaction open or use status + job_timeout columns (and in postgres you should not use long transactions)
basically we use c#, but we looked into https://github.com/que-rb/que which uses advisory locks. since our jobs take like 1 min to 2 hours it was a no-brainer to use advisory locks. it's just not the best thing if you have thousands of fast jobs per second, but for a more moderate queue where you have like 10000 jobs per minute/10 minutes/30 minutes and they take like 1 min to 2 hours it's fine.
we also do not delete jobs, we do not care about storage since the job table basically does not take up much space. and we have a lot of time to catch up at night since we are only in Europe
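roughly what the advisory-lock claim looks like with psycopg2 (a sketch; the jobs table, columns, and status values are made up, and que itself does this more carefully):

def claim_job(cur, candidates=20):
    # cur: a psycopg2 cursor on a dedicated worker connection (the lock is
    # tied to that session). grab a few candidate ids, then take the first
    # one whose advisory lock we win; session-level advisory locks survive
    # across transactions, so a worker can hold its claim for a 2-hour job
    # without keeping a transaction open.
    cur.execute(
        "SELECT id FROM jobs WHERE status = 'queued' ORDER BY created LIMIT %s",
        (candidates,),
    )
    for (job_id,) in cur.fetchall():
        cur.execute("SELECT pg_try_advisory_lock(%s)", (job_id,))
        if not cur.fetchone()[0]:
            continue  # another worker holds this one
        # re-check under the lock in case another worker already finished it
        cur.execute("SELECT 1 FROM jobs WHERE id = %s AND status = 'queued'", (job_id,))
        if cur.fetchone():
            return job_id
        cur.execute("SELECT pg_advisory_unlock(%s)", (job_id,))
    return None

def finish_job(cur, job_id, ok):
    cur.execute("UPDATE jobs SET status = %s WHERE id = %s",
                ('done' if ok else 'failed', job_id))
    # release the session-level lock once the job is finished
    cur.execute("SELECT pg_advisory_unlock(%s)", (job_id,))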
IDK maybe <1000 messages per minute
Not saying SKIP LOCKED can't work with that many. But you'll probably want to do something with lower overhead.
FWIW, Que uses advisory locks [1]
https://archive.org/download/completedictiona00falluoft
There's many uses in British literature of the 1800's, and a whole lot of uses in academic literature of the 70's to 80's. https://i.imgur.com/BhMv2nF.png "Disabuse" would fit into many of these slots, but not all.
Only common use now is RPG jargon; imbuing something with an attribute is something role playing nerds talk about, and it really needs an antonym.
I’m not confusing anything. I’ve seen random selection “job queues” implemented many times. As long as you truly don’t care about start order, it’s fine to trade it for increased throughout.