1. brandu+(OP)[view] [source] 2017-09-20 16:07:31
(Author here.)

I've taken fire before for suggesting that any job should go into a database, but when you're using this sort of pattern with an ACID-compliant store like Postgres it is so convenient. Jobs stay invisible until they're committed with other data and ready to be worked. Transactions that rollback discard jobs along with everything else. You avoid so many edge cases and gain so much in terms of correctness and reliability.

Worker contention while locking can cause a variety of bad operational problems for a job queue that's put directly in a database (as with the likes of delayed_job, Que, and queue_classic). The idea of staging the jobs first is meant as a compromise: all the benefits of transactional isolation, but with significantly less operational trouble, at the cost of only slightly delayed jobs as an enqueuer moves them out of the database and into a job queue.
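
For anyone who hasn't read the post, the shape of it is roughly the following. The `staged_jobs` table name and the signup example are just illustrative here; `job_name` and `job_args` are the columns the sample enqueuer reads.

    -- staging table that lives alongside your application data
    CREATE TABLE staged_jobs (
        id       BIGSERIAL PRIMARY KEY,
        job_name TEXT  NOT NULL,
        job_args JSONB NOT NULL
    );

    -- the job is inserted in the same transaction as the data it acts on,
    -- so a rollback discards it along with everything else
    BEGIN;
    INSERT INTO users (email) VALUES ('jane@example.com');
    INSERT INTO staged_jobs (job_name, job_args)
        VALUES ('WelcomeEmailWorker', '{"email": "jane@example.com"}');
    COMMIT;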

I'd be curious to hear what people think.

replies(4): >>koolba+s3 >>thepti+ai >>geeio+yi >>troyk+g71
2. koolba+s3[view] [source] 2017-09-20 16:23:47
>>brandu+(OP)
> I've taken fire before for suggesting that any job should go into a database, but when you're using this sort of pattern with an ACID-compliant store like Postgres it is so convenient.

+1 to in-database queues that are implemented correctly. The sanity of transactionally consistent enqueueing alone is worth it. I've used similar patterns as a staging area for many years.

This allows for transactionally consistent error handling as well. If a job is repeatedly failing, you can transactionally remove it from the main queue and add it to a dead-letter queue.
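
Something like this, say, assuming a `jobs` table and a `dead_letter_jobs` table with matching columns (the table names and the id are just for illustration):

    BEGIN;
    INSERT INTO dead_letter_jobs (job_name, job_args, last_error)
        SELECT job_name, job_args, last_error
          FROM jobs
         WHERE id = 42;
    DELETE FROM jobs WHERE id = 42;
    COMMIT;  -- the job moves atomically: it can't be lost or end up in both queues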

replies(1): >>brandu+F5
3. brandu+F5[view] [source] [discussion] 2017-09-20 16:36:32
>>koolba+s3
> This also allows for transactionally consistent error handling as well. If a job is repeatedly failing you can transactionally remove it from the main queue and add it to a dead letter queue.

Totally. This also leads to other operational tricks that you hope you never need, but that are great the day you do. For example, a bad deploy queues a bunch of jobs with invalid arguments that will never succeed. You can open a transaction and fix them in bulk using an `UPDATE` with Postgres' jsonb selection and manipulation operators. You can then even issue a `SELECT` to make sure that things look right before running `COMMIT`.
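
Roughly the kind of session I mean, with made-up worker and argument names (in reality you'd tailor the `WHERE` clause to whatever the bad deploy broke):

    BEGIN;

    -- backfill an argument that a bad deploy left out
    UPDATE staged_jobs
       SET job_args = jsonb_set(job_args, '{plan_id}', '"basic"')
     WHERE job_name = 'SubscriptionWorker'
       AND job_args->>'plan_id' IS NULL;

    -- eyeball a few rows before committing
    SELECT id, job_args
      FROM staged_jobs
     WHERE job_name = 'SubscriptionWorker'
     LIMIT 10;

    COMMIT;  -- or ROLLBACK if it doesn't look right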

Again, something that you hope no one ever does in production, but a life saver in an emergency.

4. thepti+ai[view] [source] 2017-09-20 17:51:13
>>brandu+(OP)
Perhaps I'm missing something here, but in your example:

> Sidekiq.enqueue(job.job_name, *job.job_args)

You're doing all your enqueueing in a transaction, so if any enqueue call fails (e.g. a network error) you'll break the transaction and requeue all of your jobs (even those that were successfully delivered).

Given that you're lobbing the job outside of the DB transaction boundary, why have that transaction at all? It's not clear to me why all the jobs should share the same fate.

If you want at-least-once message delivery, can't you configure that in your queue? (For example, RabbitMQ supports both modes: only ACKing after the task completes, or ACKing as soon as the worker dequeues the message.)

I'm not familiar with Sidekiq, so maybe that's not an option there. But in that case, it's still not clear why you'd requeue all the tasks if one of them fails to be enqueued (or the delete fails); you could just decline to delete the row for the individual job that failed.

replies(1): >>brandu+Ww
5. geeio+yi[view] [source] 2017-09-20 17:53:26
>>brandu+(OP)
I do the same thing. Small projects start with the job queue in postgres.

As things eventually scale up, I move the queue to its own dedicated postgres node.

Once that starts to be too slow, I finally move to redis/kafka. 99% of things never make it to this stage.

6. brandu+Ww[view] [source] [discussion] 2017-09-20 19:29:39
>>thepti+ai
> Given that you're lobbing the job outside of the DB transaction boundary, why have that transaction at all? It's not clear to me why all the jobs should share the same fate.

Good question. I should have probably put a comment to annotate that.

The transaction is there purely to wrap the `SELECT` statement and then, later, the `DELETE`. If the process fails midway through, it will restart, and you will get doubled-up jobs inserted into your queue. This isn't very desirable, but it's better than the alternative, which is to `DELETE` the jobs too early and then lose them when your process dies.
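
In SQL terms, each iteration of the enqueuer looks roughly like this. It's only a sketch: a real implementation would delete exactly the ids it selected rather than repeating the subquery.

    BEGIN;

    -- grab a batch of staged jobs
    SELECT id, job_name, job_args
      FROM staged_jobs
     ORDER BY id
     LIMIT 1000;

    -- the selected rows get pushed to Sidekiq here, outside the database;
    -- if the process dies at this point, nothing has been deleted yet, so
    -- the same batch is re-pushed (at-least-once) after a restart

    DELETE FROM staged_jobs
     WHERE id IN (SELECT id FROM staged_jobs ORDER BY id LIMIT 1000);

    COMMIT;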

> If you want at-least once message delivery, can't you configure that in your queue? (For example, RabbitMQ supports both modes; only ACKing after the task completes, or ACKing as soon as the worker dequeues the message).

Your queue will also be configured to do this (i.e. in the case of Sidekiq, it won't fully free a job until it's been confirmed to have succeeded or failed on a worker), but you basically need to have at-least-once delivery between any two systems. So the enqueuer will hand jobs off to a queue with this guarantee, and the queue will then hand them off to its own workers.

> I'm not familiar with Sidekiq, so maybe that's not an option there. But in that case, it's still not clear why you'd requeue all the tasks if one of them fails to be enqueued (or the delte fails); you could just decline to delete the row for the individual job that failed.

Yeah, keep in mind that this is a pretty naive code sample for purposes of simplicity. You can and should be more clever about it, like you suggest, if you're putting it in production. However, in practice, enqueuing a job is likely to have a very high success rate, so this simple worker will probably do the trick in most cases.

7. troyk+g71[view] [source] 2017-09-21 00:31:39
>>brandu+(OP)
We do this with great success in a small SMS/email application doing ~2 million jobs a day (mostly over a 6-hour peak period).

Except we use PostgreSQL's LISTEN/NOTIFY, which amazingly is also transaction-aware (NOTIFY does not happen until the transaction is committed and, even more amazingly, sorta de-dupes itself!), to move the jobs from the database to the message queue.

This way we never lock the queue table. We run a simple Go program on the PostgreSQL server that LISTENs, queries, pushes to Redis, and then deletes with a 0.25-second delay, so we group the I/O instead of processing each row individually.

This also allowed us to create jobs via INSERT ... SELECT, which is awesome when you're creating 50k jobs at a time.
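
For anyone curious about the NOTIFY side, one way to wire it up is a statement-level trigger, roughly like the sketch below (table, channel, and worker names are illustrative, and this isn't necessarily exactly our setup; the Go listener that LISTENs, batches, and pushes to Redis isn't shown):

    CREATE FUNCTION notify_new_jobs() RETURNS trigger AS $$
    BEGIN
        -- delivered only when the enclosing transaction commits; identical
        -- notifications within one transaction get collapsed by Postgres
        NOTIFY new_jobs;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER jobs_notify
        AFTER INSERT ON jobs
        FOR EACH STATEMENT
        EXECUTE PROCEDURE notify_new_jobs();

    -- bulk creation stays one set-based statement
    INSERT INTO jobs (job_name, job_args)
         SELECT 'SmsWorker', jsonb_build_object('user_id', id)
           FROM users
          WHERE sms_opt_in;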
