Also, you pass the data a job needs to run as part of the job payload. Then you don't have the "data doesn't exist" issue.
It's such a better model for the majority of queues. All you're doing is storing a message, hitting an HTTP endpoint, and deleting the message on success. This makes it so much easier to scale, reason about, and test task execution.
Update, since multiple people seem confused: I'm talking about the implementation of a job queue system, not suggesting that they use the GCP Tasks product. That said, I would have just used GCP Tasks too, assuming the use case dictated it; it's a fantastic and rock-solid product.
Wanting to offload heavy work to a background job is about as old a best practice as exists in modern software engineering.
This is especially important for the kind of API and/or web development that a large number of people on this site are involved in. By offloading expensive work, you take that work out-of-band of the request that generated it, making that request faster and providing a far superior user experience.
Example: User sign-up where you want to send a verification email. Talking to a foreign API like Mailgun might be a 100 ms to multisecond (worst case scenario) operation — why make the user wait on that? Instead, send it to the background, and give them a tight < 100 ms sign up experience that's so fast that for all intents and purposes, it feels instant.
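To make the sign-up example concrete, here's a minimal sketch (hypothetical names, and an in-process channel standing in for a real durable queue): the request handler only enqueues the work, and a background goroutine makes the slow Mailgun-style call.

package main

import (
    "context"
    "log"
    "net/http"
    "time"
)

// sendVerificationEmail stands in for the slow Mailgun call.
func sendVerificationEmail(ctx context.Context, email string) error {
    time.Sleep(500 * time.Millisecond) // simulated foreign-API latency
    return nil
}

// emailJobs is an in-process stand-in for a real durable queue.
var emailJobs = make(chan string, 1024)

func signUpHandler(w http.ResponseWriter, r *http.Request) {
    email := r.FormValue("email")
    // ... create the user record here ...
    emailJobs <- email                // enqueue instead of calling Mailgun in-band
    w.WriteHeader(http.StatusCreated) // responds in well under 100 ms
}

func main() {
    // Background worker drains the queue out-of-band of any request.
    go func() {
        for email := range emailJobs {
            if err := sendVerificationEmail(context.Background(), email); err != nil {
                log.Printf("verification email to %s failed: %v", email, err)
            }
        }
    }()
    http.HandleFunc("/signup", signUpHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}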
Passing around the job's data separately means that now you're storing two copies, which means you're creating a point where things can get out of sync.
We use this to ensure Kafka events are only emitted when a process succeeds; this is very similar.
The trouble with hitting an HTTP API to queue a task is: what if it fails, or what if you're not sure whether it failed? You can continue to retry in-band (although there's a definite latency disadvantage to doing so), but if you eventually give up, you can't be sure that no jobs were queued for which you never got a proper ack. In practice, this leads to a lot of uncertainty around the edges, and operators having to reconcile things manually.
There's definite scaling benefits to throwing tasks into Google's limitless compute power, but there's a lot of cases where a smaller, more correct queue is plenty of power, especially where Postgres is already the database of choice.
Good luck with a long running batch.
This is covered in the GCP Tasks documentation.
> There's definite scaling benefits to throwing tasks into Google's limitless compute power, but there's a lot of cases where a smaller, more correct queue is plenty of power, especially where Postgres is already the database of choice.
My post was talking about what I would implement if I was doing my own queue, as the authors were. Not about using GCP Tasks.
Again, I'm getting downvoted. The whole point of my comment isn't about using GCP Tasks; it's about what I would do if I were going to implement my own queue system, like the author did.
By the way, that 30-minute limitation can be worked around with checkpoints or by breaking the task up into smaller chunks, something that isn't a bad idea to do anyway. I've seen long-running tasks cause all sorts of downstream problems when they fail and then take forever to run again.
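To make that concrete, a sketch of the checkpoint/re-enqueue shape (hypothetical names; the per-chunk work and the enqueue call are passed in as callbacks rather than tied to any particular queue library):

package chunks

import "context"

// BackfillArgs is the job payload; Cursor is the checkpoint.
type BackfillArgs struct {
    Cursor int64 // last ID processed by the previous chunk
}

const chunkSize = 1000

// RunBackfill processes one bounded chunk and then re-enqueues itself with an
// updated cursor, so no single run has to fit inside a 30-minute (or any)
// execution limit, and a failure only retries the current chunk.
func RunBackfill(ctx context.Context, args BackfillArgs,
    processChunk func(ctx context.Context, cursor int64, limit int) (newCursor int64, done bool, err error),
    enqueue func(ctx context.Context, args BackfillArgs) error,
) error {
    newCursor, done, err := processChunk(ctx, args.Cursor, chunkSize)
    if err != nil {
        return err
    }
    if done {
        return nil
    }
    // Checkpoint by enqueueing the next chunk with the updated cursor.
    return enqueue(ctx, BackfillArgs{Cursor: newCursor})
}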
Benchmark: peaks at around 17,699 jobs/sec for one queue on one node. Probably covers most apps.
https://getoban.pro/articles/one-million-jobs-a-minute-with-...
Yes. I am intimately familiar with background jobs. In fact I've been using them long enough to know, without hesitation, that you don't use a relational database as your job queue.
Agreed. Which is why the design doesn't make any sense. Because in the scenario presented they're starting a job during a transaction.
I wonder if maybe you've limited yourself by assuming relational DBs only have features for relational data. That isn't the case now, and really hasn't been for quite some time.
No, we don't operate like that. Call me out when I'm wrong technically, but don't tell me that because someone is some sort of celebrity that I should cut them some slack.
Everything he pointed out is literally covered in the GCP Tasks documentation.
Transactional job queues have been a recurring theme throughout my career as a backend and distributed systems engineer at Heroku, Opendoor, and Mux. Despite the problems with non-transactional queues being well understood, I keep encountering the same issues over and over. I wrote a bit about them here in our docs: https://riverqueue.com/docs/transactional-enqueueing
Ultimately I want to help engineers be able to focus their time on building a reliable product, not chasing down distributed systems edge cases. I think most people underestimate just how far you can get with this model—most systems will never outgrow the scaling constraints and the rest are generally better off not worrying about these problems until they truly need to.
Please check out the website and docs for more info. We have a lot more coming but first we want to iron out the API design with the community and get some feedback on what features people are most excited for. https://riverqueue.com/
We've also had a lot of experience with other libraries like Que (https://github.com/que-rb/que) and Sidekiq (https://sidekiq.org/), which have certainly influenced us over the years.
You're being "called out" (ugh) incredibly politely mostly because you were being a bit rude; "tell me X without telling me" is just a bit unpleasant, and totally counterproductive.
> because someone is some sort of celebrity that I should cut them some slack.
No one mentioned a celebrity. You're not railing against the power of celebrity here; just a call for politeness.
> Everything he pointed out is literally covered in the GCP Tasks documentation.
Yes, e.g. as pitfalls.
They’re surprisingly easy to implement in plain SQL:
[1] https://taylor.town/pg-task
The nice thing about this implementation is that you can query within the same transaction window.
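In Go terms, the claim/work/complete cycle can all live inside that one transaction. A sketch with made-up table names (assuming database/sql and any registered Postgres driver), not taylor.town's actual schema:

package pgqueue

import (
    "context"
    "database/sql"
    "errors"
)

// WorkOne claims at most one due task, works it, and marks it done, all in a
// single transaction. If the worker crashes mid-job, the row lock is released
// on rollback and another worker can pick the task up.
func WorkOne(ctx context.Context, db *sql.DB) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op after a successful Commit

    var id int64
    var payload []byte
    err = tx.QueryRowContext(ctx, `
        SELECT id, payload
        FROM tasks
        WHERE done = false AND run_at <= now()
        ORDER BY run_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED`).Scan(&id, &payload)
    if errors.Is(err, sql.ErrNoRows) {
        return nil // nothing due right now
    } else if err != nil {
        return err
    }

    // ... do the actual work with payload here; returning early rolls the
    // transaction back and leaves the task queued for a retry ...

    if _, err := tx.ExecContext(ctx, `UPDATE tasks SET done = true WHERE id = $1`, id); err != nil {
        return err
    }
    return tx.Commit()
}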
The example on the home page makes this clear: a user is created and a job is enqueued at the same time. This ensures that the job is queued up along with the user creation. If any part of that initial transaction fails, then the job queuing doesn't actually happen.
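A sketch of what that looks like with plain SQL (made-up table names; River's homepage example does the equivalent through its client API): the user row and its job either commit together or not at all.

package signup

import (
    "context"
    "database/sql"
)

// SignUp inserts the user and their welcome-email job in the same
// transaction, so there's never a job without its user or a user without
// their job.
func SignUp(ctx context.Context, db *sql.DB, email string) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    var userID int64
    if err := tx.QueryRowContext(ctx,
        `INSERT INTO users (email) VALUES ($1) RETURNING id`, email).Scan(&userID); err != nil {
        return err
    }

    if _, err := tx.ExecContext(ctx,
        `INSERT INTO jobs (kind, args) VALUES ('welcome_email', jsonb_build_object('user_id', $1::bigint))`,
        userID); err != nil {
        return err
    }

    // Both rows become visible together, or not at all.
    return tx.Commit()
}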
Just skimming the docs, can you add a job directly via the DB? So a native trigger could add a job in? Or does it have to go via a client?
I'd be curious to compare performances once you guys are comfortable with that, we do them openly and everyday on: https://github.com/windmill-labs/windmill/tree/benchmarks
I wasn't aware of the skip B-tree splits and the REINDEX CONCURRENTLY tricks. But I'm curious: what do you index in your jobs that uses those? We mostly rely on the tag/queue_name (which has a small cardinality), scheduled_for, and a running boolean, which don't seem like a good fit for B-trees.
The request to get a message returns a token that identifies this receive.
You use that token to delete the message when you are done.
Jobs that don’t succeed after N retries get marked as dead and go into the dead letter list.
This is the way AWS SQS works; it’s tried and true.
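Roughly what that flow looks like with the AWS SDK for Go v2, to the best of my recollection of its API (queue URL and the handle callback are placeholders): a received message carries a ReceiptHandle, and only an explicit DeleteMessage after successful processing removes it; otherwise it reappears after the visibility timeout and is eventually dead-lettered per the queue's redrive policy.

package sqsworker

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/sqs"
)

// PollOnce receives a batch, processes each message, and deletes only the
// ones that were handled successfully.
func PollOnce(ctx context.Context, client *sqs.Client, queueURL string,
    handle func(ctx context.Context, body string) error) error {

    out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
        QueueUrl:            aws.String(queueURL),
        MaxNumberOfMessages: 10,
        WaitTimeSeconds:     20, // long poll
    })
    if err != nil {
        return err
    }

    for _, msg := range out.Messages {
        if err := handle(ctx, aws.ToString(msg.Body)); err != nil {
            continue // no delete: SQS redelivers, then dead-letters after N receives
        }
        if _, err := client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
            QueueUrl:      aws.String(queueURL),
            ReceiptHandle: msg.ReceiptHandle,
        }); err != nil {
            return err
        }
    }
    return nil
}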
Personally, I need long running jobs.
What’s the goal for the project? Is it to be commercial? If so, you face massive headwinds, because it’s so incredibly easy to implement a queue now.
I'm also very familiar with jobs and I have used the usual tools like Redis and RMQ, but I wouldn't make a blanket statement like that. There are people using RDBMSes as queues in prod, so we have some counter-examples. I wouldn't mind at all getting rid of another system (not just one server, but the cluster of RMQ/Redis you need for HA). If there's a big risk in using Postgres as the backend for a task queue, I'm all ears.
e.g.,
1. Application starts a transaction
2. Application updates DB state (business details)
3. Application enqueues a job in Redis
4. Redis job workers pick up the job
5. Redis job workers error out
6. Application commits the transaction
This motivates placing the job queue state in the same transaction, whereas non-DB-based job queues are prone to issues like this.
> We're hard at work on more advanced features including a self-hosted web interface. Sign up to get updates on our progress.
- HTTP libraries
- webservers
- application servers
- load balancers
- reverse proxy servers
- the cloud platform you're running on
- WAFs
It might be alright for smaller "tasks", but not for "jobs".

Thank you for this work; I look forward to taking it for a (real) test drive!
* neoq: https://github.com/acaloiaro/neoq
* gue: https://github.com/vgarvardt/gue
Neoq is new and we found it to have some features (like scheduling tasks) that were attractive. The maintainer has also been responsive to fixing our bug reports and addressing our concerns as we try it out.
Gue has been around for a while and is probably serving its users well.
Looking forward to trying out River now. I do wonder if neoq and river might be better off joining forces.
You said back then that you planned on pursuing a Go client; now, four years later, here we are. River looks excellent, and the blog post does a fantastic job explaining all the benefits of job queues in Postgres.
Oban just went the opposite way, removing the use of database triggers for insert notifications and moving them into the application layer instead[1]. Between the prevalence of poolers like PgBouncer, which prevent NOTIFY from ever firing, and the extra DB load of trigger handling, it wasn't worth it.
[1]: https://github.com/sorentwo/oban/commit/7688651446a76d766f39...
I am a bit confused by the choice of the LGPL 3.0 license. It requires one to dynamically link the library to avoid GPL's virality, but in a language like Go that statically links everything, it becomes impossible to satisfy the requirements of the license, unless we ignore what it says and focus just on its spirit. I see that was discussed previously by the community in posts such as these [1][2][3].
I am assuming that bgentry and brandur have strong thoughts on the topic since they avoided the default Go license choice of BSD/MIT, so I'd love to hear more.
[1] https://www.makeworld.space/2021/01/lgpl_go.html [2] https://golang-nuts.narkive.com/41XkIlzJ/go-lgpl-and-static-... [3] https://softwareengineering.stackexchange.com/questions/1790...
Starting with the project's tagline, "Robust job processing in Elixir", let's see what else:
- The same job states, including the British spelling for `cancelled`
- Snoozing and cancelling jobs inline
- The prioritization system
- Tracking where jobs were attempted in an attempted_by column
- Storing a list of errors inline on the job
- The same check constraints and the same compound indexes
- Almost the entire table schema, really
- Unique jobs with the exact same option names
- Table-backed leadership election
Please give some credit where it's due.

I'll try to work this into the higher-level docs website later today with an example :)
It's particularly suited to use cases such as background jobs, workflows, or other operations that occur within your application, and it scales well enough for what 99.9999% of us will be doing.
Some of what you've mentioned are cases where we surveyed a variety of our favorite job engines and concluded that we thought Oban's way was superior, whereas for others we cycled through a few different implementations before apparently landing in a similar place. I'm not quite sure what to say on the spelling of "cancelled" though; I've always written it that way and can't help but read "canceled" like "concealed" in my head :)
As I think I mentioned when we first chatted years ago, this has been a hobby interest of mine for many years, so when a new database queue library pops up I tend to go see how it works. We've been in a bit of a mad dash trying to get this ready for release and didn't even think about crediting the projects that inspired us, but I'll sync with Brandur and make sure we figure out the right way to do that.
I really appreciate you raising your concerns here and would love to talk further if you'd like. I just sent you an email to reconnect.
- Use FOR NO KEY UPDATE instead of FOR UPDATE so you don't block inserts into tables that have a foreign key relationship to the job table (quick sketch below). [1]
- We parallelize workers by tenant_id but process a single tenant sequentially. I didn't see anything in the docs about that use case; it might be worth some design time.
[1]: https://www.migops.com/blog/select-for-update-and-its-behavi...
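A minimal sketch of that first suggestion (made-up table and column names, not River's actual schema):

package queue

// The only change from a typical FOR UPDATE dequeue is the locking clause:
// FOR NO KEY UPDATE still prevents two workers from claiming the same row,
// but it doesn't conflict with the KEY SHARE locks taken when other tables
// insert rows referencing jobs.id via a foreign key.
const claimJobSQL = `
SELECT id, args
FROM jobs
WHERE state = 'available'
ORDER BY priority, id
LIMIT 1
FOR NO KEY UPDATE SKIP LOCKED`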
JFC. One line of code you don't even have to test.
For our particular use case, I think we're actually not using notify events. We just insert rows into the outbox table, and the poller re-emits them as Kafka events and deletes successfully emitted rows from the table.
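For anyone curious, that poller loop is only a few lines. A sketch with a hypothetical outbox schema, the Kafka producer abstracted behind a publish callback:

package outbox

import (
    "context"
    "database/sql"
)

// Drain reads a batch of committed outbox rows, emits each one, and deletes
// only the rows that were published successfully. Anything that fails stays
// in the table and is retried on the next tick.
func Drain(ctx context.Context, db *sql.DB, publish func(ctx context.Context, payload []byte) error) error {
    rows, err := db.QueryContext(ctx, `SELECT id, payload FROM outbox ORDER BY id LIMIT 100`)
    if err != nil {
        return err
    }
    defer rows.Close()

    type event struct {
        id      int64
        payload []byte
    }
    var batch []event
    for rows.Next() {
        var e event
        if err := rows.Scan(&e.id, &e.payload); err != nil {
            return err
        }
        batch = append(batch, e)
    }
    if err := rows.Err(); err != nil {
        return err
    }

    for _, e := range batch {
        if err := publish(ctx, e.payload); err != nil {
            return err // row stays in the outbox and is retried later
        }
        if _, err := db.ExecContext(ctx, `DELETE FROM outbox WHERE id = $1`, e.id); err != nil {
            return err
        }
    }
    return nil
}

Note this is at-least-once: a crash between publish and delete means the event is emitted again, so consumers need to tolerate duplicates.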
Unlike PGMQ (https://github.com/tembo-io/pgmq), a project we've been working on at Tembo, many queue projects still require you to run and manage a process external to the database, like a background worker. Or they ship as a client library and live in your application, which limits the languages you can choose to work with. PGMQ is a pure SQL API, so any language that can connect to Postgres can use it.
I don't see Sidekiq credited on the main page of Oban.
While there is no overlap in technology or structure with Sidekiq, the original Oban announcement on the ElixirForum mentions it along with all of the direct influences:
https://elixirforum.com/t/oban-reliable-and-observable-job-p...
A few years ago I wrote my own in house distributed job queue and scheduler in Go on top of Postgres and would have been very happy if a library like this had existed before.
The two really are a great pair for this use case for most small-to-medium-scale applications, and it's awesome to see someone putting a library out there publicly doing it - great job!!
A job queue might just be the tip of the use-case iceberg... isn't it?
In the end it's pub/sub; I use nats.io workers for this.
Ah, I just read a few comments along the same lines down below.
Along the lines of:
_, err := river.Execute(context.Background(), j) // Enqueue the job, and wait for completion
if err != nil {
    log.Fatalf("Unable to execute job: %s", err)
}
log.Printf("Job completed")
Does that make sense?

Adding temporal.io means introducing a third component. More components usually mean more complexity. More complexity means it's more difficult to test, develop, debug, and deploy.
As with everything, it's all about tradeoffs.
I plan to use Orleans to handle a lot of the heavy HA/scale lifting. It can likely stand in for Redis in a lot of cache use cases (in some non-obvious ways), and I'm anticipating writing a Postgres stream provider for it when the time comes. I'll likely end up writing a Postgres job queue as well, so I'll definitely check out River for inspiration.
A lot of Postgres drivers, including the de facto .NET standard Npgsql, support logical decoding these days, which unlocks a ton of exciting use cases and patterns via log processing.
"Cancelled" pairs nicely with "cancellation", "canceled" can be typed nicely without any repeated finger use on QWERTY, and both clearly mean the same thing without being confused with anything else... I say let the battles begin, and may the best speling win.
Referer pains me though.
One solution is the outbox pattern:
https://microservices.io/patterns/data/transactional-outbox....
In theory, an append-only and/or HOT (heap-only tuple) update strategy leaning on Postgres just ripping through moderately sized in-memory lists could be incredibly fast. The design would be more complicated and perhaps use-case dependent, but I bet it could be done.
River's professional looking website makes me think there are more commercial ambitions behind River than neoq, but maybe I'm wrong. I do think they're very similar options, but neoq has no commercial ambitions. I simply set out to create a good, open source, queue-agnostic solution for Go users.
I'm actually switching its license to BSD or MIT this week to highlight that fact.
It's probably more correct to say that the engineering effort required to make a Postgres-as-a-queue scale horizontally is a lot more than the engineering effort required to make a dedicated queueing service scale horizontally. The trade-off is that you're typically going to have to start scaling horizontally much sooner with your dedicated queuing service than with a Postgres database.
The argument for Postgres-as-a-queue is that you might be able to get to market much quicker, which can be significantly more important than how well you can scale down the track.