I've used this for tasks at big organizations without issue. No need for any special deployments or new infra. Just spin up a few worker threads in your app. Perhaps a thread to reset abandoned tasks. But in three years this never actually happened, as everything was contained in try/catch that would add it back to the queue, and our java app was damn stable.
Just curious. We maintained a custom background processing system for years but recently replaced it with off the shelf stuff, so I'm really interested in how others are doing similar stuff.
Our tasks were quick enough so that all fetched tasks would always be able to be completed before a scale down / new deploy etc, but we stopped fetching new ones when the signal came so it just finished what it had. I updated above, we did have logic to monitor if a task got taken but never got a finished status, but I can't remember it ever actually reporting on anything.
My approach was:
- Accept the inbound call
- Generate a 20 character random string (used as a signature)
- Execute a sql query that selects the oldest job without a signature and write the signature, return the primary key of the job that was updated.
- If it errors for any reason, loop back and attempt again, but only 10 times, as some underlying issue exists (10 collisions is statistically improbable for my use case)
- Read the primary key returned by that sql query and read it, comparing it's signature to my random one.
- If a hit, return the job to the caller
- If a miss, loop back and start again, incrementing attempts by 1.
The caller has to handle the possibility that a call to this web service won't return anything, either due to no jobs existing, or the collision/error threshold being reached.
In either case, the caller backs for it's configured time, then calls again.
Callers are usually in 'while true' loops, only existing if they get an external signal to close or an uncontrolled crash.
If you take this approach, you will have a function or a web service that converts the SQL table into a job queue service. When you do that, you can build metrics on the amount of collisions you get while trying to pull and assign jobs to workers.
I had inbuilt processes that would sweep through jobs that were assigned (had a job signature) and weren't marked as complete, it actioned those to handle the condition of a crashed worker.
There are many many other services the proper job queues offer, but that usually means more dependencies, and code libraries / containers, so just build in the functionality you need.
If it is accurate, fast enough, and stable, you've got the best solution for you.
/edited for formatting
Edit: To clarify, I mean `SELECT id WHERE used = 0` followed by `UPDATE ... SET used = 1 WHERE id = ... AND used = 0`
When grabbing a new message it selects "Available or (Locked with Updated timestamp older than configured timeout)". If successful it immediately tries to set the Locked status, Updated timestamp and bumps the Version counter, where the previous values of Status and Version has to match. If the update fails it retries getting a new message.
If the Version counter is too high, it moves the message to the associated dead-letter table, and retries getting a new message.
This isn't for high performance. I tested it and got 1000 messages/sec throughput with handful of producers and consumers against test db instance (limited hardware), which would be plenty for us.
I wrote it to be simple and so we could easily move to something AMPQ'ish like RabbitMQ or Azure Service Bus when needed. Overall quite easy to implement and has served us well so far.
Why would both transactions see `used = 0`? The DB server tries to isolate transactions and actively hides effects of other transactions that have not committed yet.
https://www.postgresql.org/docs/current/transaction-iso.html...
There is actually another possibility: there must be a way to check whether the receiving system has received the message. But this only works if there are no "rogue" senders.