> committed atomically
Complex billing systems don't work that way. Worker processes don't always fall into neat boxes of "done" or "not done", much as clean rollbacks are largely a developer myth. If a process were that trivial, you wouldn't need a queue and workers in the first place!
> Also I'd detect a worker has died by recording the start-time and using a timeout.
There are many ways to solve this and many ways to get it wrong. Not working in UTC? Oops, better hope nothing runs during the daylight saving time changeover. Parent process died but the worker finished the job? Let's hope the parent isn't responsible for updating the job completion status. Large job is borderline on the timeout? Better hope the parent process doesn't restart the job while the worker is still working on it. Network partition? Uh oh, the CAP theorem says you're out of luck there (and typically there is at least one network hop between the DB server/controlling process and the system running the workers).
Probably the more straightforward solution is to give each worker an ID and have it update the database with the job it picks up. Then something robust, like systemd, monitors and restarts workers when they fail. When a worker starts, it finds any jobs where table.worker_id = myId and picks those back up (a rough sketch below). But you still have network partitions to worry about. Again, not at all trivial.
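Roughly what that claim-by-worker-id idea could look like, sketched with Python's standard-library sqlite3 just to keep it self-contained; the jobs table, its columns, the status names and the WORKER_ID scheme are all invented for illustration (on Postgres you'd more likely reach for SELECT ... FOR UPDATE SKIP LOCKED rather than an explicit lock-then-update):

```python
import os
import sqlite3

# Hypothetical: a stable per-worker ID, e.g. derived from hostname + systemd unit.
WORKER_ID = os.environ.get("WORKER_ID", "worker-1")

conn = sqlite3.connect("queue.db", isolation_level=None)  # autocommit; we issue BEGIN/COMMIT ourselves
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    id        INTEGER PRIMARY KEY,
    status    TEXT NOT NULL DEFAULT 'new',    -- e.g. 'new' | 'running' | 'done'
    worker_id TEXT,                           -- which worker claimed it, if any
    parent_id INTEGER,                        -- set on retry children of a stalled job
    retries   INTEGER NOT NULL DEFAULT 0,     -- how many times it has been re-queued
    payload   TEXT
)""")

def recover_my_jobs():
    # On startup (e.g. after systemd restarts us), resume anything we had already claimed.
    return [row[0] for row in conn.execute(
        "SELECT id FROM jobs WHERE worker_id = ? AND status = 'running'",
        (WORKER_ID,))]

def claim_one_job():
    # BEGIN IMMEDIATE takes the write lock up front, so the SELECT + UPDATE pair
    # is atomic with respect to other workers on the same database.
    conn.execute("BEGIN IMMEDIATE")
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'new' LIMIT 1").fetchone()
    if row is None:
        conn.execute("COMMIT")
        return None
    conn.execute(
        "UPDATE jobs SET status = 'running', worker_id = ? WHERE id = ?",
        (WORKER_ID, row[0]))
    conn.execute("COMMIT")
    return row[0]
```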
The database records the start time, in the database's system time, in UTC.
The parent process can't update the job status; it can only cancel jobs.
There's a pool of workers. Each worker may fetch a pending request or wait. Once it has fetched a request, which marks the request as pending, no other worker can fetch it. Only this worker may alter the status of its request, and it may only set it to completed, and only with its results.
The worker periodically writes its status to a separate table. If that status hasn't been written in a timely manner, the worker is assumed to have stalled and a new child request is created, available to other workers.
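A hedged sketch of that heartbeat-and-stall check, reusing the hypothetical jobs table from the sketch above plus an assumed worker_status table and a made-up stall threshold; the timestamps come from the database's own clock (SQLite's CURRENT_TIMESTAMP is UTC), in line with the point about start times:

```python
import sqlite3

STALL_SECONDS = 300  # assumed threshold; would be tuned per workload
conn = sqlite3.connect("queue.db", isolation_level=None)
conn.execute("""CREATE TABLE IF NOT EXISTS worker_status (
    worker_id TEXT PRIMARY KEY,
    last_seen TEXT NOT NULL
)""")

def heartbeat(worker_id):
    # The worker periodically upserts "I'm alive", stamped with the database's own UTC clock.
    conn.execute(
        """INSERT INTO worker_status (worker_id, last_seen)
           VALUES (?, CURRENT_TIMESTAMP)
           ON CONFLICT(worker_id) DO UPDATE SET last_seen = CURRENT_TIMESTAMP""",
        (worker_id,))

def requeue_stalled():
    # Run periodically (by anyone): every running job whose owner hasn't checked in
    # recently gets a fresh child job other workers can pick up; the original row is
    # flagged and kept around for auditing.
    cutoff = "-%d seconds" % STALL_SECONDS
    conn.execute("BEGIN IMMEDIATE")
    conn.execute(
        """INSERT INTO jobs (status, payload, parent_id)
           SELECT 'new', j.payload, j.id
           FROM jobs j
           JOIN worker_status w ON w.worker_id = j.worker_id
           WHERE j.status = 'running'
             AND w.last_seen < DATETIME('now', ?)""",
        (cutoff,))
    conn.execute(
        """UPDATE jobs SET status = 'stalled'
           WHERE status = 'running'
             AND worker_id IN (SELECT worker_id FROM worker_status
                               WHERE last_seen < DATETIME('now', ?))""",
        (cutoff,))
    conn.execute("COMMIT")
```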
Putting this logic into the database, instead of a parent process managing a shared concurrent queue, has been by far the more pleasant solution. Granted, we're not talking large numbers of concurrent requests: ten as opposed to ten thousand. If we're talking hundreds of concurrent, short requests, I can see this starting to become problematic.
I think there's a huge difference in the needs of different implementations of queues and workers, so I'm interested to hear more about workloads where this approach fails, and specifically in which regards. In my experience, it's been pleasant.
Right, it doesn't have to be. Blockchain tech such as Bitcoin is effectively this sort of thing. The idealized queue worker would be a pure, referentially transparent function that crunches some data and returns an output. More difficult to write but still acceptable: idempotent workers.
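The idempotent flavour usually comes down to keying the result write on the job id, so running the same job twice converges on the same state instead of duplicating side effects. A rough sketch, assuming a hypothetical results table with job_id as its primary key (the upsert syntax needs SQLite 3.24+ or Postgres):

```python
import sqlite3

def record_result(conn: sqlite3.Connection, job_id: int, result: str) -> None:
    # Idempotent completion: the result insert is keyed by job_id, so if a retried
    # copy of this job already finished, this becomes a no-op instead of a duplicate.
    conn.execute(
        """INSERT INTO results (job_id, payload) VALUES (?, ?)
           ON CONFLICT(job_id) DO NOTHING""",
        (job_id, result))
    conn.execute(
        "UPDATE jobs SET status = 'done' WHERE id = ? AND status = 'running'",
        (job_id,))
```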
> the worker is assumed to be stalled and a new child request is created
This may work for your use case, but I'd be worried about an infinite process loop here, where a problematic job continually stalls its child process and gets entered back into the queue forever. There needs to be logic to raise an exception and say "hey, take a look at this job... something may be wrong with it" and then put that job in a holding area. This is also why I'm a big fan of Erlang, which has all of this included, batteries and all, in the form of supervisors (they really were 20+ years ahead of their time on this stuff).
> I think there's a huge difference in the needs of different implementations of queues
Yes, and people often aren't sure what their needs are. I've worked with RabbitMQ on a system that almost certainly didn't need it. And I've worked on systems that were nothing more than a file that lands via FTP in a folder and is moved (atomically, via /bin/mv) to indicate the job is pending. Using a file system as a queue is another interesting take on this sort of thing. No database even needed.
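A tiny sketch of that filesystem-as-queue pattern, with hypothetical incoming/, pending/ and done/ directories; the only guarantee being leaned on is that rename(2) is atomic within a single filesystem:

```python
import os

INCOMING, PENDING, DONE = "incoming", "pending", "done"

def claim_next_file():
    # rename(2) is atomic on the same filesystem, so exactly one worker wins each file;
    # everyone else sees the source vanish (OSError) and moves on to the next candidate.
    for name in sorted(os.listdir(INCOMING)):
        try:
            os.rename(os.path.join(INCOMING, name), os.path.join(PENDING, name))
            return os.path.join(PENDING, name)   # this worker now owns the job
        except OSError:
            continue
    return None

def mark_done(path):
    os.rename(path, os.path.join(DONE, os.path.basename(path)))
```

Of course this has the same "worker died while its file sat in pending/" problem the rest of the thread is about; one common patch is sweeping pending/ for files older than some threshold by mtime.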
I mitigate this in several ways. First, only one worker can work on a job or child job, so at worst one worker is stuck in a loop. There's also a maximum number of retries, so a job is never retried infinitely. Increasing the stall time per request would be trivial, but it's never become an issue. Monitoring potential infinite jobs is as simple as querying for jobs that reach the limit (rough sketch below).
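Collapsed into the single-table shape of the earlier sketches (the real child-request scheme would track this across rows, but the idea is the same), with a made-up retry cap and the hypothetical retries column from the first sketch:

```python
MAX_RETRIES = 5  # assumed cap, not a number from the comment above

def requeue_or_hold(conn, job_id):
    # Bump the retry counter; under the cap the job goes back to 'new' for any worker,
    # at the cap it's parked as 'held' so a human can look at it instead of it looping forever.
    conn.execute(
        """UPDATE jobs
           SET retries   = retries + 1,
               worker_id = NULL,
               status    = CASE WHEN retries + 1 >= ? THEN 'held' ELSE 'new' END
           WHERE id = ?""",
        (MAX_RETRIES, job_id))

def jobs_needing_attention(conn):
    # The monitoring query: anything that hit the cap and got parked.
    return conn.execute(
        "SELECT id, retries FROM jobs WHERE status = 'held'").fetchall()
```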
I'm sure Erlang or other actor-model implementations would handle this incredibly well at scale, but it seems to me that just doing it in the database works well enough for at least certain workloads.
As always, it comes down to a question of the right tool for the job. The advantage of the database is that it's most likely already in the stack. I don't doubt that there are situations where it's the wrong tool. That's why it's interesting to know at which point it stops being a good solution, so that one can make a good decision.