zlacker

[return to "Do you really need Redis? How to get away with just PostgreSQL"]
1. deckar+3d[view] [source] 2021-06-12 09:27:53
>>hyzyla+(OP)
I imagine most people using Redis as a queue were already using it as a cache and just needed some limited queuing ability. Much like how places end up using a DB as a queue.

Using a DB as a queue has been a thing for a very long time. Every billing system I've seen is a form of a queue: at a certain point in the month a process kicks off that scans the DB and bills customers, marking their record as "current".
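That scan-and-mark loop is roughly this (hypothetical schema, SQLite standing in for whatever DB you actually run — the pattern is the same):

```python
import sqlite3

# Hypothetical billing table; SQLite stands in for the production DB.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        balance_cents INTEGER NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'  -- flips to 'current' once billed
    )
""")
conn.executemany("INSERT INTO customers (balance_cents) VALUES (?)",
                 [(1000,), (2500,)])
conn.commit()

def bill(customer_id, amount_cents):
    # Placeholder for the actual charge (payment gateway, invoice, etc.).
    print(f"billed customer {customer_id}: {amount_cents} cents")

def run_billing(conn):
    """The batch job: scan for unbilled customers, bill each, mark 'current'."""
    pending = conn.execute(
        "SELECT id, balance_cents FROM customers WHERE status = 'pending'"
    ).fetchall()
    for customer_id, amount_cents in pending:
        bill(customer_id, amount_cents)
        conn.execute("UPDATE customers SET status = 'current' WHERE id = ?",
                     (customer_id,))
        conn.commit()  # crash window: billed, but not yet marked 'current'

run_billing(conn)
```

The gap between `bill()` and the commit is exactly where a dead worker turns into a double charge on the next run.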

The challenge is always going to be: what if the worker dies? What if the worker dies, the job is re-run, and the customer is billed twice? Thank god it's been many years since I've had to touch cron batch jobs or queue workers. The thought of leaving the office knowing some batch job is going to run at 3am and the next morning might be total chaos... shudder.

2. cerved+pp[view] [source] 2021-06-12 11:48:50
>>deckar+3d
How would double billing occur if the worker dies? The way I would design this, the billing request and bill would be committed atomically, such that a request can only be completed with exactly one associated bill. If the worker dies, no bill is created.
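A rough sketch of what I mean (made-up table names, SQLite as a stand-in — in Postgres this is just one transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE billing_requests (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'queued'   -- 'queued' or 'completed'
    );
    CREATE TABLE bills (
        id INTEGER PRIMARY KEY,
        request_id INTEGER NOT NULL UNIQUE REFERENCES billing_requests(id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO billing_requests DEFAULT VALUES;
""")

def complete_request(conn, request_id, amount_cents):
    # The bill row and the status flip commit together, or not at all.
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT INTO bills (request_id, amount_cents) VALUES (?, ?)",
            (request_id, amount_cents))
        conn.execute(
            "UPDATE billing_requests SET status = 'completed' WHERE id = ?",
            (request_id,))

complete_request(conn, 1, 999)
```

The UNIQUE constraint on `request_id` is the backstop: a retried request can never produce a second bill, because the second INSERT fails and the whole transaction rolls back.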

Also, I'd detect that a worker has died by recording the start time and using a timeout. Furthermore, I'd requeue requests as distinct new entities: a requeued entity would have a nullable self-referencing FK pointing at its parent request.

3. deckar+DY[view] [source] 2021-06-12 17:31:14
>>cerved+pp
Murphy's law says that you're going to screw this up any number of ways. Maybe not you, specifically, but perhaps your coworker.

> committed atomically

Complex billing systems don't work that way. Worker processes are not always in these neat boxes of "done" or "not done", much as clean rollbacks are a developer myth. If a process were that trivial, you wouldn't need a queue and workers in the first place!

> Also I'd detect a worker has died by recording the start-time and using a timeout.

There are many ways to solve this and many ways to get it wrong. Not working in UTC? Oops, better hope nothing runs during the daylight saving changeover. Parent process died but the worker finished the job? Let's hope the parent isn't responsible for updating the job completion status. Large job borderline on the timeout? Better hope the parent process doesn't restart the job while the worker is still working on it. Network partition? Uh oh. The CAP theorem says you're out of luck there (and typically there is at least one network hop between the DB server/controlling process and the system running the workers).

Probably the more straightforward solution is to give each worker an ID and let it update the database with the job it picks up. Then something robust, like systemd, would monitor and restart workers if they fail. When a worker starts, it finds any jobs where table.worker_id = myId and starts back on those. But you still have network partitions to worry about. Again, not at all trivial.
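Something like this (hypothetical table, SQLite standing in for the real DB):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        worker_id TEXT,                        -- NULL until a worker claims it
        status TEXT NOT NULL DEFAULT 'queued'  -- queued/running/done
    );
    INSERT INTO jobs DEFAULT VALUES;
    INSERT INTO jobs DEFAULT VALUES;
""")

def claim_next(conn, worker_id):
    """Claim one unclaimed job by stamping it with this worker's ID."""
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE worker_id IS NULL LIMIT 1").fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET worker_id = ?, status = 'running' "
                     "WHERE id = ?", (worker_id, row[0]))
        return row[0]

def recover(conn, worker_id):
    """On restart, find the jobs this worker claimed but never finished."""
    return [r[0] for r in conn.execute(
        "SELECT id FROM jobs WHERE worker_id = ? AND status = 'running'",
        (worker_id,))]

claimed = claim_next(conn, "worker-1")
# ...worker-1 crashes here, systemd restarts it...
resumed = recover(conn, "worker-1")  # picks the same job back up
```

On actual Postgres you'd want `SELECT ... FOR UPDATE SKIP LOCKED` to make the claim race-free across concurrent workers; the SELECT-then-UPDATE above has a window where two workers could grab the same job.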
