zlacker

I've done even simpler without locks (as no transaction logic), where I select a row, and then try to update a field about it being taken. If 1 row is affected, it's mine. If 0, someone else did it before me and I select a new row.
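A minimal sketch of that select-then-update loop, using Python's built-in sqlite3 so it runs anywhere. The `tasks` table, the `taken` column, and the function names are made up for illustration. One assumption to flag: the update carries an explicit `AND taken = 0` guard, because many drivers report matched rows rather than changed rows, so the bare "did the update change anything" check is not portable.

```python
import sqlite3


def make_queue(n_tasks):
    """Create an in-memory task table with n_tasks untaken rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, taken INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO tasks (taken) VALUES (?)", [(0,)] * n_tasks)
    conn.commit()
    return conn


def claim_next(conn):
    """Return the id of a task this worker now owns, or None if nothing is left."""
    while True:
        row = conn.execute(
            "SELECT id FROM tasks WHERE taken = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # queue is empty
        cur = conn.execute(
            "UPDATE tasks SET taken = 1 WHERE id = ? AND taken = 0", (row[0],)
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]  # one row affected: the task is ours
        # zero rows affected: someone else took it first, so select a new row
```

Each worker thread would call `claim_next` in a loop and do the work for whatever id comes back.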

I've used this for tasks at big organizations without issue. No need for any special deployments or new infra. Just spin up a few worker threads in your app. Perhaps a thread to reset abandoned tasks. But in three years this never actually happened, as everything was contained in try/catch that would add it back to the queue, and our java app was damn stable.

replies(7): >>andrel+D >>bushba+B1 >>mbb70+Qb >>klysm+Uf >>samus+jg >>valzam+Vn >>refibr+eo

>>matsem+(OP)
I guess you update it with the assigned worker id, where the "taken by" field is currently null? Does it mean that workers have persistent identities, something like an index? How do you deal with workers being replaced, scaled down, etc?

Just curious. We maintained a custom background processing system for years but recently replaced it with off the shelf stuff, so I'm really interested in how others are doing similar stuff.

replies(2): >>matsem+o1 >>calrai+0h

>>andrel+D
No, just update set taken=1. If it was a change to the row, you updated it. If it wasn't, someone updated before you.

Our tasks were quick enough so that all fetched tasks would always be able to be completed before a scale down / new deploy etc, but we stopped fetching new ones when the signal came so it just finished what it had. I updated above, we did have logic to monitor if a task got taken but never got a finished status, but I can't remember it ever actually reporting on anything.

replies(3): >>fbdab1+S1 >>fsnipe+34 >>SahAss+X4

>>matsem+(OP)
You could even use a timestamp to handle the case where a task was never finished by the worker that locked the row.

>>matsem+o1
I would set the taken field to a timestamp. Then you could have a cleanup job that looks for any lingering jobs aged past a reasonable timeout and null out the field.
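Roughly what such a cleanup job could look like, sketched with sqlite3. The `taken_at` and `done` columns and the function names are assumptions for the example; the comment above only says "null out the field", so the `done` check is added here to avoid resetting finished work.

```python
import sqlite3


def make_queue(n_tasks):
    """In-memory table where taken_at is NULL until a worker claims the task."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " taken_at REAL,"                    # claim time in epoch seconds, NULL = free
        " done INTEGER NOT NULL DEFAULT 0)"  # hypothetical completion flag
    )
    conn.executemany("INSERT INTO tasks (taken_at) VALUES (?)", [(None,)] * n_tasks)
    conn.commit()
    return conn


def release_stale(conn, now, timeout_s):
    """Free unfinished tasks claimed more than timeout_s ago; return how many were reset."""
    cur = conn.execute(
        "UPDATE tasks SET taken_at = NULL"
        " WHERE done = 0 AND taken_at IS NOT NULL AND taken_at < ?",
        (now - timeout_s,),
    )
    conn.commit()
    return cur.rowcount
```

The current time is passed in as `now` rather than read inside the function, which keeps the timeout logic easy to test.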

replies(3): >>tylerg+K2 >>Izkata+KD >>magica+O11

>>fbdab1+S1
It won't work with a timestamp, because each write will have an affected row count of 1 since the writes happen at different times. Setting a boolean is static.

replies(3): >>jayd16+h4 >>twic+39 >>AdamJa+ml

>>matsem+o1
You can combine this "update" with a "where taken = 0" to directly skip taken rows.

>>tylerg+K2
You can do something like UPDATE row SET timeout = NOW() WHERE NOW() - taskTimeout > row.timestamp. You're not stuck with comparing bools.

>>matsem+o1
That is the sort of thing that bites you hard when it bites. It might run perfectly for years but that one period of flappy downtime at a third party or slightly misconfigured DNS will bite you hard.

replies(1): >>matsem+m8

>>SahAss+X4
But compared to our rabbit setup where I work now, it was dead stable. No losing tasks or extra engineering effort on maintaining yet another piece of tech. Our rabbit cluster acting up has led to multiple disasters lately.

replies(1): >>SahAss+ca

>>tylerg+K2
update tasks set taken_timestamp = now() where task_id = ? and taken_timestamp is null
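To see why the `is null` guard matters, here is a small sqlite3 sketch of that statement. The table layout and function names are invented for the example. With the guard in place, a second claim attempt matches no rows even though its timestamp value differs.

```python
import sqlite3


def make_tasks(n_tasks):
    """In-memory tasks table with a nullable taken_timestamp (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, taken_timestamp REAL)"
    )
    conn.executemany(
        "INSERT INTO tasks (taken_timestamp) VALUES (?)", [(None,)] * n_tasks
    )
    conn.commit()
    return conn


def try_claim(conn, task_id, now):
    """Stamp the task only if nobody has stamped it yet; True means this caller won."""
    cur = conn.execute(
        "UPDATE tasks SET taken_timestamp = ?"
        " WHERE task_id = ? AND taken_timestamp IS NULL",
        (now, task_id),
    )
    conn.commit()
    return cur.rowcount == 1
```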

>>matsem+m8
Agreed, I've had my own rabbit nightmares. But setting up a more robust queue on postgresql is easy, so you can easily gain a lot more guarantees without more complexity.

>>matsem+(OP)
I've done the same with MongoDB with findOneAndModify, simple and solid

replies(2): >>ttfkam+2s >>dminor+Pv

>>matsem+(OP)
With what transaction isolation level?

>>matsem+(OP)
I recently got introduced to this system at work, and also built a new job using it. It works fine, but since I had to implement work stealing to deal with abandoned jobs in a timely manner, I wouldn't dare to use it for actions that absolutely must not happen twice.

replies(1): >>A1kmm+x11

>>andrel+D
I've done this successfully with a web service front that retrieves jobs to send to workers for processing, by using a SQL table queue. That web service ran without a hitch for a long time, serving about 10 to 50 job consumers for fast and highly concurrent queues.

My approach was:

- Accept the inbound call

- Generate a 20 character random string (used as a signature)

- Execute a sql query that selects the oldest job without a signature, writes the signature to it, and returns the primary key of the job that was updated.

- If it errors for any reason, loop back and attempt again, but only 10 times, as some underlying issue exists (10 collisions is statistically improbable for my use case)

- Take the primary key returned by that sql query, read that row, and compare its signature to my random one.

- If a hit, return the job to the caller

- If a miss, loop back and start again, incrementing attempts by 1.
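The steps above could be sketched like this with sqlite3. It is only an approximation of the described service. The `jobs` schema, the `fetch_job` name, and the constants are assumptions. Instead of having the update return the primary key, this version reads the row back by its signature, which checks the same thing.

```python
import secrets
import sqlite3
import string

ALPHABET = string.ascii_letters + string.digits
MAX_ATTEMPTS = 10  # give up after this many errors or misses


def make_jobs(payloads):
    """In-memory jobs table; signature is NULL until a job is handed out."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, created_at REAL,"
        " payload TEXT, signature TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs (created_at, payload) VALUES (?, ?)",
        [(float(i), p) for i, p in enumerate(payloads)],
    )
    conn.commit()
    return conn


def fetch_job(conn):
    """Return (id, payload) of a job assigned to this call, or None."""
    for _ in range(MAX_ATTEMPTS):
        signature = "".join(secrets.choice(ALPHABET) for _ in range(20))
        try:
            # Sign the oldest unsigned job, if there is one.
            conn.execute(
                "UPDATE jobs SET signature = ? WHERE id = ("
                " SELECT id FROM jobs WHERE signature IS NULL"
                " ORDER BY created_at, id LIMIT 1)"
                " AND signature IS NULL",
                (signature,),
            )
            conn.commit()
        except sqlite3.Error:
            conn.rollback()
            continue  # error: loop back and attempt again
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE signature = ?", (signature,)
        ).fetchone()
        if row is not None:
            return row  # hit: our signature is on the job
        unsigned = conn.execute(
            "SELECT 1 FROM jobs WHERE signature IS NULL LIMIT 1"
        ).fetchone()
        if unsigned is None:
            return None  # no jobs exist
        # miss: another caller signed it first, so try again
    return None  # error/collision threshold reached
```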

The caller has to handle the possibility that a call to this web service won't return anything, either due to no jobs existing, or the collision/error threshold being reached.

In either case, the caller backs off for its configured time, then calls again.

Callers are usually in 'while true' loops, only exiting if they get an external signal to close or an uncontrolled crash.

If you take this approach, you will have a function or a web service that converts the SQL table into a job queue service. When you do that, you can build metrics on the amount of collisions you get while trying to pull and assign jobs to workers.

I had inbuilt processes that would sweep through jobs that were assigned (had a job signature) but weren't marked as complete, and action those to handle the condition of a crashed worker.

There are many many other services the proper job queues offer, but that usually means more dependencies, and code libraries / containers, so just build in the functionality you need.

If it is accurate, fast enough, and stable, you've got the best solution for you.

/edited for formatting

>>tylerg+K2
update row set taken=true,taken_by=my_id,taken_at=now() where taken is false;

>>matsem+(OP)
The reason why you want to use skip locked is so that Postgres can automatically skip rows that are being concurrently accessed for updating the "status". You are right, if you update a "status" field you don't really need to worry about advisory locks and skipping rows that are locked but it still helps with performance if you have a decent amount of concurrent consumers polling the table.

>>matsem+(OP)
PSA: This is a read-modify-write pattern, thus it is not safe under concurrency unless a transaction isolation level of SERIALIZABLE is specified, or some locking mechanism is used (select for update etc).

replies(2): >>mjb+mx >>derede+kQ

>>mbb70+Qb
I think you mean findOneAndUpdate, and while simple, I wouldn't call it solid.

https://stackoverflow.com/a/76821755

>>mbb70+Qb
Agenda uses this, and we found out the hard way on mongo 4 that it can lead to mongo spinning the CPU at 100% if it gets too many at once. No idea if they've fixed it in later versions.

>>refibr+eo
This should be safe under SI (other than the ABA issue, which isn't even fixed with serializable). The update forces a W-W conflict, which is sufficient to make the behavior serializable under SI (and therefore, I think but am not sure, PG's RR level too).

>>fbdab1+S1
We do it with two columns, one is an integer identifying which process took the job and the second is the timestamp for when it was taken.

>>refibr+eo
The part about checking the number of affected rows hints at using `UPDATE ... WHERE ...` which should act as an atomic CAS regardless of isolation level.

Edit: To clarify, I mean `SELECT id WHERE used = 0` followed by `UPDATE ... SET used = 1 WHERE id = ... AND used = 0`
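A toy illustration of that interleaving with sqlite3, where two workers both see the same row before either updates it. The `items` table and function names are hypothetical. Everything runs sequentially on one connection, so it only shows the compare-and-set logic; it says nothing about how a particular database behaves with truly concurrent transactions at a given isolation level.

```python
import sqlite3


def select_candidate(conn):
    """SELECT id WHERE used = 0: returns a candidate id, or None."""
    row = conn.execute(
        "SELECT id FROM items WHERE used = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    return None if row is None else row[0]


def compare_and_set(conn, item_id):
    """UPDATE ... WHERE id = ? AND used = 0: True only for the first caller."""
    cur = conn.execute(
        "UPDATE items SET used = 1 WHERE id = ? AND used = 0", (item_id,)
    )
    conn.commit()
    return cur.rowcount == 1


def simulate_race():
    """Both workers read the same free row; only the first update takes effect."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, used INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO items (used) VALUES (0)")
    conn.commit()
    seen_by_a = select_candidate(conn)
    seen_by_b = select_candidate(conn)  # B reads before A has updated
    won_a = compare_and_set(conn, seen_by_a)
    won_b = compare_and_set(conn, seen_by_b)  # guard fails: used is already 1
    return seen_by_a, seen_by_b, won_a, won_b
```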

replies(2): >>matsem+JR >>Richie+iT

>>derede+kQ
This is spot on! We let the db provide the atomics.

>>derede+kQ
This works fine as long as you’re happy to do the same task multiple times. I.e. the task is idempotent and cheap.

replies(1): >>kdps+301

>>Richie+iT
I don't get it :(. Why could the same task be executed more than once? From my understanding, if the UPDATE is atomic, only one worker will be able to set `used = 1`. If the update statement is not successful (affected != 1), then the worker should drop the task and do another select.

replies(1): >>klause+e41

>>samus+jg
Exactly-once is only meaningfully possible if you have a rollback for tasks of unknown completion state - for example if the task involves manipulating the same database as the one controlling the task execution. Otherwise, it becomes the (impossible to solve) two-generals problem between updating the task status and performing the task.

replies(1): >>samus+qi1

>>fbdab1+S1
We have a "status flag" column which is either Available, Locked or Processed (A, L and P), an Updated column with a timestamp of when it was last updated, and a Version counter.

When grabbing a new message it selects "Available or (Locked with Updated timestamp older than configured timeout)". If successful it immediately tries to set the Locked status and the Updated timestamp and to bump the Version counter, where the previous values of Status and Version have to match. If the update fails it retries getting a new message.

If the Version counter is too high, it moves the message to the associated dead-letter table, and retries getting a new message.
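A rough sqlite3 sketch of that grab logic, for anyone who wants to play with it. The table names, column names, and the `MAX_VERSION` threshold are assumptions, not the actual schema described above. The current time is passed in so the lock timeout is easy to exercise.

```python
import sqlite3

MAX_VERSION = 3  # hypothetical threshold for dead-lettering


def make_store(bodies):
    """In-memory messages table plus an associated dead-letter table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT,"
        " status TEXT NOT NULL DEFAULT 'A',"  # A = Available, L = Locked, P = Processed
        " updated REAL NOT NULL DEFAULT 0,"
        " version INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("CREATE TABLE dead_letters (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO messages (body) VALUES (?)", [(b,) for b in bodies])
    conn.commit()
    return conn


def grab_message(conn, now, lock_timeout_s):
    """Lock and return (id, body) of the next message, or None if none is eligible."""
    while True:
        row = conn.execute(
            "SELECT id, body, status, version FROM messages"
            " WHERE status = 'A' OR (status = 'L' AND updated < ?)"
            " ORDER BY id LIMIT 1",
            (now - lock_timeout_s,),
        ).fetchone()
        if row is None:
            return None
        msg_id, body, status, version = row
        if version >= MAX_VERSION:
            # Grabbed too many times: move it to the dead-letter table.
            cur = conn.execute(
                "DELETE FROM messages WHERE id = ? AND status = ? AND version = ?",
                (msg_id, status, version),
            )
            if cur.rowcount == 1:
                conn.execute(
                    "INSERT INTO dead_letters (id, body) VALUES (?, ?)", (msg_id, body)
                )
            conn.commit()
            continue  # retry getting a new message
        cur = conn.execute(
            "UPDATE messages SET status = 'L', updated = ?, version = version + 1"
            " WHERE id = ? AND status = ? AND version = ?",  # previous values must match
            (now, msg_id, status, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return msg_id, body
        # update failed: someone else got there first, so retry
```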

This isn't for high performance. I tested it and got 1000 messages/sec throughput with a handful of producers and consumers against a test db instance (limited hardware), which would be plenty for us.

I wrote it to be simple and so we could easily move to something AMQP-ish like RabbitMQ or Azure Service Bus when needed. Overall quite easy to implement and has served us well so far.

>>kdps+301
With a transaction isolation level below SERIALIZABLE you can have two transactions that both read the old row (with `used = 0`) at the time they perform the update (but before they commit the transaction). In that case, both transactions will have performed an update (rows affected = 1).

Why would both transactions see `used = 0`? The DB server tries to isolate transactions and actively hides effects of other transactions that have not committed yet.

replies(1): >>singro+2b1

>>klause+e41
This is not true in postgres. When the second transaction tries to update the row, it will wait for the first transaction to commit first and then recheck the WHERE.

https://www.postgresql.org/docs/current/transaction-iso.html...

>>A1kmm+x11
Full agree here.

There is actually another possibility: there must be a way to check whether the receiving system has received the message. But this only works if there are no "rogue" senders.