But once you get up to even, say, 40 messages per second with 100 worker processes, you're already at 4,000 updates per second just to settle which worker got to claim which job, and anything beyond that quickly becomes untenable.
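To make that arithmetic concrete, here's a minimal sketch, assuming a race-to-claim scheme where every worker attempts to grab every job (the table and column names are illustrative, not from any particular system):

    -- each worker runs this for each job it sees; only one wins the race
    UPDATE jobs
       SET claimed_by = :worker_id,
           status     = 'running'
     WHERE id = :job_id
       AND claimed_by IS NULL;
    -- 1 row affected: we claimed it; 0 rows: another worker beat us

With 100 workers each racing for 40 jobs per second, that statement runs roughly 40 x 100 = 4,000 times per second, and nearly all of those executions lose the race.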
That’s what our workload is like for our SaaS code analysis platform. We create a few tasks (~10 max) for every customer submission (usually triggered by a code push). We replaced Kafka with a PostgreSQL table a couple of years ago.
We made the schema, functions, and Grafana dashboard open source [0]. I think it's slightly out of date but mostly matches what we now have in production, and it has been running flawlessly.
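For anyone who hasn't seen the pattern, here's a rough sketch of the general shape of a Postgres-backed queue -- illustrative only, and not necessarily how the schema in [0] is laid out:

    CREATE TABLE task_queue (
        id         bigserial   PRIMARY KEY,
        payload    jsonb       NOT NULL,
        status     text        NOT NULL DEFAULT 'pending',
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- each worker claims at most one pending task; FOR UPDATE SKIP LOCKED
    -- lets concurrent workers skip rows another transaction is claiming
    UPDATE task_queue
       SET status = 'running'
     WHERE id = (
             SELECT id
               FROM task_queue
              WHERE status = 'pending'
              ORDER BY id
              LIMIT 1
                FOR UPDATE SKIP LOCKED
           )
     RETURNING id, payload;

The SKIP LOCKED part is what keeps claiming cheap: each job ends up claimed by exactly one UPDATE, so write volume scales with the number of jobs rather than the number of workers.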
of _course_ the system architecture had to have a job queue and it had to be highly available (implemented with a rabbitmq cluster)
what we learned after a few months in production was that the only time the rabbitmq cluster had outages was when it got confused* and incorrectly decided there was a network partition, flipped into partition recovery mode, and caused a partial outage until production support could manually recover the cluster
the funny thing is that our job throughput was incredibly low, and we would have had better availability if we had skipped the temperamental rabbitmq cluster and instead implemented the job queue in the non-HA postgres instance that was already in the design --- if our postgres server went down then the whole app was stuffed anyway!
* this was a rabbitmq defect that showed up when TLS encryption was enabled and very large messages were pushed through the queue -- rabbitmq would be so busy encrypting/decrypting the large messages that it would miss heartbeats, then time out, decide the missing heartbeats meant a network partition, and trigger an outage that needed manual recovery. i think rabbitmq fixed that a few years back