1. shoo (OP) 2021-12-18 02:57:32
a few years back i worked on an enterprisey project that used postgres as a database, along with rabbitmq and celery for job processing.

of _course_ the system architecture had to have a job queue and it had to be highly available (implemented with a rabbitmq cluster)

what we learned after a few months in production was that the only time the rabbitmq cluster had outages was when it got confused* and thought (incorrectly) that there was a network partition, flipped into partition recovery mode, and caused a partial outage until production support could manually recover the cluster
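(for anyone curious, the behaviour at that point is governed by rabbitmq's cluster_partition_handling setting -- a rough sketch of the relevant config, not what we actually ran; pause_minority below is just an illustrative value:)

    # rabbitmq.conf (sketch only)
    # controls what the cluster does when it believes a partition has happened.
    # the default, "ignore", is the one that leaves you doing manual recovery;
    # "pause_minority" and "autoheal" are the automatic strategies.
    cluster_partition_handling = pause_minority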

the funny thing about this is that our job throughput was incredibly low, and we would have had better availability if we had avoided adding the temperamental rabbitmq cluster and instead implemented the job queue in our non-HA postgres instance that was already in the design -- if our postgres server went down then the whole app was stuffed anyway!
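to be concrete about what i mean by a job queue in postgres -- roughly the sketch below (table and column names are invented, not our actual code), using SELECT ... FOR UPDATE SKIP LOCKED so several workers can claim jobs without fighting over the same rows:

    import psycopg2  # any postgres driver would do; psycopg2 is just an example

    # claim one pending job; SKIP LOCKED makes concurrent workers skip rows
    # that another worker has already locked (needs postgres 9.5+)
    CLAIM_SQL = """
        UPDATE jobs
           SET status = 'running', started_at = now()
         WHERE id = (
                 SELECT id
                   FROM jobs
                  WHERE status = 'pending'
                  ORDER BY created_at
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED
               )
        RETURNING id, payload;
    """

    def claim_next_job(conn):
        # the connection context manager commits on success, rolls back on error
        with conn, conn.cursor() as cur:
            cur.execute(CLAIM_SQL)
            return cur.fetchone()  # None when the queue is empty

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=app")  # placeholder connection string
        print("claimed:", claim_next_job(conn))

(in practice you'd also want something to retry or time out jobs stuck in 'running', but that's the gist)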

* this was a rabbitmq defect when TLS encryption was enabled and very large messages were jammed through the queue -- rabbitmq would be so busy encrypting / decrypting large messages that it'd forget to heartbeat, then it'd time out and panic that it hadn't got any heartbeats, then assume that no heartbeats implied a network partition, and cause an outage that needed manual recovery. i think rabbitmq fixed that a few years back
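(if i'm reading that failure mode right, the inter-node heartbeats in question are the erlang "net tick"; the tolerance can be raised via the kernel net_ticktime parameter, e.g. in advanced.config -- the 120 below is just an illustrative value, not something we tuned:)

    %% advanced.config (sketch only)
    %% default net_ticktime is 60 seconds; a node that misses ticks for
    %% roughly that long is declared down, which is what triggers the
    %% "network partition" handling described above
    [
      {kernel, [
        {net_ticktime, 120}
      ]}
    ].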
