zlacker

[return to "Choose Postgres queue technology"]
1. mianos+It[view] [source] 2023-09-25 01:21:25
>>bo0tzz+(OP)
Skype used Postgres as a queue, with a small plugin, to process all their CDRs many years ago. I have no idea if it's used these days, but it was 'web scale' 10 years ago. Just working, while people on the internet argued that using a database as a queue is an anti-pattern.

Having transactions is quite handy.

https://wiki.postgresql.org/wiki/SkyTools

I did a few talks on this at Sydpy as I used it at work quite a bit. It's handy when you already have postgresql running well and supported.
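
For reference, the bare-bones version of the pattern with plain Postgres looks something like this. It's a minimal sketch, not PGQ/SkyTools' actual API, and the table, columns, and connection string are made up:

    # Minimal sketch of a Postgres-backed queue using SKIP LOCKED.
    # Assumes a hypothetical table: jobs(id serial, payload jsonb, done boolean).
    import psycopg2

    conn = psycopg2.connect("dbname=app")  # connection string is illustrative

    def handle(payload):
        print("processing", payload)  # stand-in for real worker logic

    def claim_and_process_one():
        with conn:  # one transaction: claim, process, and mark done atomically
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT id, payload FROM jobs
                    WHERE NOT done
                    ORDER BY id
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1
                """)
                row = cur.fetchone()
                if row is None:
                    return False  # queue is empty
                job_id, payload = row
                handle(payload)  # a crash here rolls the claim back
                cur.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
        return True

The transaction is what makes it handy: if the worker dies mid-job, the claim rolls back and another worker picks the row up.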

That said, I'd use a dedicated queue these days. Anything but RabbitMQ.

2. reuben+oz[view] [source] 2023-09-25 02:24:53
>>mianos+It
> Anything but RabbitMQ.

Would you mind elaborating on this? I'd be happy for others to chime in with their experiences/opinions, too.

3. phamil+fI[view] [source] 2023-09-25 04:05:52
>>reuben+oz
I can share our experience with RabbitMQ/SQS/Sidekiq. Our two major issues have been around the retry mechanism and resource bottlenecks.

The key retry problem is "What happens when a worker crashes?".

RabbitMQ solves this problem by tying "unacknowledged messages" to a TCP connection. If the connection dies, the in-flight messages are made available to other connections. This is a decent approach, but we hit a lot of issues with bugs in our code that would fail to acknowledge a message, and the message would get stuck until that handler cycled. They've improved this over the past year or so with consumer timeouts, but we've already moved on.
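
To make the failure mode concrete, it's basically a consumer that takes a delivery and never acks it; a rough pika sketch (queue name made up):

    # Sketch of a RabbitMQ consumer with manual acks (pika).
    # If handle() raises, or a code path never reaches basic_ack, the message
    # stays "unacknowledged" and is only requeued when this connection dies
    # (or, on newer RabbitMQ versions, when the consumer timeout fires).
    import pika

    def handle(body):
        print("processing", body)

    def on_message(channel, method, properties, body):
        handle(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)  # skip this and the message is stuck

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="work")  # hypothetical queue name
    channel.basic_consume(queue="work", on_message_callback=on_message, auto_ack=False)
    channel.start_consuming()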

The second problem we hit with RabbitMQ was that it uses one Erlang process per queue, and we found that big bursts of traffic could saturate a single CPU. There are ways to use sharded queues, or to re-architect around dynamically created queues, but the complexity led us towards SQS.
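
The sharded-queue workaround is conceptually simple, just annoying to carry everywhere: pick one of N queues by hashing a routing key. An illustrative sketch, with made-up names:

    # Client-side queue sharding to spread load across N queues
    # (and therefore N Erlang processes). Queue names are made up.
    import zlib
    import pika

    NUM_SHARDS = 8

    def shard_queue(routing_key):
        # stable hash so the same key always lands on the same shard
        return "work.%d" % (zlib.crc32(routing_key.encode()) % NUM_SHARDS)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    for i in range(NUM_SHARDS):
        channel.queue_declare(queue="work.%d" % i)

    channel.basic_publish(exchange="", routing_key=shard_queue("user-1234"),
                          body=b'{"job": "send_email"}')

Consumers then have to subscribe to all N shards, which is exactly the kind of complexity that pushed us towards SQS.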

Sidekiq solves "What happens when a worker crashes?" by just not solving it. In the free version, those jobs are just lost. In Sidekiq Pro there are features that provide some guarantees that the jobs will not be lost, but no guarantees about when they will be processed (nor where they will be processed). Simply put, some worker sometime will see the orphaned job and decide to give it another shot. It's not super common, but it is worse in containerized environments where memory limits can trigger the OOM killer and cause a worker to die immediately.

The other issue with Sidekiq has been a general lack of hard constraints around resources. A single event thread in Redis means that when things go sideways it breaks everything. We've had errant jobs enqueued with 100MB of JSON and seen it jam things up badly when Sidekiq tries to parse that with a Lua script (on the event thread). While it's obvious that 100MB is too big to shove into a queue, mistakes happen, and tools that limit the blast radius add a lot of value.
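
A cheap guard at enqueue time catches a lot of that class of mistake. This is a generic illustration against a bare Redis list, not Sidekiq's actual API, and the limit is arbitrary:

    # Illustrative guard: refuse to enqueue oversized payloads so one bad job
    # can't monopolize the single Redis event thread. Limit is arbitrary.
    import json
    import redis

    MAX_PAYLOAD_BYTES = 64 * 1024  # arbitrary cap for illustration

    r = redis.Redis()

    def enqueue(queue, job):
        payload = json.dumps(job)
        if len(payload.encode("utf-8")) > MAX_PAYLOAD_BYTES:
            raise ValueError("payload too large; store the blob elsewhere and enqueue a reference")
        r.rpush(queue, payload)

    enqueue("default", {"class": "ReportJob", "args": [42]})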

We've been leaning heavily on SQS the past few years and it is indeed Simple. It blocks us from doing even marginally dumb things (max message size of 256KB). The visibility timeout approach to handling crashed workers is easy to reason about. DLQ tooling has finally improved, so you can redrive through standard AWS tools. There are some gaps we struggle with (e.g. firing callbacks when a set of messages is fully processed), but sometimes simple tools force you to simplify things on your end, and that ends up being a good thing.
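
The consume loop ends up looking something like this (boto3 sketch; the queue URL and timeout values are placeholders):

    # Sketch of the SQS consume loop. A message that is received but never
    # deleted reappears for other consumers once its visibility timeout lapses,
    # which is how crashed workers are handled. Queue URL is a placeholder.
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work"  # placeholder

    def handle(body):
        print("processing", body)

    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,    # long polling
            VisibilityTimeout=60,  # how long a crashed worker "holds" a message
        )
        for msg in resp.get("Messages", []):
            handle(msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])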

4. chucke+B61[view] [source] 2023-09-25 08:58:09
>>phamil+fI
SQS limits you further in other ways. For instance, scheduled tasks are capped at 15 minutes (the DelaySeconds knob), so you'll be stuck when implementing the "cancel account if not verified in 7 days" workflow. You'll either re-enqueue a message every 15 minutes until it's ready (and eat the SQS costs), or build a bespoke solution just for scheduled tasks using some other store (usually the database) and another polling loop (at a fraction of the quality of any other OSS tool). This is a problem Sidekiq solves well, despite the other drawbacks you mention.

Bottom line, there is no silver bullet.

5. mtlgui+tf1[view] [source] 2023-09-25 10:32:51
>>chucke+B61
If you wanted to handle this scenario with the serverless AWS stack, my recommendation would be to push records to DynamoDB with TTLs and then, when they expire, have a Lambda push them onto the queue. It would cost almost nothing to do this. If you had 10 million requests a month, your Lambda cost would be ~$150 to run this (depending on duration, but just pushing to a queue should be quick). Dynamo would be another ~$50 to run, depending on how big your tasks are.
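
Roughly, under made-up table/queue/field names (and with the caveat that TTL deletion isn't second-precise, it lands some time after expiry, which is fine for "7 days-ish" scheduling):

    # Sketch of the Dynamo-TTL -> Lambda -> SQS flow. Table, queue, and field
    # names are made up. Requires TTL enabled on "expires_at" and a DynamoDB
    # Stream on the table that includes old images.
    import json
    import boto3

    dynamodb = boto3.client("dynamodb")
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work"  # placeholder

    def schedule(task_id, payload, run_at_epoch):
        # producer side: one item per scheduled task, expiring at the due time
        dynamodb.put_item(
            TableName="scheduled_tasks",
            Item={
                "task_id": {"S": task_id},
                "payload": {"S": json.dumps(payload)},
                "expires_at": {"N": str(int(run_at_epoch))},
            },
        )

    def lambda_handler(event, context):
        # consumer side: the stream triggers this Lambda; TTL expirations
        # arrive as REMOVE records, which we forward onto the queue
        for record in event["Records"]:
            if record["eventName"] != "REMOVE":
                continue
            old = record["dynamodb"]["OldImage"]
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=old["payload"]["S"])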

Granted now you need 3 services instead of 1. I personally don't find the maintenance cost particularly high for this architecture, but it does depend on what your team is comfortable with.
