- The service mentioned (now called https://webapp.io ) eventually made it into YC (S20) and still uses postgres as its pub/sub implementation, doing hundreds of thousands of messages per day. The postgres instance now runs on 32 cores and 128gb of memory and has scaled well.
- We bolstered Postgres's NOTIFY with Redis pub/sub for high-traffic code paths, but it's been nice having ACID guarantees as the default for less popular paths (e.g., webhook handling)
- This pattern only ever caused one operational incident: a transaction held a lock, which caused the notification queue to start growing and eventually (silently) stop sending messages. Starting Postgres with statement_timeout=(a few days) was enough to solve this.
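For the curious, the core of the pattern is just LISTEN/NOTIFY plus the timeout mentioned above. A rough sketch - the channel, table, and timeout value here are made up for illustration rather than our actual setup:

    -- session A subscribes to a channel
    LISTEN job_events;

    -- session B publishes; the notification is only delivered if the
    -- transaction commits, which is where the ACID guarantee comes from
    BEGIN;
    INSERT INTO webhook_log (payload) VALUES ('{"type": "invoice.paid"}');
    NOTIFY job_events, 'webhook:invoice.paid';
    COMMIT;

    -- the fix from the incident above: cap how long any single statement
    -- may run (value illustrative; the actual fix used a multi-day timeout)
    ALTER SYSTEM SET statement_timeout = '1h';
    SELECT pg_reload_conf();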
Previous discussion: https://news.ycombinator.com/item?id=21484215
Happy to answer any questions!
> statement_timeout=(a few days)
Wouldn't you want this to be a few seconds or minutes? Maybe I'm missing the point of setting this to days...
> The postgres instance now runs on 32 cores and 128gb of memory and has scaled well.
Am I the only one who thinks those specs sound like a lot for this workload?
For example, it handles Stripe webhooks when users change their pricing tier - if you drop that message, users would be paying for something they wouldn't receive.
Anyway, for a server that only does pub/sub with ACID guarantees, those specs are so large that some other bottleneck would almost certainly hit first. So it wouldn't be strange if somebody ended up with a box that size that still couldn't handle the load; it would just mean there's some issue somewhere we don't see.
What do you say to those who don't want Google to know their usage info?
And if your needs are simpler, as in this case, then there are dozens of smaller pub/sub/queue systems that you could compare this to.
Kafka does more for streaming data, but doesn't do squat for relational data. You always need a database, but you sometimes can get by without a queuing system.
I've personally written real-time back-of-house order tracking with Rails and Postgres pub/sub (no Redis!), and wrote a record-synchronization queuing system with a table and some clever lock semantics that has been running in production for several years now -- which marketing relies upon as it oversees 10+ figures of yearly topline revenue.
Neither of those projects was FAANG scale, but they work fine for what is needed and scale relatively cleanly with Postgres itself.
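Not the actual production schema, but the general shape of that table-plus-lock-semantics queue is roughly the following; SELECT ... FOR UPDATE SKIP LOCKED is one common way to let concurrent workers claim rows without blocking or double-processing each other (table and column names invented for illustration):

    -- hypothetical job table
    CREATE TABLE sync_jobs (
        id       bigserial PRIMARY KEY,
        payload  jsonb NOT NULL,
        done_at  timestamptz
    );

    -- a worker claims the oldest unfinished job and marks it done;
    -- SKIP LOCKED means other workers simply move on to the next row
    -- instead of waiting on (or double-claiming) this one
    UPDATE sync_jobs
    SET done_at = now()
    WHERE id = (
        SELECT id
        FROM sync_jobs
        WHERE done_at IS NULL
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload;

A real worker would typically claim the row, do the work, and only then mark it done, all inside one transaction, but the locking idea is the same.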
Besides, in a lot of environments corporate will only approve the use of certain tools. And if you already have one approved that does the job, then why not?
That said, it's been running for upwards of 4 years and has accumulated an insane number of temperature readings from inside and above heating vents (heat source is a heat pump).
    SELECT count() as count FROM temperatures
    msg : Object { _msgid: "421b3777.908118", topic: "SELECT count() as count FROM …", payload: 23278637 }
Ok, I need therapy for my data hoarding - 23 million temp samples is not a good sign :-)
Most senior+ engineers that I know would hear that and recoil. Getting "clever" with concurrency handling in your home-rolled queuing system is not something that coworkers, especially more senior coworkers, will appreciate inheriting, adapting, and maintaining. Believe me.
I get that you're trying to flex some cool thing that you built, but it doesn't really have any bearing on the concept of "most cases" because it's an anecdote. Queuing systems are a thing for a reason, and in most cases, using them makes more sense than writing your own.
I am both a "senior+ engineer" that has inherited such systems and an author of such systems. I think you're overreacting.
Concurrency Control (i.e., "lock semantics") exists for a reason: correctness. Using it for its designed purpose is not horror. Yes, like any tool, you need to use it correctly. But you don't just throw away correctness because you don't want to learn how to use the right tool properly.
I have inherited poorly designed concurrency systems (in the database); yes, I recoiled in horror and did not appreciate it. So you know what I did? I fixed the design, and documented it to show others how to do it correctly.
I have also inherited OOB "Queuing Systems" that could not possibly be correct because they weren't integrated into the DB's built-in and already-used correctness system: Transactions and Concurrency Control. Those were always more horrific than poorly-implemented in-DB solutions. Integrating two disparate stores is always more trouble than just fixing one single source.
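To make that concrete: when the queue lives in the same database, the enqueue can share a transaction with the business write, so the two can never disagree. A minimal sketch with hypothetical table names:

    BEGIN;
    -- the business change and the follow-up work either both commit or both roll back
    UPDATE subscriptions SET tier = 'pro' WHERE user_id = 42;
    INSERT INTO outbox (event) VALUES ('{"user_id": 42, "event": "tier_changed"}');
    COMMIT;

Write to an external queue outside that transaction and the two stores can disagree after a crash; reconciling them is exactly the integration trouble I'm describing.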
----
> I get that you're trying to flex some cool thing that you built, but it doesn't really have any bearing on the concept of "most cases" because it's an anecdote. Queuing systems are a thing for a reason, and in most cases, using them makes more sense than writing your own.
I get that you're trying to flex that you use turnkey Queueing Systems, but it doesn't really have any bearing on the concept of "most cases", because all you've presented are assertions without backing. Queuing systems are good for a specific kind of job, but when you need relational logic you'd better use one that supports it. And despite what MongoDB and the NoSQL crowd have been screaming hoarsely for the past decade, in most cases you have relational logic.
But seriously though, postgres's relational logic implementation makes for a very good queueing system for most cases. It's not a hack that's bolted on top. I know that's how quite a few "DBs" are designed and implemented, and maybe you've been burned by too many of them, but Postgres is solid. I've seen it inside and out.
My point is that postgres is a swiss army knife and you and anyone else would be remiss to not fully understand what it is capable of and what you can do with it. Entire classes of software baggage can be eliminated for "most" use cases. One could even argue that reaching for all these extra fancy specialized tools is a premature optimization. Plus, who could possibly argue against having fewer moving parts?
Sometimes I agree with "best tool for the job" - when the constraints make something a very clear winner. But if the difference is marginal for the particular case at hand, I pick what I/we know (I would actually argue that IS the best tool for the job, though in absolute 'what could happen in the future' terms it probably is not).
This is not much load at all; an iPhone running RabbitMQ could process many millions of messages per day. Even 1M messages per day is only about 11 messages per second on average, i.e. not taxing at all.
But the queue grows precisely because some notifications aren’t getting delivered, right?
- Closed/lock-in vs. Open/lock-free
- Rigid data access pattern vs. Very flexible SQL access
- Managed by AWS vs. Managed by you/your team (although you could use one of those managed Postgres services to reduce ops burden)
- Integrates well with other AWS services (e.g. Lambda, SNS, DynamoDB, etc.) vs. No integrations with the AWS ecosystem out of the box
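As one illustration of the "flexible SQL access" point above: with the queue in Postgres you can inspect or repair it with ordinary queries, which has no real SQS equivalent (table and column names are hypothetical):

    -- how big is the backlog, broken down by message type?
    SELECT payload->>'type' AS type, count(*) AS backlog
    FROM sync_jobs
    WHERE done_at IS NULL
    GROUP BY 1
    ORDER BY 2 DESC;

    -- drop a poison message without draining or replaying the queue
    DELETE FROM sync_jobs WHERE id = 12345;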
I find it amusing that we happily play these AAA gaming experiences that are totally fantastical in their ability to deal with millions of things per frame and then turn around and pretend like hundreds of thousands of things per day is some kind of virtue.
That means you can't have Docker and different versions of Java, Node, and .NET all running in parallel.
You run a single process, and SQLite is a library that builds the SQL database into that process. Your 'budget' is something like 100 MB of RAM, because other stuff has to run too.
All the time-series databases I know are large, memory-hungry hippos built for distributed compute/Kubernetes. Just a very different use case. If one were built with minimalism in mind, then it could be used.
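For what it's worth, plain SQLite already gets surprisingly far as a minimal time-series store under that kind of budget - one file, no server, opened by the single process you're already running. A sketch with an invented schema:

    -- one row per reading, indexed by time
    CREATE TABLE IF NOT EXISTS temperatures (
        taken_at  INTEGER NOT NULL,   -- unix epoch seconds
        sensor    TEXT    NOT NULL,
        celsius   REAL    NOT NULL
    );
    CREATE INDEX IF NOT EXISTS temperatures_by_time ON temperatures (taken_at);

    -- hourly averages for the last 24 hours, no cluster required
    SELECT strftime('%Y-%m-%d %H:00', taken_at, 'unixepoch') AS hour,
           avg(celsius) AS avg_celsius
    FROM temperatures
    WHERE taken_at > CAST(strftime('%s', 'now', '-1 day') AS INTEGER)
    GROUP BY hour
    ORDER BY hour;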
In fact, the article makes a very good point: just doing it in Postgres is great even though it doesn't really scale (because of ACID), and adapting it for scale after you actually need it will lead to a better design than what you would get if you started optimizing without any information.
Personally I do see the niceness of having a good pattern implemented using existing technology. Less deployment nonsense, less devops, less complexity, a few tables at most. I've done similar things in the past, it is nice.
For anyone who'd criticize: having complex deployments can cost just about as much dev time, AND if this is implemented well, they can theoretically convert the whole thing to RabbitMQ with minimal effort just by swapping the queueing system.
In any case, happy to see people mentioning how using existing simple tech can lead to fairly simple to manage systems, and still solve the problems you're trying to solve.
My experience with SQLite is that it can take you a long ways before needing to look elsewhere.