- The service mentioned (now called https://webapp.io ) eventually made it into YC (S20) and still uses postgres as its pub/sub implementation, doing hundreds of thousands of messages per day. The postgres instance now runs on 32 cores and 128gb of memory and has scaled well.
- We bolstered Postgres's NOTIFY with Redis pub/sub for high-traffic code paths, but it's been nice having ACID guarantees as the default for less popular paths (e.g., webhook handling)
- This pattern only ever caused one operational incident: a transaction held a lock, which caused the notification queue to start growing and eventually (silently) stop sending messages. Starting Postgres with statement_timeout=(a few days) was enough to solve this.
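For the curious, the core of the pattern is just LISTEN/NOTIFY plus the timeout mentioned above. A rough sketch - the channel, table, and timeout value here are made up for illustration rather than our actual setup:

    -- session A subscribes to a channel
    LISTEN job_events;

    -- session B publishes; the notification is only delivered if the
    -- transaction commits, which is where the ACID guarantee comes from
    BEGIN;
    INSERT INTO webhook_log (payload) VALUES ('{"type": "invoice.paid"}');
    NOTIFY job_events, 'webhook:invoice.paid';
    COMMIT;

    -- the fix from the incident above: cap how long any single statement
    -- may run (value illustrative; the actual fix used a multi-day timeout)
    ALTER SYSTEM SET statement_timeout = '1h';
    SELECT pg_reload_conf();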
Previous discussion: https://news.ycombinator.com/item?id=21484215
Happy to answer any questions!
> statement_timeout=(a few days)
Wouldn't you want this to be a few seconds or minutes? Maybe I'm missing the point of setting this to days...
> The postgres instance now runs on 32 cores and 128gb of memory and has scaled well.
Am I the only one who thinks those specs sound like a lot for this workload?
For example, it handles Stripe webhooks when users change their pricing tier - if you drop that message, users would be paying for something they wouldn't receive.
Anyway, for a server that only does pub/sub with ACID guarantees, those specs are so large that some other bottleneck would almost certainly hit first. So it wouldn't be strange if somebody ended up with a box that size that still couldn't handle the load; it would just mean there's some issue somewhere we don't see.
What do you say to those who don't want Google to know their usage info?
And if your needs are simpler, as in this case, then there are dozens of smaller pub/sub/queue systems that you could compare this to.
Kafka does more for streaming data, but doesn't do squat for relational data. You always need a database, but you sometimes can get by without a queuing system.
I've personally written real-time back-of-house order tracking with Rails and Postgres pub/sub (no Redis!), and wrote a record-synchronization queuing system with a table and some clever lock semantics that has been running in production for several years now -- which marketing relies upon as it oversees 10+ figures of yearly topline revenue.
Neither of those projects was FAANG scale, but they work fine for what is needed and scale relatively cleanly with Postgres itself.
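Not the actual production schema, but the general shape of that table-plus-lock-semantics queue is roughly the following; SELECT ... FOR UPDATE SKIP LOCKED is one common way to let concurrent workers claim rows without blocking or double-processing each other (table and column names invented for illustration):

    -- hypothetical job table
    CREATE TABLE sync_jobs (
        id       bigserial PRIMARY KEY,
        payload  jsonb NOT NULL,
        done_at  timestamptz
    );

    -- a worker claims the oldest unfinished job and marks it done;
    -- SKIP LOCKED means other workers simply move on to the next row
    -- instead of waiting on (or double-claiming) this one
    UPDATE sync_jobs
    SET done_at = now()
    WHERE id = (
        SELECT id
        FROM sync_jobs
        WHERE done_at IS NULL
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload;

A real worker would typically claim the row, do the work, and only then mark it done, all inside one transaction, but the locking idea is the same.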
Besides, in a lot of environments corporate will only approve the use of certain tools. And if you already have one approved that does the job, then why not?
That said, it's been running for upwards of 4 years and has accumulated an insane number of temperature readings from inside and above heating vents (heat source is a heat pump).
    SELECT count() as count FROM temperatures
    msg : Object { _msgid: "421b3777.908118", topic: "SELECT count() as count FROM …", payload: 23278637 }
Ok, I need therapy for my data hoarding - 23 million temp samples is not a good sign :-)
Most senior+ engineers that I know would hear that and recoil. Getting "clever" with concurrency handling in your home-rolled queuing system is not something that coworkers, especially more senior coworkers, will appreciate inheriting, adapting, and maintaining. Believe me.
I get that you're trying to flex some cool thing that you built, but it doesn't really have any bearing on the concept of "most cases" because it's an anecdote. Queuing systems are a thing for a reason, and in most cases, using them makes more sense than writing your own.
I am both a "senior+ engineer" that has inherited such systems and an author of such systems. I think you're overreacting.
Concurrency Control (i.e., "lock semantics") exists for a reason: correctness. Using it for its designed purpose is not horror. Yes, like any tool, you need to use it correctly. But you don't just throw away correctness because you don't want to learn how to use the right tool properly.
I have inherited poorly designed concurrency systems (in the database); yes, I recoiled in horror and did not appreciate it. So you know what I did? I fixed the design, and documented it to show others how to do it correctly.
I have also inherited OOB "Queuing Systems" that could not possibly be correct because they weren't integrated into the DB's built-in and already-used correctness system: Transactions and Concurrency Control. Those were always more horrific than poorly-implemented in-DB solutions. Integrating two disparate stores is always more trouble than just fixing one single source.
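To make that concrete: when the queue lives in the same database, the enqueue can share a transaction with the business write, so the two can never disagree. A minimal sketch with hypothetical table names:

    BEGIN;
    -- the business change and the follow-up work either both commit or both roll back
    UPDATE subscriptions SET tier = 'pro' WHERE user_id = 42;
    INSERT INTO outbox (event) VALUES ('{"user_id": 42, "event": "tier_changed"}');
    COMMIT;

Write to an external queue outside that transaction and the two stores can disagree after a crash; reconciling them is exactly the integration trouble I'm describing.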
----
> I get that you're trying to flex some cool thing that you built, but it doesn't really have any bearing on the concept of "most cases" because it's an anecdote. Queuing systems are a thing for a reason, and in most cases, using them makes more sense than writing your own.
I get that you're trying to flex that you use turnkey Queueing Systems, but it doesn't really have any bearing on the concept of "most cases", because all you've presented are assertions without backing. Queuing systems are good for a specific kind of job, but when you need relational logic you'd better use one that supports it. And despite what MongoDB and the NoSQL crowd have been screaming hoarsely for the past decade, in most cases you have relational logic.
But seriously though, postgres's relational logic implementation makes for a very good queueing system for most cases. It's not a hack that's bolted on top. I know that's how quite a few "DBs" are designed and implemented, and maybe you've been burned by too many of them, but Postgres is solid. I've seen it inside and out.
My point is that postgres is a swiss army knife and you and anyone else would be remiss to not fully understand what it is capable of and what you can do with it. Entire classes of software baggage can be eliminated for "most" use cases. One could even argue that reaching for all these extra fancy specialized tools is a premature optimization. Plus, who could possibly argue against having fewer moving parts?
Sometimes I agree with "best tool for the job" - when the constraints make something a very clear winner. But if the difference is marginal for the particular case at hand, I pick what I/we know (I would actually argue that IS the best tool for the job, though in absolute 'what could happen in the future' terms it probably is not).
This is not much load at all; an iPhone running RabbitMQ could process many millions of messages per day. Even 1M messages per day is only about 11 messages per second on average, i.e. not taxing at all.
But the queue grows precisely because some notifications aren’t getting delivered, right?
- Closed/lock-in vs. Open/lock-free
- Rigid data access pattern vs. Very flexible SQL access
- Managed by AWS vs. Managed by you/your team (although you could use one of those managed Postgres services to reduce ops burden)
- Integrates well with other AWS services (e.g. Lambda, SNS, DynamoDB, etc.) vs. No integrations with the AWS ecosystem out of the box
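As one illustration of the "flexible SQL access" point above: with the queue in Postgres you can inspect or repair it with ordinary queries, which has no real SQS equivalent (table and column names are hypothetical):

    -- how big is the backlog, broken down by message type?
    SELECT payload->>'type' AS type, count(*) AS backlog
    FROM sync_jobs
    WHERE done_at IS NULL
    GROUP BY 1
    ORDER BY 2 DESC;

    -- drop a poison message without draining or replaying the queue
    DELETE FROM sync_jobs WHERE id = 12345;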
I find it amusing that we happily play these AAA gaming experiences that are totally fantastical in their ability to deal with millions of things per frame and then turn around and pretend like hundreds of thousands of things per day is some kind of virtue.
That means you can't have Docker and different versions of Java, Node, and .NET all running in parallel.
You run a single process, and SQLite is a library that builds the SQL database into that process. Your 'budget' is something like 100 MB of RAM, because other stuff has to run too.
All the time-series databases I know are large, memory-hungry hippos built for distributed compute/Kubernetes. Just a very different use case. If one were built with minimalism in mind, then it could be used.
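For what it's worth, plain SQLite already gets surprisingly far as a minimal time-series store under that kind of budget - one file, no server, opened by the single process you're already running. A sketch with an invented schema:

    -- one row per reading, indexed by time
    CREATE TABLE IF NOT EXISTS temperatures (
        taken_at  INTEGER NOT NULL,   -- unix epoch seconds
        sensor    TEXT    NOT NULL,
        celsius   REAL    NOT NULL
    );
    CREATE INDEX IF NOT EXISTS temperatures_by_time ON temperatures (taken_at);

    -- hourly averages for the last 24 hours, no cluster required
    SELECT strftime('%Y-%m-%d %H:00', taken_at, 'unixepoch') AS hour,
           avg(celsius) AS avg_celsius
    FROM temperatures
    WHERE taken_at > CAST(strftime('%s', 'now', '-1 day') AS INTEGER)
    GROUP BY hour
    ORDER BY hour;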
In fact, the article makes a very good point: just doing it in Postgres is great even though it doesn't really scale (because of ACID), and adapting it for scale after you actually need it will lead to a better design than what you would get if you started optimizing without any information.
Personally I do see the niceness of having a good pattern implemented using existing technology. Less deployment nonsense, less devops, less complexity, a few tables at most. I've done similar things in the past, it is nice.
For anyone who'd criticize: having complex deployments can cost just about as much dev time, AND if this is implemented well, they can theoretically convert the whole thing to RabbitMQ with minimal effort just by swapping the queueing system.
In any case, happy to see people mentioning how using existing simple tech can lead to fairly simple to manage systems, and still solve the problems you're trying to solve.
My experience with SQLite is that it can take you a long ways before needing to look elsewhere.