1. skytre+(OP)[view] [source] 2021-06-12 20:09:37
> For a simple suggestion, the job queue can live in a schema that is owned by a job-queue service-account role. Nobody other than that user (or a DB superuser) can issue any query in there.

Yes, service accounts are fine and dandy, but [a] aren't those extra hoops borne of choosing the wrong tool for the job (you have to stay mindful of those permissions, one Ansible typo can undo them, etc.), and [b] as I said, a proper queue service would not even allow tampering with enqueued data, at no extra cost beyond installing the service. In contrast, your service account falling into a malicious actor's hands is still a threat vector.
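
Just to make [a] concrete, the suggestion boils down to something like this (a hypothetical Postgres sketch; the role and schema names are mine):

    -- as a superuser: a dedicated role that owns the queue schema
    CREATE ROLE jobqueue_svc LOGIN;
    CREATE SCHEMA jobqueue AUTHORIZATION jobqueue_svc;
    -- belt and braces: make sure nobody else can even look inside
    REVOKE ALL ON SCHEMA jobqueue FROM PUBLIC;

Each of those statements is provisioning state you now have to keep correct forever, and that's exactly the hoop-jumping I mean.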

> For a more complex — but perhaps useful? — option ...

Now this is just a queue service with extra steps, running in a relational DB instead of natively as an OS process. You did cite it as just an option, but I don't see what makes it an attractive one.

> I didn't see any mention of anything I'd consider an OpEx cost.

It seems to me we have differing definitions of OpEx and, judging by your previous comment, you value this almost exclusively in terms of "the number of ops salaries you have to pay". Even if I play by that definition, I can tell you from experience (corroborated by other commenters in other subthreads of this discussion) that operating a Redis cluster does not need extra warm bodies. The people who already monitor that your app responds and that your servers aren't running out of memory can take on the task with little to no additional training.

The experience I want to cite, bearing in mind of course that everyone's mileage varies: in my previous job, with a customer base of 2M and growing across two countries, we operated and monitored a redundant Redis cluster with an ops team of 3. In my first job, which I talked about in another subthread here, we managed three high-traffic Redis queues (and a few other smaller-traffic special cases) with a team of 7. In both jobs we also did active across-the-stack development; we weren't dedicated to monitoring, nor to the components related to the queues.

In fact, I would argue that running a DB (queue or no queue) is just more complex than running a queue service. Rather telling is the need for "DB Administrators/Specialists" with no equivalent role for Redis or other proper queueing solutions.

> Which is usually also a cost already paid for you in advance when you use an abstract job-queue library. They all know how to expose Prometheus metrics endpoints...

Honestly, this sounds fine, but I'm rather wary of the caveats that might be present. How do they do it efficiently? What do I need to keep in mind to keep things efficient? So unless you can point me to your preferred job-queue library that does all these wonders, I have to reserve judgment.

Edit: I just remembered mentioning this in another subthread. Even if you have a fantastic library/middleware abstracting all those queue concerns, it ties you to that library. If someone wants to build integrations into your queue, they have to play by that library's rules. If the majority of your app is in Java (and so is the library), and Data Science (who, in this example, use Python; not improbable) wants to share your data pipeline, and that library isn't available to them, tough luck I guess? More dev time for someone.

And also, whatever underlying DB features or semantics your library relies on to enforce a queue structure, you can't be assured that future versions of your DB will support those features because, you know, your DB isn't really in the business of queues. That opens you up to running an outdated DB version just to keep the queue functionality.

replies(1): >>derefr+z6
2. derefr+z6[view] [source] 2021-06-12 21:21:35
>>skytre+(OP)
> Now this is just a queue service with extra steps running in a relational DB instead of natively as an OS process. You did cite it as just an option but I don't see why this is an attractive option.

Your DB then shares a WAL log with your queue. Meaning a single managed physical replication pipeline for them both. Meaning only one set of leader-election issues to debug, not two. Meaning one canonical way to do geographic high-latency async replication. Meaning disaster recovery brings back a whole-system consistent snapshot state. Etc.

Honestly, if I had my way, every stateful component in the stack would share a single WAL log. That’s what FoundationDB and the like get you.

> In fact I would argue running a DB (queue or no queue) is just more complex than running a queue service.

Well, yeah, but you usually need a DB. So, if you’re going to be paying the OpEx costs of the DB either way, then you may as well understand it deeply in order to wring the most productive use you can out of each OpEx dollar/man-hour spent.

(I feel the same way about Redis, as it happens: if you need it, and are locking your code into its model anyway, then you may as well take advantage of its more arcane features, like Redis Streams, Lua scripting, etc.)

However, maybe our company is uncommon in how much our service literally is doing fancy, complex DB queries that use tons of DB features. We’re a data analytics company. Even the frontend people know arcane SQL here :)

> that ties you in with that library

The difference between what you / apps / abstracting libraries do in Redis, and what they do in an SQL DB, is that in the DB, the shape of everything has to be explained in a vendor-neutral manner: SQL DDL.

Sometimes Redis-based solutions converge on conventional schemas; see e.g. Sidekiq’s informal schema, which several other queuing systems are implemented in terms of. But when they don’t, there’s nothing you can really do — beyond hacking on the libraries involved — to bring them into sync.

In an SQL DB, anything can be adapted into the expected shape of anything else, by defining SQL views. (Heck, in an SQL DB with Redis support, like Postgres with redis_fdw, the Redis data can be adapted into any shape you like using SQL views.)
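
To illustrate with hypothetical names (a "tasks" table left behind by one library, a "jobs" shape expected by another; neither is from a real project):

    -- expose the existing "tasks" rows in the "jobs" shape
    -- that the other consumer expects
    CREATE VIEW jobs AS
    SELECT task_id   AS id,
           task_body AS payload,
           created   AS enqueued_at
      FROM tasks;

And since a simple single-table view like that is updatable in most RDBMSes, the other consumer can usually write through it, too.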

And that’s further enabled by the fact that the DB has received a schema from the app, through DDL: a schema you can examine, manipulate, and refactor, or even synthesize together with other schemas.

> you can't be assured that future versions of your DB would support those features

You can if those features are in the SQL standard. I’ve never heard of a DBMS regressing on its level of SQL standard support.

replies(1): >>skytre+ob
3. skytre+ob[view] [source] [discussion] 2021-06-12 22:11:08
>>derefr+z6
> Your DB then shares a WAL log with your queue...

I feel like the understanding we can come to here is that we have differing definitions of necessary complexity.

You did mention you work in data analytics, and I have worked and am working in more traditional product-oriented gigs. Everything you mentioned is nice, impressive even, but to me these are toys. I don't need a queue with a WAL stream, or PITR backups. Queues, to me, are inter-process communication mechanisms, or a means to concurrency. In fact, worst case, you can delete the queues, or even restart Redis without triggering a disk write (though I note this is another feature that comes free with Redis); it would inconvenience our customers, but they can always just retry.

Of all the benefits you mentioned, leader-election is the only one I could make a good case for.

> then you may as well understand it deeply in order to wring the most productive use you can out of each OpEx dollar/man-hour spent.

Understanding it is one thing, but using that understanding to create/develop (CapEx) and then maintain (OpEx) an in-DB queue feels like an unnecessary cost. The CapEx to install Redis is practically nil, and you seem to agree that running it is far simpler OpEx than PG too ("Well, yeah, but...").

Of course, I keep in mind your point about libraries doing the dirty work. But if it all boils down to a third party in the end, isn't the point I'm replying to above rather moot? And if it comes down to a third-party dependency anyway, I've already reasoned out my preference for an actual queue solution.

> explained in a vendor-neutral manner: SQL DDL

> You can if those features are in the SQL standard. I’ve never heard of a DBMS regressing on its level of SQL standard support.

While this is the ideal case, my experience so far shows that every major DB package relies on its own tweaks to the SQL standard, so "if those features are in the SQL standard" turns out to be a pretty big "if". I don't worry about vendors regressing on SQL-standard support, but rather that the libraries for DB queues rely on non-standard features/behavior to approximate full queue functionality; non-standard behavior that has no guarantee of consistency across versions.

I mean, if there were a DB-queue middleware/library that worked across DB vendors, be it Postgres, Oracle, Maria, or MySQL (with a reasonable storage engine, perhaps), then that would be a convincing argument that the SQL standard can support queuing sufficiently. Otherwise, this counterpoint is a hard sell for me.

replies(1): >>derefr+yh
4. derefr+yh[view] [source] [discussion] 2021-06-12 23:10:10
>>skytre+ob
> it would inconvenience our customers but they can always just retry.

We have very different things in our queues :) An MQ, to me, is a place to put stuff that needs reliable, at-least-once delivery. Like, say, invoices. If the user never receives one, bad things happen.

There's very little difference between that type of data and the data in your DB while it's "active". It's only once the data in the queue is fully ACKed and no longer relevant that it behaves more ephemerally than first-class DB state.

> and you seem to agree that running it is far simpler OpEx than PG too

I did not say that. In our stack, we run a Redis cluster on top of Kubernetes, because it's nominally semi-stateless. In reality, it's our biggest SPOF. It fights the container daemon over memory allocations (why does a Redis container that uses 2GB of memory steady-state need a 64GB memory limit to finish starting up!?) and has weird dynamics around startup readiness (e.g. Redis started from a large AOF-with-RDB-preamble file will serve requests for just a moment, then immediately stop again to finish loading; this confuses k8s into thinking it's done with startup probes and ready for readiness probes, which it's not).

Redis, when configured to be both highly available and durable (which is not your use-case, I know), is really just less robust than a DB, with less management tooling, fewer docs, fewer people who know how to scale it, fewer third-party solutions pre-adapted to expect it, etc. Because, to paraphrase: Redis isn't really in the business of being a database. And yet that's exactly what you end up with when you try to use Redis as a reliable at-least-once MQ: a database. Just, a kinda sucky one. (One that doesn't even have a concept of users with different privileges, let alone RBAC!)

And yet that actually is the niche of Redis: it's the durable, client-server cousin to in-process peer-mesh state-sync solutions like Hazelcast/Mnesia. Which makes it kind of weird that it's not particularly tuned for solving the problems in the one niche unique to it.

(I have nothing against Redis! It has non-unique use-cases where it works great, e.g. when it's being used as "memcached with more data structures", or as a coordination/locking server instead of Zookeeper, etc. A robust reliable MQ it is not. My argument is that your average RDBMS is closer to being a robust, reliable MQ than Redis is. Though, of course, you'll get the best robustness+reliability for that use-case out of a "real" MQ — Kafka or NATS or something.)

> but rather that the libraries for DB queues are relying on the non-SQL-standard features/behavior to mock out a full queue functionality

They don't have to, because everything you need to implement reliable queuing semantics is already part of [the lowest-common-denominator part of the] SQL standard [that everything that calls itself an RDBMS implements.]
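
Concretely, a minimal sketch of what I mean (the schema and names here are mine for illustration, not any particular library's, and vendors spell a few of these clauses slightly differently):

    -- a hypothetical queue table
    CREATE TABLE jobs (
        id      INTEGER PRIMARY KEY,
        payload VARCHAR(4000) NOT NULL,
        status  VARCHAR(10) DEFAULT 'pending' NOT NULL
    );

    -- producer: enqueueing is just an INSERT
    INSERT INTO jobs (id, payload) VALUES (1, 'send invoice 42');

    -- consumer: claim, work, and ACK inside one transaction
    START TRANSACTION;
    SELECT id, payload
      FROM jobs
     WHERE status = 'pending'
     ORDER BY id
     FETCH FIRST 1 ROW ONLY
       FOR UPDATE;              -- lock the row against other workers
    -- ... do the actual work here ...
    UPDATE jobs SET status = 'done' WHERE id = 1;
    COMMIT;
    -- a crash before COMMIT rolls back: the row unlocks, the job is
    -- still 'pending', and another worker picks it up. That's
    -- at-least-once delivery from ordinary transaction semantics.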

These libs have DB-specific adapters for two reasons:

1. There's no such wire protocol as "SQL" — every RDBMS speaks its own wire protocol to convey your SQL statements to the database and convey result-sets back; and so, unless you're using a language with a common API abstraction for RDBMS client libs to be built against (e.g. JDBC), then each such client lib ends up presenting its own distinct API to callers, which needs a separate adapter written for it.

2. With SQL, reliable queuing is "easy", but finding out about newly enqueued items (with as little overhead as possible, scanning as few additional rows as possible) is hard. SQL says nothing about how to get push-notification-style information out of the DB, so every DBMS implements its own mechanism for this. These are pure optimizations using purely ephemeral code, though; the schema of the queue in the DB remains the same either way, such that the data could be exported from one RDBMS to a .sql file, imported from that .sql file into a different RDBMS, and the only thing needed on the client side would be switching adapters.

(In theory, presuming a common wire-protocol abstraction interface lib like JDBC, you could have a "standard SQL" adapter. It'd just be annoyingly low-throughput. Like using HTTP polling when you could be using WebSockets.)
