zlacker

[return to "Understanding Kafka with Factorio (2019)"]
1. geodel+nkb[view] [source] 2021-11-26 06:34:22
>>pul+(OP)
Must be something about Kafka that attracts these kinds of explanations. Another one a few months back was a children's book on Kafka [1]. For me it just looks like a solution looking for actual problems.

I wonder if Kafka represents an existential angst in these Kubernetized Microservice times. Or is it, more simply, that I am just too dumb to learn and use this shit correctly?

1. https://news.ycombinator.com/item?id=27541339

◧◩
2. lmm+ltb[view] [source] 2021-11-26 08:57:57
>>geodel+nkb
Sometimes I wonder if I'm the crazy one. Kafka seems to me to be the only sensible foundational datastore out there: it can maintain and propagate a log with all the properties you would want a datastore to have. Relational databases seem to be a crazily overengineered solution in search of a problem, with incredibly poor reliability properties to boot (essentially none of them are true master-master HA out of the box, and they tend to require significant compromises to make them so).
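
To be concrete about what I mean by "maintain and propagate a log": the foundation is just append and replay. A rough sketch with the kafka-python client (broker address, topic name and payload are made-up placeholders):

    # pip install kafka-python
    from kafka import KafkaProducer, KafkaConsumer

    # Append events to the log (the "maintain" half).
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("account-events", b'{"account": "a1", "delta": 100}')
    producer.flush()

    # Replay the log from the beginning (the "propagate" half);
    # any number of independent consumers can each keep their own position.
    consumer = KafkaConsumer(
        "account-events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating if no new events arrive
    )
    for record in consumer:
        print(record.offset, record.value)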
◧◩◪
3. dmitri+UAb[view] [source] 2021-11-26 10:27:38
>>lmm+ltb
> Relational database seem to be a crazily overengineered solution in search of a problem

I mean, whoever in their right mind would want to:

- have a snapshot of data

- query data, including ad-hoc querying

- query related data

- have transactional updates to data

When all you need is an unbounded stream of data that you need to traverse in order to do all these things.

◧◩◪◨
4. lmm+ACb[view] [source] 2021-11-26 10:52:15
>>dmitri+UAb
> - have a snapshot of data

Being able to see a snapshot is good, and I would hope to see a higher-level abstraction that can offer that on top of something Kafka-like. But making the current state the primary thing is a huge step backwards, especially when you don't get a history at all by default.
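
To sketch what I mean by a higher-level abstraction: fold the event log into a derived "current state" view and treat that view as disposable output, not as the source of truth. Rough Python, with event shapes I've invented for illustration:

    # Derive a current-state snapshot by folding over the event log.
    # Event types and fields here are made up for the example.
    events = [
        {"type": "account_opened", "account": "a1", "balance": 0},
        {"type": "deposited",      "account": "a1", "amount": 100},
        {"type": "withdrawn",      "account": "a1", "amount": 30},
    ]

    def apply(state, event):
        # Pure function: replaying the log always rebuilds the same snapshot.
        if event["type"] == "account_opened":
            state[event["account"]] = event["balance"]
        elif event["type"] == "deposited":
            state[event["account"]] += event["amount"]
        elif event["type"] == "withdrawn":
            state[event["account"]] -= event["amount"]
        return state

    snapshot = {}
    for e in events:
        snapshot = apply(snapshot, e)

    print(snapshot)  # {'a1': 70} -- the current view, and the history is still there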

> - query data, including ad-hoc querying

OK, fair, ad-hoc queries are one thing that relational databases are legitimately good at. Something that can maintain secondary indices and do query planning based on them is definitely useful. But you're asking for trouble if you use them in your live dataflow or allow ad-hoc queries to write to your datastore.

> - have transactional updates to data

I do think this one is genuinely a mistake. What do you do when a transaction fails? All of the answers I've heard imply that you didn't actually need transactions in the first place.

◧◩◪◨⬒
5. dmitri+7Eb[view] [source] 2021-11-26 11:15:14
>>lmm+ACb
> But making the current state the primary thing is a huge step backwards

Why?

When is "I need to query all of my log to get the current view of data" is a step forward? All businesses operate on the current view of data.

> OK, fair, ad-hoc queries are one thing that relational databases are legitimately good at.

Not just ad-hoc queries. Any queries.

> But you're asking for trouble if you use them in your live dataflow or allow ad-hoc queries to write to your datastore.

In our "live datafows" etc. we use a pre-determined set of queries that are guaranteed to run multiple orders of magnitude faster in a relational database on the current view of data than having to reconstruct all the data from an unbounded stream of raw events.

> What do you do when a transaction fails?

I roll back the transaction. As simple as that.
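
Roughly this, sketched with sqlite3 (the transfer logic is invented for the example): nothing from the failed attempt ever becomes visible, and the data stays consistent.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    db.execute("INSERT INTO accounts VALUES ('a1', 70), ('a2', 250)")

    try:
        with db:  # commits on success, rolls back automatically on exception
            db.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'a1'")
            db.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'a2'")
            (bal,) = db.execute("SELECT balance FROM accounts WHERE id = 'a1'").fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
    except ValueError:
        pass

    # Both updates were rolled back together; the snapshot is still consistent.
    print(db.execute("SELECT * FROM accounts ORDER BY id").fetchall())
    # [('a1', 70), ('a2', 250)]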

◧◩◪◨⬒⬓
6. lmm+gGb[view] [source] 2021-11-26 11:41:01
>>dmitri+7Eb
> All businesses operate on the current view of data.

All businesses operate in response to events. Most of the things you do are because x happened rather than because the current state of the world is y.

> In our "live datafows" etc. we use a pre-determined set of queries that are guaranteed to run multiple orders of magnitude faster in a relational database on the current view of data than having to reconstruct all the data from an unbounded stream of raw events.

If you have a pre-determined set of queries, you can put together a corresponding set of stream transformations that will compute the results you need much faster than querying a relational database.
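
i.e. instead of re-running "SELECT ... GROUP BY" against the current table each time, you keep the answer up to date as each event arrives. A rough sketch of that kind of transformation, in plain Python rather than a real stream processor like Kafka Streams (the events are made up):

    from collections import defaultdict

    # Running result of one pre-determined query: total spend per customer.
    totals = defaultdict(int)

    def on_event(event):
        # Update the derived result incrementally as each purchase event arrives.
        totals[event["customer"]] += event["amount"]

    stream = [
        {"customer": "c1", "amount": 30},
        {"customer": "c2", "amount": 5},
        {"customer": "c1", "amount": 12},
    ]
    for e in stream:
        on_event(e)

    print(dict(totals))  # {'c1': 42, 'c2': 5} -- already computed by the time you ask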

> I roll back the transaction. As simple as that.

And then what, completely discard the attempt without even a record that it happened?

◧◩◪◨⬒⬓⬔
7. dmitri+uHb[view] [source] 2021-11-26 11:50:06
>>lmm+gGb
> All businesses operate in response to events.

Yes, but once an event happens, the business needs access to the current state of the data.

> If you have a pre-determined set of queries, you can put together a corresponding set of stream transformations that will compute the results you need much faster than querying a relational database.

No, it won't, because you won't be able to run "a corresponding set of transformations" on, say, a million clients.

You can, however, easily query this measly set on a laptop with an "overengineered" relational database.

> completely discard the attempt without even a record that it happened?

Somehow in your world audit logging doesn't exist.

◧◩◪◨⬒⬓⬔⧯
8. lmm+oKb[view] [source] 2021-11-26 12:16:34
>>dmitri+uHb
> No, it won't. Because you won't be able to run "a corresponding set of transformations" on, say, a million clients.

Of course you can. It's a subset of the same computation, you're just doing it in a different place.

> Somehow in your world audit logging doesn't exist.

If you have to use a separate "audit logging" datastore to augment your relational database then I think you've proven my point.

◧◩◪◨⬒⬓⬔⧯▣
9. eklavy+zPb[view] [source] 2021-11-26 13:03:44
>>lmm+oKb
I would like to disagree. In my experience eventing/CQRS are wonderful solutions to a particular set of problems (especially where event-by-event playback is a primary requirement). In most other cases it's overkill, and maintaining a snapshot of state, which, like you said, is inevitable even in the event-log case, is imperative.

There are just too many scenarios where not having transactions makes things dog slow or really, really unwieldy.

◧◩◪◨⬒⬓⬔⧯▣▦
10. lmm+6pi[view] [source] 2021-11-29 01:10:12
>>eklavy+zPb
In my experience when you do something that requires transactions - i.e. some complicated calculation based on the current state of the world that you can't reduce to a sequence of events and transformations between them - you always end up regretting it. Almost by definition, you can't reproduce what the transaction was supposed to do, and if there are bugs in your logic then you can't fix them; often you can't even detect that they happened.