Understanding Kafka with Factorio (2019) (ruurtjan.medium.com)
133 points by pul 63 days ago | 72 comments



Here are the HN discussions of Bartosz Milewski’s analysis of Factorio, where he shows functional counterparts of Factorio’s patterns in Haskell:

https://news.ycombinator.com/item?id=26157969

https://news.ycombinator.com/item?id=29299140


There must be something about Kafka that attracts these kinds of explanations. Another one a few months back was a children's book on Kafka [1]. To me it just looks like a solution looking for actual problems.

I wonder if Kafka represents an existential angst in these Kubernetized Microservice times. Or is it, more simply, that I am just too dumb to learn and use this shit correctly?

1. https://news.ycombinator.com/item?id=27541339


When you are wondering whether you might need Kafka, it is certain that you don't need it.

But there are times when you have a problem, and amongst the possible solutions is Kafka.

I've come across Kafkaesque problems only three times in the last seven years: a hosting platform that had to parse the logs of over 700 WordPress sites for security and other business logic, putting all the events of a financial app backend into data lakes, and filtering and parsing all OpenStreetMap changesets live.


I work in the oil and gas industry, where legacy systems run on their last breath. Kafka is a fantastic tool and solves a shit ton of problems. We have millions of sensors on an offshore installation; these all send data into Kafka, where we generate events on new topics from different timeseries. Other data services consume these topics and get data updated in near real time.

No more daily SQL dumps from offshore to onshore and big batch procedures to generate outdated events.
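The pattern is roughly this. A minimal sketch in Python using the confluent-kafka client (topic names, fields, and the threshold are made up for illustration):

    import json
    from confluent_kafka import Consumer, Producer

    # consume raw sensor readings and emit derived events to a new topic
    consumer = Consumer({
        'bootstrap.servers': 'broker:9092',   # hypothetical broker address
        'group.id': 'pressure-event-builder',
        'auto.offset.reset': 'earliest',
    })
    consumer.subscribe(['sensor-readings'])   # hypothetical topic
    producer = Producer({'bootstrap.servers': 'broker:9092'})

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        reading = json.loads(msg.value())
        if reading['pressure'] > 200:          # made-up business rule
            producer.produce('pressure-alerts', json.dumps(reading))
            producer.poll(0)                   # serve delivery callbacks

Every consumer group gets its own offsets, so new services can subscribe to the derived topic without touching the ingest path.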


Sounds like you have Serious Problems, for which Kafka is a very good solution.

For me, Kafka sits in the same area of solutions as Kubernetes, Hadoop clusters, or anything "webscale": you don't need it. Until you do, but by then you'll (i) have Serious Problems which such systems solve and (ii) the manpower and budget to fix them.

By which I don't mean you should avoid Kafka at all costs. By all means, play around with it: if anything, the event-driven mindset will teach you things that make you a better Rails/Flask/WordPress developer, if that is what you do.


I'm in the same situation in the paper making industry. Kafka is an almost perfect match for our needs: high volume, durable storage, decoupled stream processing.


What legacy systems is the oil and gas industry using? MQTT? OPC-DA? OPC-UA?


Not sure I agree. It seems as good a way as any to decouple systems, asynchronously exchange messages between services, get them into durable storage, and get exactly-once processing, replay semantics, etc. I think it should be on the table at least whenever we have two services needing to exchange data.
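Replay in particular is straightforward: a consumer can rewind its offsets and re-read a topic from the start. A sketch with the confluent-kafka Python client (broker, group, and topic names are placeholders):

    from confluent_kafka import Consumer, OFFSET_BEGINNING

    consumer = Consumer({
        'bootstrap.servers': 'broker:9092',
        'group.id': 'replayer',
        'auto.offset.reset': 'earliest',
    })

    def rewind(c, partitions):
        # on_assign callback: restart every assigned partition from the beginning
        for p in partitions:
            p.offset = OFFSET_BEGINNING
        c.assign(partitions)

    consumer.subscribe(['orders'], on_assign=rewind)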

Maybe a few use cases could be switched out for direct API calls, but I think Kafka hits the sweet spot in many situations.

What alternatives would you be looking at?


Some alternatives are:

* Just keep your architecture a monolith. You'll do fine in the majority of cases.

* Event sourcing doesn't require Kafka clusters. Nor do event-driven setups. You don't need complex tooling to pass around strings/JSON blobs. An S3 bucket or a PostgreSQL database storing "events as JSON" is often fine (see the sketch below).

* Postgres can do most of what you need (except for the "webscale" clustering etc)[0] in practice already.

* Redis[1]

My main point is that while Kafka is a fantastic tool, you don't need that tool to achieve what you want in many cases.
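To illustrate the Postgres option: an events-as-JSON table is just an append-only log with an auto-incrementing id. A minimal sketch with psycopg2 (table, topic, and payload are invented):

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id      bigserial PRIMARY KEY,
                topic   text NOT NULL,
                payload jsonb NOT NULL,
                created timestamptz NOT NULL DEFAULT now()
            )""")
        # producing an event is just an INSERT
        cur.execute("INSERT INTO events (topic, payload) VALUES (%s, %s)",
                    ("order-placed", json.dumps({"order_id": 42})))

    # a consumer only has to remember the last id it processed
    last_seen = 0
    with conn, conn.cursor() as cur:
        cur.execute("SELECT id, payload FROM events WHERE id > %s ORDER BY id",
                    (last_seen,))
        for event_id, payload in cur:
            print(payload)        # stand-in for real event handling
            last_seen = event_id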

> It seems as good a way as any to decouple systems

IMO relying on a tool, rather than design patterns, to achieve a good software design is a recipe for trouble. If anything, because it locks you in (do you suddenly get a tightly coupled system if you remove Kafka?) or because its details force you into directions that don't naturally fit your domain or problem.

--

[0] https://spin.atomicobject.com/2021/02/04/redis-postgresql/

[1] https://redis.com/redis-best-practices/communication-pattern... etc.


It is as good as any if you have no financial constraints, no technical overhead, and no time constraints, but everyone has those.

Kafka is one of those systems that needs to be justified by out-scaling other solutions that don't come wedded to all its baggage.


What would you say is its baggage?


For me, the baggage is mostly the complexity of the service. With that comes monitoring, maintenance, tuning, debugging and troubleshooting.

Lessened somewhat with SaaS products like Amazon Kinesis (technically not a Kafka, but close).

Another "baggage" is that an event-driven setup is eventually consistent (and async) by nature. If your software already is eventually consistent, this is not a problem. But it is a huge change if you come from a blocking/simple "CRUD" setup.


I second Kafka having massive operational overhead. It's a burden and is killing any support for it within our org.


I’ve seen comments like the gp’s often enough that it strikes me as a form of gatekeeping.

MY problems are so special that my use of Kafka was perfect, but YOURS are trivial and you shouldn’t even consider Kafka.


> it strikes me as a form of gatekeeping

I consider this a form of gatekeeping of advice on using Kafka.


Sometimes I wonder if I'm the crazy one. Kafka seems to me to be the only sensible foundational datastore out there: it can maintain and propagate a log with all the properties you would want a datastore to have. Relational databases seem to be a crazily overengineered solution in search of a problem, with incredibly poor reliability properties (essentially none of them are true master-master HA out of the box, and they tend to require significant compromises to make them so) to boot.


Both are good for different purposes.

Want to store data, query it in arbitrary and hard-to-foresee ways, and also easily tune performance for these queries? Relational datastore it is.

Want to have ACID? Well, relational datastore it is.

Kafka is not the right solution for these problems.


> Relational databases seem to be a crazily overengineered solution in search of a problem

I mean, who in their right mind would want to:

- have a snapshot of data

- query data, including ad-hoc querying

- query related data

- have transactional updates to data

When all you need is an unbounded stream of data that you need to traverse in order to do all these things.


> - have a snapshot of data

Being able to see a snapshot is good, and I would hope to see a higher-level abstraction that can offer that on top of something Kafka-like. But making the current state the primary thing is a huge step backwards, especially when you don't get a history at all by default.
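In other words, a snapshot is just a fold over the log. A toy Python sketch (the event shape is invented):

    # rebuild "current stock" by folding over the event log
    def snapshot(events):
        state = {}
        for e in events:
            if e["type"] == "stock-added":
                state[e["sku"]] = state.get(e["sku"], 0) + e["qty"]
            elif e["type"] == "stock-removed":
                state[e["sku"]] = state.get(e["sku"], 0) - e["qty"]
        return state

    log = [{"type": "stock-added", "sku": "A1", "qty": 5},
           {"type": "stock-removed", "sku": "A1", "qty": 2}]
    assert snapshot(log) == {"A1": 3}

The history is primary; any number of snapshot views can be derived from it and thrown away.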

> - query data, including ad-hoc querying

OK, fair, ad-hoc queries are one thing that relational databases are legitimately good at. Something that can maintain secondary indices and do query planning based on them is definitely useful. But you're asking for trouble if you use them in your live dataflow or allow ad-hoc queries to write to your datastore.

> - have transactional updates to data

I do think this one is genuinely a mistake. What do you do when a transaction fails? All of the answers I've heard imply that you didn't actually need transactions in the first place.


> But making the current state the primary thing is a huge step backwards

Yeah, who could need to know exactly how many items of a particular product they have in stock currently, or how much money a customer has in her account at the particular moment she wants to do a withdrawal? It's really hard to come up with any useful real world examples when this could be the case.

> What do you do when a transaction fails?

It depends on why the transaction fails and in which way. But sometimes it is really useful to make sure that when one account is debited, another one is credited at the same time.


> Yeah, who could need to know exactly how many items of a particular product they have in stock currently, or how much money a customer has in her account at the particular moment she wants to do a withdrawal? It's really hard to come up with any useful real world examples when this could be the case.

Having access to the current state of the world is useful, having a log of what happened / how it got that way is essential. You've got to get the foundations right before you build a monumental edifice on top.


> But making the current state the primary thing is a huge step backwards

Why?

When is "I need to query all of my log to get the current view of the data" a step forward? All businesses operate on the current view of data.

> OK, fair, ad-hoc queries are one thing that relational databases are legitimately good at.

Not just ad-hoc queries. Any queries.

> But you're asking for trouble if you use them in your live dataflow or allow ad-hoc queries to write to your datastore.

In our "live dataflows" etc. we use a pre-determined set of queries that are guaranteed to run multiple orders of magnitude faster in a relational database on the current view of data than having to reconstruct all the data from an unbounded stream of raw events.

> What do you do when a transaction fails?

I roll back the transaction. As simple as that.


> All businesses operate on the current view of data.

All businesses operate in response to events. Most of the things you do are because x happened rather than because the current state of the world is y.

> In our "live dataflows" etc. we use a pre-determined set of queries that are guaranteed to run multiple orders of magnitude faster in a relational database on the current view of data than having to reconstruct all the data from an unbounded stream of raw events.

If you have a pre-determined set of queries, you can put together a corresponding set of stream transformations that will compute the results you need much faster than querying a relational database.
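For example (a toy sketch, with an invented event shape): instead of re-querying for a balance, keep it materialized and update it as each event arrives:

    # materialized view: account -> balance, maintained per event
    balances = {}

    def apply(event):
        acct = event["account"]
        balances[acct] = balances.get(acct, 0) + event["delta"]

    for event in [{"account": "alice", "delta": 100},
                  {"account": "alice", "delta": -30}]:
        apply(event)
    assert balances["alice"] == 70   # O(1) read, no log scan, no query planner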

> I roll back the transaction. As simple as that.

And then what, completely discard the attempt without even a record that it happened?


> All businesses operate in response to events.

Yes, but once an event happens, the business needs access to the current state of the data.

> If you have a pre-determined set of queries, you can put together a corresponding set of stream transformations that will compute the results you need much faster than querying a relational database.

No, it won't. Because you won't be able to run "a corresponding set of transformations" on, say, a million clients.

You can, however, easily query this measly set on a laptop with an "overengineered" relational database.

> completely discard the attempt without even a record that it happened?

Somehow in your world audit logging doesn't exist.


> No, it won't. Because you won't be able to run "a corresponding set of transformations" on, say, a million clients.

Of course you can. It's a subset of the same computation, you're just doing it in a different place.

> Somehow in your world audit logging doesn't exist.

If you have to use a separate "audit logging" datastore to augment your relational database then I think you've proven my point.


I would like to disagree. In my experience, eventing/CQRS are wonderful solutions to a set of problems (especially where event-by-event playback is a primary functionality). In most other cases it's overkill, and maintaining a snapshot of state, which like you said is inevitable even in the event-log case, is imperative.

There are just too many scenarios where not having transactions is dog slow or really, really unwieldy.


In my experience, when you do something that requires transactions, i.e. some complicated calculation based on the current state of the world that you can't reduce to a sequence of events and transformations between them, you always end up regretting it. Almost by definition, you can't reproduce what the transaction was supposed to do, and if there are bugs in your logic then you can't fix them; often you can't even detect that they happened.


> > No, it won't. Because you won't be able to run "a corresponding set of transformations" on, say, a million clients.

> Of course you can.

Of course, you can't. Because you can't run a million transformations. Whereas querying specific data for any of the one million clients? It's trivial on a relational database.

Moreover, if you need new queries into the data, it's again trivial. Because you have the current view of your data, you don't need to recalculate everything from the beginning of time just because your requirements ever so slightly changed.

> If you have to use a separate "audit logging" datastore to augment your relational database then I think you've proven my point.

No, I haven't.

It's funny, however, that you think that businesses don't require a current view of data and need to re-calc everything from scratch.


You think transactional updates are a mistake? I guess you’ve never had to worry about the integrity of your data?

Bank -> debit card purchase -> perform all required database work in a transaction -> transaction fails -> decline debit card purchase

Without transactions, in this scenario, maybe the debit card transaction fails but money is still taken out of your account? Doesn’t sound very pleasant.
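Roughly what that looks like with psycopg2 (account ids and the amount are placeholders):

    import psycopg2

    class InsufficientFunds(Exception):
        pass

    conn = psycopg2.connect("dbname=bank")   # hypothetical database
    try:
        with conn:  # commits on success, rolls back on any exception
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE accounts SET balance = balance - %s "
                    "WHERE id = %s AND balance >= %s",
                    (25, "card-holder", 25))
                if cur.rowcount == 0:
                    raise InsufficientFunds   # nothing gets committed
                cur.execute(
                    "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (25, "merchant"))
    except InsufficientFunds:
        print("decline the card purchase")    # no money moved either way

Either both updates happen or neither does; the database enforces that, not your application code.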


Database transactions would be useless for banks: you'll notice that if you try to pay while not having enough money in your account, the attempt doesn't simply disappear. What banks actually do is something akin to what Kafka-based systems do: the attempt to charge is recorded in a ledger, and then fulfilment of that happens asynchronously.


That's seriously one of the worst takes I've seen on this website.

> the attempt to charge is recorded in a ledger

Hint: how do you think this attempt is recorded and fulfilled? Or do you think "it's just appended" and the bank recalculates your balance from scratch every time you spend $1 on a can of coke?

The only bank I've heard of that's not using a traditional relational database for its ledger is Monzo [1], but they still use Cassandra's transactions.

[1] https://www.scaleyourapp.com/an-insight-into-the-backend-inf...


> Hint: how do you think this attempt is recorded and fulfilled? Or do you think "it's just appended" and the bank recalculates your balance from scratch every time you spend $1 on a can of coke?

That's how the bank I worked with did it. Of course there was caching in place so we didn't actually recompute everything every time, but the implementation of that was a lot closer to "commit a Kafka offset" than an RDBMS-style transaction. (E.g. we didn't overwrite the "current balance" in place; we appended a new "current balance as of time x".)
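Schematically (table and column names invented):

    import psycopg2

    conn = psycopg2.connect("dbname=bank")   # hypothetical
    with conn, conn.cursor() as cur:
        # never UPDATE the balance in place; append a new "as of" row
        cur.execute(
            "INSERT INTO balances (account_id, balance, as_of) "
            "VALUES (%s, %s, now())",
            ("acct-1", 70))
        # the "current balance" is simply the latest appended row
        cur.execute(
            "SELECT balance FROM balances WHERE account_id = %s "
            "ORDER BY as_of DESC LIMIT 1",
            ("acct-1",))
        print(cur.fetchone()[0])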


> Of course there was caching in place so we didn't actually recompute everything every time

I think you've proved our point


Every large ingest app I have worked on had something akin to Kafka, from raw 3D seismic broadcast via satellite, to RF tower motion detection, to carrier-grade cellular billing. The common denominator was a replayable ingest queue. Yes, Kafka is a "great" idea. However, it is not a replacement for querying.


The reason these things are called "transactions" is that they were invented for banks. Bank ledgers must always be appended in pairs or not at all, never in any other number. The entire finance system is dependent on that feature.

Transactions were kept by humans, literally for a few centuries, before the algorithm was adapted for computers.


RDBMS "transactions" work nothing like banking transactions; when you abandon a banking transaction you don't erase it as though it never happened, you keep a full record of what happened but the downstream consequences change. In fact the way traditional accounting is done, with transactions first committed to journals and then later asynchronously propagated into ledgers, is more closely akin to how Kafka-based systems work.


This is missing the whole point of Kafka consumers, which is to consume data from a topic and do something with it.

One of those things being “store it in a relational model” or “write a sum to a key value store” or something else.

This ability comes for free with Kafka, but is very not-free when using a relational model.


They were answering this statement:

> Relational database seem to be a crazily overengineered solution in search of a problem

Why would an answer to that need to mention Kafka consumers?


*When all you need is an unbounded stream of data that you need to traverse in order to do all these things.*

This is the part I was responding to.


You're talking about "putting a snapshot of the data somewhere" - the person you're replying to is replying to someone who says this sort of snapshot is pointless.


> This is missing the whole point of Kafka consumers

It's not missing that because it doesn't even address that. I'm answering a specific point.


You can efficiently do something with Kafka that is equivalent to maintaining relational tables and joining them? Seriously?


The otters definitely overengineered the living crap out of their party notifications.


Turns out some otters did have doubts. But "if LinkedIn uses Kafka, it must be Good" clinched the deal.


In our org, all the website events are pushed into Kafka, and it has enabled all the teams to develop real-time event-driven applications using these events without needing any access to the website database or any infra that handles the main website. It's truly a blessing in large organisations.


This is exactly what I thought. Maybe it's because it's popular or maybe it's because the docs are notoriously bad. Or both.


The same can be said about microservices.


This analogy is as thin and trivial as it is disappointing. Those concepts could have been, and have been, explained understandably in a single paragraph. I don't even see the appeal. This analogy is not providing any new or interesting insights whatsoever.


If anyone is starting a new project, I'd recommend looking into Apache Pulsar [0]. It has all the good parts of Kafka with a lot more features that are useful when scaling.

0: https://pulsar.apache.org/


I've been looking for a non-JVM replacement for Kafka, though I might also give this a try.



Ah yes, fear, uncertainty & doubt. The number one, two, and three arguments (not necessarily in that order) against progress and improvement.


I am not kidding when I say that you could probably use Factorio in your technical interviews as a company, lol.


There was a post about that in HN a while back. https://news.ycombinator.com/item?id=26591966


I missed that back then. Interesting idea, but I'm a senior developer with 20 years of experience, and my 12 year old son is better at Factorio than I am, so I'm not so sure about this.

But it's interesting to compare his Factorio style to mine. Or my Factorio style to my regular programming style. They're very different.


You won't get good at the game if you don't have any aptitude for programming. But being bad at the game means nothing, and being good at it certainly doesn't reflect any expertise, just basic aptitude.

It also only reflects one of the many tasks a developer usually does as an employee, namely creating algorithms. For a senior developer, that's probably not an apt task for differentiation, because it's quite bounded.


LinkedIn resume with total hours played


Everything with Factorio in name gets my upvote.


Kafka is awesome, but I have one major gripe with it. It gives a solid interface to JVM applications. But if your application is outside of the JVM and you want it to consume from a topic, it's a terrible experience.


I have found clients based on librdkafka [1] or sarama [2] to be pretty robust and capable. For SSL, librdkafka relies heavily on OpenSSL, which can cause a few headaches IME. But the rest is straightforward.

[1] https://github.com/edenhill/librdkafka [2] https://github.com/Shopify/sarama
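For what it's worth, the confluent-kafka Python package is a thin wrapper around librdkafka; here is a minimal consumer with the SSL settings that tend to cause those headaches (broker, group, topic, and file paths are placeholders):

    from confluent_kafka import Consumer  # wraps librdkafka

    consumer = Consumer({
        'bootstrap.servers': 'broker:9093',
        'group.id': 'my-service',
        'auto.offset.reset': 'earliest',
        # the OpenSSL-backed part that usually needs fiddling:
        'security.protocol': 'SSL',
        'ssl.ca.location': '/etc/ssl/ca.pem',
        'ssl.certificate.location': '/etc/ssl/client.pem',
        'ssl.key.location': '/etc/ssl/client.key',
    })
    consumer.subscribe(['my-topic'])

    msg = consumer.poll(5.0)
    if msg is not None and not msg.error():
        print(msg.value())
    consumer.close()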


They say that to get all the subtleties of Kafka, you should read the original German ;)


But Kai Waehner blogs in English :)


Beautiful explanation through analogies, and graphics supporting it. Thanks a lot!


This is a wonderful format for explaining software engineering practices and I hope to see more like this.


Very useful article! Awesome job


I presume the percentage of HN users who know this is about Apache Kafka is higher than the percentage who think this is about Franz Kafka. :-)

Related HN discussion of this [1]

[1] https://news.ycombinator.com/item?id=29296969


The Kafka subreddit is good for chuckles because at least once a month someone happens by who is looking for help with their deployment. It's a Kafkaesque loop.


I genuinely thought it was going to be an introduction to Franz Kafka.


Me too, and I came here because I thought it was almost impossible to explain Kafka with Factorio (can one explain Kafka at all?)

But yeah, parallel programming is easy to explain with Factorio.


"Kafka" *really* doesn't seem like a good name for some software that is supposed to be good / helpful.


It really does not matter.

Who in their right mind would conflate the two?


I opened the thread thinking it was about the guy, not that I know a lot about either Kafka. I know what Factorio is, but I thought maybe it'd be philosophical and use analogies. Like if someone wrote about understanding feudalism through Animal Crossing.



