Hacker News new | past | comments | ask | show | jobs | submit login
An opinionated map of incremental and streaming systems (scattered-thoughts.net)
163 points by mpweiher 11 days ago | hide | past | favorite | 27 comments

Great taxonomy breakdown.

The low temporal locality part of the design space is particularly interesting. It is extremely general but for many data models "low temporal locality" can mean highly variable windows on a per entity basis and not on the entire collection, which is a challenging criteria for the way most systems are designed. Physical world sensor observation data models are a good example of this. Existing solutions for low temporal locality data models scale quite poorly in practice, even on systems nominally designed for them.

At the limit, low temporal locality platforms converge on the same unusual design requirements as symmetric mixed-workload database kernels. The latter is something we really don't (know how to) build, since it has some interesting computer science constraints. It is a missing piece of technology to make this stuff really work well at scale.

I like the unstructured/structured boundary as it mirrors the "do you use a library" (some Rx) or "do you use a client" (like Kafka).

I'm writing a programming-language ( http://www.adama-lang.org/ ) which is reactive, and I'm currently looking at how do I optimize (prematurely, of course). I believe a combination of static analysis and a good runtime working together are the future (at least, for me).

For instance, I currently statically optimize "iterate _table where id == x" to "_table.get(x)", and this was a good optimization. I use the same analysis to introduce some limited indices.

The paradox for my use case (board games) is that going further isn't really needed, and the overhead of optimization starts to cost more than brute force table scans. However, I'm excited about this approach beyond board games.

No mention of Apache Spark? Thinking about where Spark would fit in here, because it has both structured and unstructured APIs--I think it would fall under most of these categories depending how you use, going back to the comment in the article:

> Most of these systems are equally expressive and can emulate each other

But a quick stab at how I think it could fall under this taxonomy:


    If using the DataFrame API


    If using the DStream API


    [low temporal locality]

    If you don't watermark and let the state grow--which I don't have much experience with,
    but I think could possibly run into scalability issues, you'd have to write some
    imperative logic to cull state--but it's very necessary for low temporal locality.

    [high temporal locality]

    If using watermarks, it will close the and take the output for you


    [internally consistent / internally inconsistent]

    Not really sure how Spark falls here, I think if you use the DataFrame API, you can have
    it be consistent, but a lot of operations aren't supported, so you may end up having to
    switch to DStreams and write code imperatively.

    And then further, what if one of the streams is processed faster than each the other in
    an aggregation?  I'm not sure if there's a way to specific business rules around making
    things atomic--but I'm sure you could hack it if you dropped low level enough.

    I think this is touched upon in the other article:
    > When combining multiple streams it's important to synchronize them so that the outputs
    > of each reflect the same set of inputs.

If someone has more experience with Spark [Structured] Streaming, I would love to hear your thoughts. I stick mostly with Spark's batch jobs, playing around with datasets in a REPL/Notebook--which it really shines for.

That said, I'm really excited by the future of Spark Structured Streaming--I want to write declarative code and say how my data gets from A to B--not what needs to happen to get from A to B.

Spark structured streaming is in there under structured, high temporal locality.

It didn't make it into https://scattered-thoughts.net/writing/internal-consistency-... because it has severe limitations for low temporal locality operations:

    * As of Spark 2.4, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.
    * As of Spark 2.4, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.
        * Cannot use streaming aggregations before joins.
        * Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.

    * There are a few DataFrame/Dataset operations that are not supported with streaming DataFrames/Datasets. Some of them are as follows.
        * Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not yet supported on streaming Datasets.
        * Limit and take the first N rows are not supported on streaming Datasets.
        * Distinct operations on streaming Datasets are not supported.
        * Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.
        * Few types of outer joins on streaming Datasets are not supported.
I haven't looked into the implementation but I'm guessing they just don't have good support for retractions in aggregates/joins. I also think it's likely to at least fall afoul of early emission and confusing changes with corrections.

The rest of spark doesn't really fit the post at all afaict - there is no streaming or incremental update, just batch stuff.

Side note: one thing I've really noticed lacks formalism in a lot of programming languages and environments is easily streaming data and performing operations on a data "pipe". I normally dislike javascript, but Node.js actually does a pretty cool job of it with their streams API. Really wish every language had something like that built in, with flow control, etc., so you could e.g. pipe data through compression => encryption => measure size => upload and those sorts of flows. Even more useful is when all of these steps can run somewhat in parallel e.g. while datum 1 is being encrypted, datum 2 is being compressed, etc.

Elixir has a few interesting abstractions for that: GenStage, Flow, Broadway.


Go Channels?

i found this very interesting.

we have a homegrown push-based streaming library in .net based on reactive (it supports indexed joins of "tables") and we're looking for something more infrastructural as we move to the cloud.

Would love to hear more options in that bottom left box with differential dataflow et al, because for me that's where all the really interesting work is happening.

Checkout differential datalog. Looks super interesting: https://github.com/vmware/differential-datalog

But I've only played with souffle, which I am not sure where it lands in this taxonomy. I think it lands on the high temporal locality but compared to differential datalog, the biggest difference in my opinion is the souffle solves problems as a batch (i.e., all the inputs are known at command issue time) while differential datalog may receive inputs at runtime.

There's also incA: which I think competes with differential datalog. https://github.com/szabta89/IncA

So, how I see Datalog (which is a subset of prolog) falling into this taxonomy:

Datalog (batch-processing) would fall under structured/high-temporal locality/internally consistent.

Datalog (incremental) would fall on structured/low-temporal locality/internally consistent.

Looking a little closer on what is defined as "consistent" here, incremental Datalog might not be "consistent" because it might return an "incorrect" or "unavailable" computation for a past input if you "update" or modify the input. But if one restricts the subset of Differential Datalog to follow functional semantics, then that would be consistent. Not really super knowledgable about this projects beyond playing with them and reading some papers on Datalog/Souffle.

I think differential dataflow might have some application to the less structured domain and that makes it interesting.

Somewhat off-topic, but I feel like the "patronage economy" is really taking off, with Patreon and GitHub sponsors. Content like this is really interesting, and I've appreciated this blog's various write-ups in the past.

There's another great blog, with writings supported by patrons, whose name escapes me (I think the domain ends in something about "lime", with the '.me' ccTLD).

Whenever I see those domains on HN, I always click, since I have a high expectation the content will be good, given how it was created.

It's getting there. I started 3.5 months ago and I'm currently making ~75% of minimum wage (which to be fair is pretty high in BC).

I suspect it's also going to be fragile. The next time we hit a recession or, god forbid, another pandemic, sponsorships will probably be the first thing to go.

But it does help fill gaps. There are a lot of things I've wanted to code, study or write about that I think people will get value from but that I could not get any employer to pay for. So I really appreciate being able to work on things just because they're valuable, rather than because there is a way to capture the value and make a profit.

Did you mean

https://fasterthanli.me/articles/aiming-for-correctness-with... ?

I've been following the rust content, it's a readable and useful (to me) blog

I feel like this map is missing something to tease apart systems that are not interchangeable in practice (UI state management vs database stuff).

Maybe persistent vs in-memory? Like whether the data is typically completely blown away and recomputed on page reload (e.g., ui state, dom trees, scene graphs).

There are systems which are simultaneously in-memory and persistent, e.g. Tarantool, Starcounter

To me the unstructured area is the future of applications development and the structured area is barren and overhyped.

(e.g. 'eventually consistent' analytics is so 2021, because it looks like you are doing something quantitative,except it doesn't matter if you get the right answer. It will get you claps from a certain audience but most people lose patience pretty quick when the 'numbers don't add up'.)

I can’t really relate your first paragraph to your second, at least by the taxonomy in the article which has some consistent structured systems as well as inconsistent systems.

I also disagree that unstructured is the future. I think most computations actually are structured. Otherwise sql queries wouldn’t be so useful. I think a lot of processes basically start with a big bag of foos and end up with a big bag of roughly corresponding bars, so if one foo changes then there is only a localised change to the output. That can be encoded with an unstructured system but I find they are better for cases where outputs are more scalar and the processes between inputs and outputs are myriad. While you can do structured-like things in an unstructured system, you lose out on a lot of the advantages of that structure.

What I want is recognition of the structured and the unstructured.

For instance a tool like this


maintains a database of timely information and can use rules to match events and could be used to manage the numerous problems of event-based architectures.

That particular tool mangles Java source code badly in the process of compiling rules so it gives error messages that make no sense at all, even if you are looking at the compiler's source code in the other window and at the running compiler in the debugger.

In the light of the interest in "low code" and the real success of "business rules" for domains they apply in I am amazed there has been less effort to apply production rules to the "update the ui when the database changes".

Even though that can look unstructured, the implementation of the rules can be done with the same relational operators that all those other methods used. A few year ago you would have had to specified the indexes by hands, but the self-optimizing database patent from Salesforce is run out now and there is no reason the system can't learn the frequent query patterns itself.

IMO the key with eventual consistency is making it known which bits are currently known to be consistent, and which are still in flux. I realize that this is vague, but how that works depends on what it is you're processing and how much time it takes for consistency to emerge.

If you're doing any kind of Important Reporting on eventually consistent data, you'd better make sure that you either know you're only including finalised data, or that there's a big fat warning with error bars.

This is why the high temporal locality part of the map is all EC - when everything is windowed you can just wait for the window to close.

On the low temporal locality side, the only consistent system I've seen so far is differential dataflow (and materialize) which does internally track which results are consistent and gives you the option to wait for consistent results in the output.

Exactly. People act like EC is impossible to make viable, and ignore the fact that transactional logic can impose 10-100x performance loss. Putting an EC system in front of a transactional system can be a massive performance win.

This isn't quite what the Helios paper talks about, but there's lots of things like async indexing in there that are kinda similar in nature

I believe Redis belongs in this map.

AFAIK it does not perform computation or control the computation of other systems.

"Last updated 2021-04-18" (so not 2018?)

Fixed, thanks!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact