Hacker News new | past | comments | ask | show | jobs | submit | prions's comments login

Stream quality and catalog sizes are both legal issues with music rights holders and have nothing to do with streaming innovation. Unless you consider Apple having a boatload of money to make legal issues go away a kind of innovation.

Because the aggrieved aren't satisfied only by Elon taking over Twitter and taking it in a different direction; they also believe that those currently at Twitter deserve harsh punishment for whatever role they played there.


There have been a lot of people cheering Twitter's downfall because they feel personally or politically vindicated. But it's especially bewildering to see how many people here on HN cheer on the absolute disrespect and humiliation of Twitter employees as a good thing.

IMO data engineering is already a specialized form of software engineering. However, what people interpret as DEs being slow to adopt best practices from traditional software engineering is more about the unique difficulties of working with data (especially at scale) and less about a lack of awareness or desire to use best practices.

Speaking from my DE experience at Spotify and previously in startup land, the biggest challenge is the slow and distant feedback loop. The vast majority of data pipelines don't run on your machine and don't behave like they do on a local machine. They run as massively distributed processes and their state is opaque to the developer.

Validating the correctness of a large scale data pipeline can be incredibly difficult as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices here like unit testing only go so far. And integration testing really needs to work at scale with easily recyclable infrastructure (and data) to not be a massive drag on developer productivity. Even getting the correct kind of data to be fed into a test can be very difficult if the ops/infra of the org isn't designed for it.

The best data tooling isn't going to look exactly like traditional SWE tooling. Tools that vastly reduce the feedback loop of developing (and debugging) distributed pipelines running in the cloud, and that also provide means of validating the output on meaningful data, is where tooling should be going. Traditional SWE best practices will only really take hold once that kind of developer experience is realized.

> Validating the correctness of a large scale data pipeline can be incredibly difficult as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices here like unit testing only go so far.

I'm glad to see someone calling this out because the comments here are a sea of "data engineering needs more unit tests." Reliably getting data into a database is rarely where I've experienced issues. That's the easy part.

This is the biggest opportunity in this space, IMHO, since validation and data completeness/accuracy is where I spend the bulk of my work. Something that can analyze datasets and provide some sort of ongoing monitoring for confidence in the completeness and accuracy of the data would be great. These tools seem to exist mainly in the network security realm, but I'm sure they could be generalized to the DE space. When I can't leverage a second system for validation, I will generally run some rudimentary statistics to check whether the volume and types of data I'm getting are similar to what is expected.
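
To make the idea concrete, a rudimentary check of that kind might look like the following sketch. The thresholds, column names, and function names are invented for illustration; the point is just comparing today's batch against a trailing window rather than asserting exact values.

```python
from statistics import mean, stdev

def volume_anomaly(history, today, z_threshold=3.0):
    """Flag today's row count if it deviates more than z_threshold
    standard deviations from the recent daily counts."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

def null_rate(rows, column):
    """Fraction of rows where `column` is missing."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)
```

In practice you would alert on these signals rather than fail the pipeline, since a volume dip is often legitimate (holidays, upstream maintenance) and a human needs to make the call.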

There is a huge round of "data observability" startups that address exactly this. As a category it was overfunded prior to the VC squeeze. Some of them are actually good.

They all have various strengths and weaknesses with respect to anomaly detection, schema change alerts, rules-based approaches, sampled diffs on PRs, incident management, tracking lineage for impact analysis, and providing usage/performance monitoring.

Datafold, Metaplane, Validio, Monte Carlo, Bigeye

Great Expectations has always been an open source standby as well and is being turned into a product.

Thanks for the recommendations, I'm going to check some of them out.

Engineers demanding unit tests for data is a perfect test to weed out the SWEs who aren't really DEs. Ask about experience with data quality and data testing when you interview candidates and you'll distinguish the people who will solve a problem with a simple relational join in an hour (DEs) from those who will unknowingly build a shitty implementation of a database engine over a month (SWEs trying to solve data problems with C++ or Java).
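
For illustration, the kind of problem in question often reduces to a single join and an aggregate. A sketch using the stdlib's sqlite3 as a stand-in for a real warehouse (table and column names are invented):

```python
import sqlite3

# In-memory database standing in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0);
    INSERT INTO customers VALUES (10, 'EU'), (11, 'US');
""")

# Revenue per region: one join plus an aggregate, not a hand-rolled engine.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
```

The equivalent logic hand-written in a general-purpose language ends up reimplementing hash joins and group-by, which is exactly the trap described above.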

Unit testing is a means to an end: how do we verify that code is correct the first time, and how do we set ourselves up to evolve the code safely and prevent regressions in the future?

Strong typing can reduce the practical need for some unit testing. Systems written in loosely-typed languages like Python and JavaScript often see real-world robustness improvements from paranoid unit testing to validate sane behavior when fed wrongly-typed arguments. Those particular unit tests may not be needed in a more strongly typed language like Java or TypeScript or Rust.
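
As a hypothetical illustration (the function and its contract are invented), this is the kind of defensive test that is common in Python but that a type checker in a strongly typed language would make unnecessary:

```python
def total_cents(amounts):
    """Sum a list of integer cent amounts; reject wrong types
    rather than silently coercing or concatenating."""
    if not all(isinstance(a, int) for a in amounts):
        raise TypeError("amounts must be integers")
    return sum(amounts)

def test_total_cents():
    # Happy path.
    assert total_cents([100, 250]) == 350
    # Paranoid check: strings must raise, not concatenate.
    try:
        total_cents(["100", "250"])
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError for string input")
```

In Java or Rust the second test has no reason to exist; the compiler rejects the call site.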

Similarly, SQL is a better way to solve certain data problems, and may not need certain checks.

Nevertheless, my experience as a software developer has taught me that in addition to the code I write which implements the functionality, I need to write code which proves I got it right. This is in addition to QA spot-checking that the code functions as expected in context (in dev, in prod, etc.). Doing both automated testing and QA gets the code to what I consider an acceptable level of robustness, despite the human tendency to write incorrect code the first time.

There are plenty of software developers who disagree about that and eschew unit testing in particular and error checking in general — and we tend to disagree about how best to achieve high development velocity. I expect there will always be such a bifurcation within the field of data engineering as well.

If you need to maintain some sort of deductive correctness - i.e. my inputs are correct and my code is correct, therefore my outputs are also correct - you're gonna cover only a tiny fraction of real-world problems.

Data engineering is typically closely aligned with the business, and its processes are inherently fuzzy. Things are 'correct' as long as no people/quality checks are complaining. There is no deductive reasoning. No true axioms. No 'correctness'. You can only measure non-quality by how many complaints you have received, but not the actual quality, since it's not a closed deductive system.

Correctness is also defined by somebody downstream from you. What one team considers correct, the other complains about. You don't want to start throwing out good data for one team just because somebody else complained. But many people do. Or typically people coming from SWE into DE tend to, before they learn.

I've worked with medium-sized ETL, and not only does it have unique challenges, it's a sub-domain that seems to reward quick and dirty and "it works" over strong validation.

The key problem is that the more you validate incoming data, the more you can demonstrate correctness, but then the more often incoming data will be rejected, and you will be paged out of hours :)

I also manage a medium sized set of ETL pipelines (approx 40 pipelines across 13k-ish lines of Python) and have a very similar experience.

I've never been in a SWE role before, but am related to and have known a number of them, and have a general sense of what being a SWE entails. That disclaimer out of the way, it's my gut feeling that a DE typically does more "hacky" kind of coding than a SWE. Whereas SWEs have much more clearly established standards for how to do certain things.

My first modules were a hot nasty mess. I've been refactoring and refining them over the past 1.5 years so they're more effective, efficient, and easier to maintain. But they've always just worked, and that has been good enough for my employer.

I have one 1600 line module solely dedicated to validating a set of invoices from a single source. It took me months of trial and error to get that monster working reliably.

Oddly, this sounds like the difference between inductive and deductive systems.

This is actually a great observation. Data pipelines are often written in various languages, running on heterogeneous systems, with different time alignment schemes. I always found it tricky to "fully trust" a result. Hmm, any best practice from your side?

Without getting into the weeds of it, I'd say smooth out the rough edges in your development experience and make it behave as similar to prod as possible. If there's less friction there's less incentive to cut corners and make hacks imo.

Some pain points:

- Does it take forever to spin up infra to run a single test?

- Is grabbing test data a manual process? This can be a huge pain, especially if the test data is binary like Avro or Parquet. Test inputs and results should be human-friendly

- Does setting up a testing environment require filling out tons of yaml files and manual steps?

- Things built at the wrong level of abstraction! This always irks me. Keep your abstractions clean between which tools in your data stack do what. When people start inlining task-specific logic at the DAG level in Airflow, or let individual tasks make their own triggering or scheduling decisions, things just become confusing.
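
One way to keep those layers separate, sketched in plain Python (the function names and the orchestrator stub are invented, standing in for an Airflow/Dagster DAG): task logic lives in pure functions that know nothing about scheduling, and the orchestration layer only wires them together.

```python
# Task logic: a pure, unit-testable function with no knowledge of
# triggering, retries, or the orchestrator that runs it.
def dedupe_events(events):
    """Drop duplicate events, keeping the first occurrence by id."""
    seen, out = set(), []
    for e in events:
        if e["id"] not in seen:
            seen.add(e["id"])
            out.append(e)
    return out

# Orchestration layer: a stand-in for a DAG definition. It only
# names tasks and wires them together; scheduling decisions live here.
def build_pipeline():
    return [
        ("extract", lambda: [{"id": 1}, {"id": 1}, {"id": 2}]),
        ("dedupe", dedupe_events),
    ]
```

Because `dedupe_events` is pure, it can be tested locally in milliseconds, while the DAG-level wiring stays thin enough that there is nothing in it worth testing at scale.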

Right now my workflow allows me to run a prod job (Google Cloud Dataflow) from my local machine. It consumes prod data and writes to a test-prefixed path. With unit tests on the Scala code + a successful run of the Dataflow job + validation and metrics thrown on the prod job, I can feel pretty comfortable with the correctness of the pipeline.

Not OP, but a data engineer with 4 years of experience in the space - I think the key is to first build the feedback loop - i.e. anything that helps you answer how you know the data pipeline is flowing and that the data is correct - then get sign-off from both the producers and consumers of the data. Actually getting the data flowing is usually pretty easy after both parties agree about what that actually means.

Same with Fly.io, SQLite, Hetzner, etc. Hackernews is a pretty effective way to advertise to developers.

Don't know about Hetzner but Fly.io and SQLite (who now seem to sort of be intertwined in making SQLite Cool Again™) just have great content.

Hetzner/OVH are HN darlings not because of how innovative they are in features, but in pricing.

They tend to show up as submissions mostly when something bad happens (e.g. price increase), but they still show up in comments a lot when people discuss how comically expensive cloud providers are in a direct comparison and how you can be similarly effective running on such pricing-focused hosts, especially if you know how to build the redundancy yourself (or don't need it).

Which is a pretty effective way to advertise to developers.

Wish there was a nice HN coffee table book documenting historic trends on here... from first post to current print date. Maybe just a bunch of illustrations of key words and a small footnote accompanying each

The marketing strategy for anything remotely developer-oriented is literally just to find those posts where somebody figured out the best times to submit on HN.

So Twitter isn't important except for:

* influential politicians

* journalists

* Jerome Powell: "... though there's some evidence that Jay Powell, the chairman of the Federal Reserve, may consult it for ideas on monetary policy"

* College educated people (94 million americans)

I don't think this is the own-the-libs that the author intended

The TrueUp data casts a pretty wide net. Seeing Microsoft on that list gives the impression that layoffs are hitting FAANGs in a similar way to high-growth startups, but those layoffs were due to Microsoft pulling out of Russia.


Similarly, the Tesla layoffs sounded like an office of in-house data labelers close to the tool developers in California being replaced by outsourced temp workers after the labeling tools reached maturity. It's a stretch to call that a tech layoff.

Amazon Mturk is cheap too

It's not unusual for companies who are doing fine to use industry layoffs as cover to cut under-performing employees.

Also contained Better. Debatable how much it is a tech company, and I believe many of the layoffs were mortgage loan officers.

I don’t understand how comparing Trudeau to Hitler is a reasonable position.

It’s simply not even remotely close to true.

Especially considering Trudeau hasn't led a massive genocide or tried to take over the continent.

IFS patient here as well. Totally agree on how powerful it can be, and how deep of a connection you can get with some far off parts of yourself. I find the whole concept a great mental model of how my mind works.

Also, as a software engineer who's always been on the analytical/intellectual side of reasoning, I prefer the psychodynamic nature of IFS as opposed to modalities like CBT. It really helped open me up to being more emotionally intelligent and aware.

That's awesome. I think CBT is really reductive. Quite pessimistic, as well - CBT seems to hold little hope of actually healing many issues, and only aims for symptom management. I think that's because it's not a very good therapy for many (most?) situations, so it has to rationalize this by saying true healing is impossible.

So the author points to a link claiming 12% of Ukrainians own crypto, yet Triple-A creates a circular reference when trying to substantiate this claim:

- The author refers to: https://triple-a.io/crypto-ownership-ukraine/

- This link calls it an estimate: "It is estimated that over 5.5 million people, 12.7% of Ukraine’s total population, currently own cryptocurrency.(1)"

- Link (1) points to https://triple-a.io/crypto-ownership/

- The link under "Ukraine" in that list points back to the first claim: https://triple-a.io/crypto-ownership-ukraine/

At the bottom of their data page: "The data contained or reflected herein are proprietary of TripleA."

This is literally an april fools joke...

So all of the data for https://triple-a.io/crypto-ownership/ is fake?

It might or might not be fake. All we can say is that it can't be substantiated by anything on their website (they aren't referring to external sources), and that they are pretending that their data has been independently substantiated (presumably hoping nobody actually follows the citation links).

It is quite a sight to watch someone explain to someone else that something that they read on the internet, might not be believable.

Given the work he put into it, I took it as a serious steelman argument regardless so I figured he'd give his all for making the case.

True, and the concerning thing is that it’s indistinguishable from the normal PR and marketing pieces. The Matt Damon commercials are ridiculous.
