Hacker News
Applied Machine Learning for Tabular Data (aml4td.org)
146 points by sebg 1 day ago | 18 comments





My experience with tabular data textbooks is that they introduce many different techniques in a relatively un-opinionated way, which doesn't help you build intuition about what your strategy for real-world problems should be. In practice, almost all tabular data problems I've encountered boil down to:

1. Make sure you don't have any impossible values or data leakage in your features or target

2. Split your data in an intelligent way (temporally for time-series, group-wise for hierarchical data)

3. Try a really simple linear regression / logistic regression model to get a "dumb" baseline for your accuracy/error metric and make sure it is reasonable

4. If you need interpretability, consider a GAM; otherwise, throw it into XGBoost and you'll get state-of-the-art results


I am writing a tabular data textbook for O'Reilly - on building ML systems with a feature store.

I try to be opinionated about modelling - XGBoost is all you really need - but the challenges are more like you say: how to prevent data leakage (ASOF LEFT JOIN, or use a feature store), separating model-independent data transformations from model-specific ones, and APIs for things like time-series splits, logging, monitoring, and building and operating the pipelines. All pretty standard software engineering in Python nowadays.
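For readers unfamiliar with ASOF joins: a point-in-time correct join can be sketched with pandas.merge_asof. Toy data and hypothetical column names:

```python
# Point-in-time correct ("ASOF") join: each label row only sees feature
# values known at or before its own timestamp, preventing leakage.
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "churned": [0, 1, 0],
}).sort_values("ts")  # merge_asof requires sorting on the join key

features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-12"]),
    "logins_7d": [3, 8, 2],
}).sort_values("ts")

# direction="backward" takes the latest feature row at or before each
# label timestamp; a plain LEFT JOIN on user_id would leak the future.
train = pd.merge_asof(labels, features, on="ts", by="user_id",
                      direction="backward")
print(train)
```

Note that user 2's label row gets a NaN feature: its only feature row arrives two days after the label timestamp, which is exactly the leak the ASOF join blocks.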

Free chapters:

https://www.hopsworks.ai/lp/oreilly-book-building-ml-systems...


Personally I am very interested in building big data pipelines for machine learning, with initially batch and then real-time ECG and seismic data, for CVD screening/early detection and earthquake early detection/prediction respectively. Any idea when the completed book will be available?

Just wondering what is the main difference between your book and this book, Architecting Data and Machine Learning Platforms also from O'Reilly:

https://www.oreilly.com/library/view/architecting-data-and/9...


My book is a hands-on book, where you build AI systems. The first 4 chapters are already out.

I have run a course at KTH for years; here are the AI systems students built in 2024 over a 2-3 week period. There was an earthquake project amongst them!

https://id2223kth.github.io/assignments/2024/ID2223Projects2...


For 4, I would always start with RF. You have fewer knobs to turn (aka degrees of freedom with which to hang yourself), but still get within spitting distance of XGBoost.
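A minimal version of that starting point, using scikit-learn's RandomForestClassifier with essentially default settings on a built-in dataset:

```python
# A mostly-default random forest as the first tree-based model:
# few knobs, yet usually within spitting distance of tuned boosting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# n_estimators is about the only knob worth touching at this stage;
# more trees never hurt accuracy, only training time.
rf = RandomForestClassifier(n_estimators=300, random_state=0, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```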

xgb.cv(X,y) all-the-things.

It's far from a principled statistical approach to what is supposed to be at least an observational study in every case - with all the caveats and preparations thereof - but it is very easy to explain to devs, and one cannot stop the tide, unfortunately.


I'd replace XGBoost with either LightGBM or CatBoost. Though all three are capable of producing excellent results, and XGBoost has the advantage of not being maintained by an organization you might disagree with.

What do you think about The Mechanics of Machine Learning book by Jeremy Howard? It seems to have a systematic approach to fitting models and improving them.

At risk of being excessively cheeky, I think you should throw it into xgboost and not worry about it.

I love Jeremy Howard, but if you're doing applied ML on tabular data, tweaking specifics of model construction probably won't get you anything beyond stock xgboost. The concerns highlighted by your sibling commenter jamesblonde end up eating most of your time in deployed ML products, not model construction.

If you're looking to learn about ML model construction I'm sure it's a great book.


Like others say, usually just go for XGBoost - a choice increasingly shown (proven?) in the literature, e.g.: https://arxiv.org/abs/2106.03253

You can also start with scikit-learn's `LogisticRegressionCV`, if a linear model is more palatable to your audience.
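A sketch of that linear-baseline route (breast-cancer dataset as a stand-in; the scaling step is added because LogisticRegressionCV converges much better on standardized features):

```python
# LogisticRegressionCV picks its regularization strength via built-in CV,
# giving a tuned linear baseline from a single estimator.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# StandardScaler inside the pipeline so scaling is fit on training folds only.
clf = make_pipeline(StandardScaler(), LogisticRegressionCV(cv=5, max_iter=5000))
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```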

The bigger challenge is to reliably estimate how good your model is.

It's not about getting the best performance – it's about getting the "real" performance. How will your model _really_ do on future unseen datasets?

The answer to this challenge is cross-validation. But what are the questions?

There are 2 very different questions for which the answer is cross-validation.

One is: which hyperparameters to use with your model?

The second is, what is the generalization performance of the model?

This requires 2 separate applications (loops) of cross-validation. The authors of this book talk about this in terms of having a "validation set" and a "test set". (Sometimes these terms are switched around, and there is also the "holdout set". It's critical to know how you, and the rest of your team, are using these terms in your modeling.)

A robust way to implement these two CVs is nested cross-validation, readily available in many packages - and it should also be "fast enough" on modern computers.
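For example, with scikit-learn (toy dataset; random forest as a stand-in model):

```python
# Nested CV: the inner loop (GridSearchCV) answers "which hyperparameters?",
# the outer loop (cross_val_score) answers "what is the generalization
# performance, including the cost of having searched?"
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=3,
)
# Each outer fold refits the entire inner search from scratch, so the
# outer score is not biased by the hyperparameter selection.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```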

One exercise that remains is: with nested CV, which model do you pick as your "production" model?

That is also a bit tricky. Reading things like the following can help: https://stats.stackexchange.com/q/65128/207989

EDIT: for those inclined, here is a paper on why you need 2 loops of CV: "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation" by Cawley and Talbot: https://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley1...


> One exercise that remains is: with nested CV, which model do you pick as your "production" model?

Why even pick one? Just make an ensemble out of all of them. Unless you need your model to remain blind to one test or validation set for some reason - but then you should probably choose at random anyway to avoid bias.

Of course this being HN somebody is going to mention some case where they only have the CPU or memory budget for one model. But in general making ensembles of multiple XGBoost models is a great way to get better model performance.
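A sketch of that idea - averaging predicted probabilities over several models instead of crowning a single winner - using scikit-learn's GradientBoostingClassifier with subsampling as a stand-in for XGBoost:

```python
# Ensemble by averaging: train several boosted models (different seeds
# here, standing in for the per-fold models from CV) and average their
# predicted probabilities.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# subsample < 1 makes each seed produce a genuinely different model.
models = [
    GradientBoostingClassifier(subsample=0.8, random_state=seed).fit(X_tr, y_tr)
    for seed in range(5)
]
# Simple probability averaging; weighted or rank averaging also works.
avg_proba = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
auc = roc_auc_score(y_te, avg_proba)
print(auc)
```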


Spoiler: just use xgboost and you’re done

I appreciate the section on independent component analysis (ICA). It's not well known and very underused. In my experience, it usually works great on heterogeneous tabular data - which PCA usually does poorly on.
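A toy illustration of the difference, on synthetic mixed signals: FastICA recovers the independent sources that PCA's variance-ranked, orthogonal components leave mixed.

```python
# ICA vs PCA on a classic blind-source-separation toy problem.
import numpy as np
from sklearn.decomposition import PCA, FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # smooth source
s2 = np.sign(np.sin(3 * t))              # square-wave source
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 2.0]])   # mixing matrix
X = S @ A.T                              # observed, mixed signals

ica = FastICA(n_components=2, random_state=0)
S_ica = ica.fit_transform(X)             # ~original sources, up to
                                         # sign/scale/order

pca = PCA(n_components=2)
S_pca = pca.fit_transform(X)             # decorrelated, but still mixed
```

PCA only decorrelates (second-order statistics), while ICA exploits non-Gaussianity to undo the mixing, which is why it can work well on heterogeneous columns.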

Wow, those are some names I’ve not seen in a while. I wonder how LLMs do at R?

There are multiple packages in R for using LLMs. One such package is tidychatmodels.

Ah, I meant how well LLMs can write R.

GPT4o is pretty great at writing R in my experience. Particularly for data wrangling. But also shiny apps.

Too much text is annoying to read and understand. Bad presentation.


