
Like others have said, usually just go for XGBoost, which is increasingly shown (proven?) to be the strong default in the literature, e.g.: https://arxiv.org/abs/2106.03253
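
For concreteness, a minimal sketch using XGBoost's scikit-learn wrapper (the dataset and parameters are illustrative, and I'm assuming a reasonably recent xgboost):

    # Sketch: fit an XGBoost classifier via the scikit-learn wrapper.
    # Swap in your own X, y; parameters here are illustrative.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # mean accuracy on the held-out split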

You can also start with scikit-learn's `LogisticRegressionCV`, if a linear model is more palatable to your audience.
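
Something like this (again just a sketch; the Cs/cv values are illustrative):

    # Sketch: LogisticRegressionCV picks the regularization strength C
    # by internal cross-validation over a grid of Cs.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    clf = make_pipeline(
        StandardScaler(),  # linear models generally want scaled features
        LogisticRegressionCV(Cs=10, cv=5, max_iter=5000),
    )
    clf.fit(X, y)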

The bigger challenge is to reliably estimate how good your model is.

It's not about getting the best performance – it's about getting the "real" performance. How will your model _really_ do on future unseen datasets?

The answer to this challenge is cross-validation. But what are the questions?

There are 2 very different questions for which the answer is cross-validation.

One is: which hyperparameters should you use with your model?

The second is: what is the generalization performance of the model?

This requires 2 separate applications (loops) of cross-validation. The authors of this book talk about this in terms of having a "validation set" and a "test set". (Sometimes these terms are switched around, and there is also the "holdout set". It's critical to know how you, and the rest of your team, are using these terms in your modeling.)

A robust way to implement these 2 loops is nested cross-validation, readily available in many packages; it should also be "fast enough" on modern computers.
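
In scikit-learn, for example, nested CV is just a grid search wrapped in an outer cross_val_score. A sketch, with an illustrative estimator and parameter grid:

    # Sketch of nested CV. Inner loop (GridSearchCV) picks hyperparameters;
    # outer loop (cross_val_score) estimates the generalization performance
    # of that whole selection procedure.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    param_grid = {"max_depth": [2, 4, 6], "n_estimators": [100, 300]}

    inner = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                         param_grid, cv=3)       # inner loop: model selection
    scores = cross_val_score(inner, X, y, cv=5)  # outer loop: evaluation
    print(scores.mean(), scores.std())

Note that the outer scores estimate the performance of the whole "tune, then fit" procedure, not of any single fitted model.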

One exercise that remains is: with nested CV, which model do you pick as your "production" model?

That is also a bit tricky. Reading things like the following can help: https://stats.stackexchange.com/q/65128/207989
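
As I read the linked answer: since nested CV scores the procedure rather than one fitted model, for production you rerun the inner selection on all of the data and keep the refitted winner. A sketch, continuing with the same illustrative grid:

    # Sketch: the production model comes from rerunning the inner CV
    # (hyperparameter selection only) on all available data.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    param_grid = {"max_depth": [2, 4, 6], "n_estimators": [100, 300]}

    final_search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                                param_grid, cv=3, refit=True)
    final_search.fit(X, y)  # all data, inner loop only
    production_model = final_search.best_estimator_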

EDIT: for those inclined, here is a paper on why you need 2 loops of CV: "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation" by Cawley and Talbot: https://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley1...




> One exercise that remains is: with nested CV, which model do you pick as your "production" model?

Why even pick one? Just make an ensemble out of all of them. Unless you need your model to remain blind to one test or validation set for some reason, but then you should probably choose at random anyway to avoid bias.

Of course, this being HN, somebody is going to mention some case where they only have the CPU or memory budget for one model. But in general, making ensembles of multiple XGBoost models is a great way to get better model performance.
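
E.g., keep the model trained on each fold and average their predicted probabilities at inference time. A sketch (fold count and parameters illustrative):

    # Sketch: train one XGBoost model per CV fold, then average the
    # fold models' predicted probabilities at inference time.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    models = []
    for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        m = XGBClassifier(n_estimators=200, eval_metric="logloss")
        m.fit(X[train_idx], y[train_idx])
        models.append(m)

    def ensemble_proba(X_new):
        # Mean of the fold models' probability estimates.
        return np.mean([m.predict_proba(X_new) for m in models], axis=0)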



