skrub

skrub (formerly dirty_cat) is a Python library that facilitates preparing your tables for machine learning.

If you like the package, spread the word and ⭐ this repository!

What can skrub do?

skrub provides data assembling tools (TableVectorizer, fuzzy_join...) and encoders (GapEncoder, MinHashEncoder...) that rely on morphological similarities between strings. We usually identify three common cases: similarities, typos and variations.

See our examples.
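
To give an intuition of what "morphological similarity" means, here is a minimal sketch based on character n-grams, using only the standard library. It is not skrub's implementation: the helpers `ngrams`, `ngram_similarity` and `best_match` are hypothetical names introduced for illustration, and the choice of 3-grams with a Jaccard score is an assumption.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty sets are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(query, candidates):
    """Return (candidate, score) for the candidate most similar to the query."""
    scored = ((c, ngram_similarity(query, c)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # A typo still shares most of its n-grams with the clean spelling.
    print(best_match("Londn", ["London", "Paris", "Berlin"]))
```

Strings that differ by a typo or a small variation keep many n-grams in common, so they score higher than unrelated strings. This is the kind of signal that fuzzy matching and string encoders can build on.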

What skrub cannot do

Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.

This kind of problem is tackled by Natural Language Processing methods.

skrub can still help with handling typos and variations in this kind of setting.
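
The limitation can be illustrated with a small standard-library sketch. It uses `difflib.SequenceMatcher` as a stand-in for a character-based similarity, which is an assumption made for illustration and not the method skrub uses; `char_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    # Synonyms: same meaning, almost no characters in common.
    print("car vs automobile:", char_similarity("car", "automobile"))
    # Typo: same word with one missing letter, most characters in common.
    print("automobile vs automobil:", char_similarity("automobile", "automobil"))
```

A character-based measure rates the synonym pair as dissimilar and the typo pair as nearly identical, which is why semantic matching calls for other tools while typos and variations remain tractable.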

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

Installation (WIP)

There are currently no PyPI releases. You can still install the package from the GitHub repository with:

pip install git+https://github.com/skrub-data/skrub.git

Dependencies

Dependencies and minimal versions are listed in the setup file.

Related projects

Related projects are listed on skrub's website.

Contributing

The best way to support the development of skrub is to spread the word!

Also, if you are already a skrub user, we would love to hear about your use cases and challenges in the Discussions section.

To report a bug or suggest enhancements, please open an issue and/or submit a pull request.

Additional resources

References

[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.

[2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.
