Easysklearn is a Python package designed to perform exploratory data analysis, help with missing data imputation, and provide baseline models. It also assists with feature selection, a common problem in data science and machine learning analyses. As its name indicates, the package operates like sklearn. It carries out tasks such as splitting data, feature selection, model fitting, and imputation of missing numerical data.
$ pip install -i https://test.pypi.org/simple/ easysklearn
This package introduces data science enthusiasts with little to no knowledge of machine learning to the common steps of a supervised learning analysis. The package contains four functions, each of which accepts a pandas DataFrame. All functions can be used on a dataset with numerical features. Each function may have its own required and optional arguments.
- eda: The eda function splits the original data into train and test sets and generates a statistical report covering the correlation between variables, the number of missing values, class imbalance, and the types of data present in the dataset.
- miss_data: The miss_data function handles missing numerical data in the data frame.
- baseline_fun: The baseline_fun function gives users a quick check of how the selected sklearn models perform compared to a baseline model, which serves as a starting point for improvement.
- feature_select: The feature_select function removes redundant features using forward selection.
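The kinds of statistics listed for eda can be previewed with plain pandas. The following is a minimal sketch, not the package's implementation: the toy DataFrame and its column names are hypothetical, and the train/test split that eda performs is omitted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "size": [4500.0, 3450.0, np.nan, 2800.0],
        "bedrooms": [4, 6, 9, 3],
        "sold": ["yes", "no", "yes", "yes"],
    }
)

# Number of missing values in each column.
missing_counts = df.isna().sum()
# Data type of each column.
column_types = df.dtypes
# Pairwise Pearson correlation between the numerical columns.
correlations = df.select_dtypes(include="number").corr()
# Share of each class in the (hypothetical) target column.
class_balance = df["sold"].value_counts(normalize=True)

print(missing_counts, column_types, correlations, class_balance, sep="\n\n")
```

A strongly skewed class_balance is one sign of class imbalance in the target.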
To our knowledge, while pandas profiling provides some statistical analysis of data, there is no general-purpose library in the Python ecosystem that performs all of the aforementioned tasks together.
- python = "^3.8"
- pandas = "^1.2.3"
- numpy = "^1.20.1"
- matplotlib = "^3.3.4"
- sklearn = "^0.0"
- seaborn = "^0.11.1"
- ipython = "^7.21.0"
- jupyter = "^1.0.0"
Task | Function |
---|---|
Exploratory data analysis | eda(df, target) |
Numerical data imputation | miss_data(x_train, x_test, strategy="mean") |
Compare selected model to a baseline model | baseline_fun(X_train, y_train, type="regression", metrics_1="accuracy", metrics_2="r2") |
Feature selection to reduce data dimension | feature_select(X_train, y_train, threshold=0.05) |
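The idea behind comparing a model to a baseline can be illustrated with plain scikit-learn. This is a sketch under stated assumptions rather than the code inside baseline_fun: the synthetic data, the choice of LinearRegression, and the use of 5-fold cross-validation with only the r2 metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical): y is a linear function of X plus noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

models = {
    "baseline (mean)": DummyRegressor(strategy="mean"),  # always predicts the training mean
    "linear regression": LinearRegression(),
}

# Mean cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

A candidate model is only worth tuning further if it clearly beats the baseline score.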
import numpy as np
import pandas as pd
from easysklearn import eda, baseline_fun, miss_data, feature_select
example_df = pd.DataFrame(np.array([[4500, np.nan, 4], [3450, 350_000, 6],
                                    [np.nan, 800_000, 9]]), columns=['size', 'price', 'bedrooms'])
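A frame like example_df contains missing numerical values. Mean imputation of the kind that strategy="mean" suggests can be sketched with scikit-learn's SimpleImputer. Whether miss_data uses SimpleImputer internally is an assumption, and the train and test frames below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames with missing numerical values.
x_train = pd.DataFrame(
    {"size": [4500.0, 3450.0, np.nan], "price": [np.nan, 350_000.0, 800_000.0]}
)
x_test = pd.DataFrame({"size": [np.nan, 3000.0], "price": [500_000.0, np.nan]})

# Learn the column means from the training data only,
# then reuse those same means to fill the test data.
imputer = SimpleImputer(strategy="mean")
x_train_imputed = pd.DataFrame(imputer.fit_transform(x_train), columns=x_train.columns)
x_test_imputed = pd.DataFrame(imputer.transform(x_test), columns=x_test.columns)

print(x_train_imputed)
print(x_test_imputed)
```

Fitting the imputer on the training data alone keeps information from the test set out of the preprocessing step.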
- eda_report = eda(example_df, target)
- x_train, x_test = miss_data(x_train, x_test, strategy="mean")
- results = baseline_fun(X_train, y_train, type="regression", metrics_1="accuracy", metrics_2="r2")
- best_features = feature_select(X_train, y_train, threshold=0.05)
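Forward selection, the technique named for feature_select, starts with no features and greedily adds the one that most improves a cross-validated score. The sketch below uses scikit-learn's SequentialFeatureSelector on synthetic data. It is an illustration only: it stops at a fixed number of features and does not reproduce the threshold argument of feature_select, whose exact meaning is not described here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(
    rng.normal(size=(200, 4)), columns=["a", "b", "noise_1", "noise_2"]
)
y_train = 3.0 * X_train["a"] - 2.0 * X_train["b"] + rng.normal(scale=0.1, size=200)

# Greedily add one feature at a time, keeping the feature that gives the
# best cross-validated score, until two features have been selected.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X_train, y_train)

# Names of the columns that the selector kept.
selected = list(X_train.columns[selector.get_support()])
print(selected)
```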
The official documentation is hosted on Read the Docs: https://easysklearn.readthedocs.io/en/latest/
Development leaders:
- Ifeanyi Anene
- Lara Habashy
- Sakshi Jain
- Zhenrui Yu
We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab. If you would like to contribute, please read our contributing guidelines and familiarize yourself with the GitHub flow workflow.
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.