MAHA is an in-progress ETL package that uses machine learning to clean your dataset with a single command. Features of MAHA include:
- Drops all index columns
- Drops columns with too many missing values
- Uses regression to predict missing values and fills them in
- Keeps the data in pandas DataFrame format
- Label-encodes all categorical variables
- Casts every column to its desired output data type
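As a rough illustration of the first few cleaning steps (not MAHA's actual implementation), the sketch below uses plain pandas and scikit-learn. The function names, the 50% missing-value threshold, and the rule that an "index column" is an integer column counting rows from 0 or 1 are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def drop_index_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop integer columns that only count rows (0..n-1 or 1..n). Assumed rule."""
    n = len(df)
    to_drop = []
    for col in df.columns:
        s = df[col]
        if n > 0 and pd.api.types.is_integer_dtype(s) and s.notna().all():
            values = s.to_numpy()
            if np.array_equal(values, np.arange(n)) or np.array_equal(values, np.arange(1, n + 1)):
                to_drop.append(col)
    return df.drop(columns=to_drop)


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds max_missing (assumed threshold)."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep]


def label_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode every non-numeric column, leaving missing values as NaN."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue
        mask = out[col].notna()
        codes = pd.Series(np.nan, index=out.index, dtype="float64")
        if mask.any():
            labels = out.loc[mask, col].astype(str).to_numpy()
            codes[mask] = LabelEncoder().fit_transform(labels)
        out[col] = codes
    return out


raw = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "city": ["a", "b", None, "a"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "x": [1.5, 2.5, 3.5, 4.5],
})
clean = label_encode(drop_sparse_columns(drop_index_columns(raw)))
print(clean)
```

Missing values are left as NaN here so that a later imputation step can fill them in.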
You can also:
- Find the mean and mode of every column
- Fill the NA values with the mean or mode of each column, depending on its data type
- Fit a model for every column, using all other columns as the independent variables
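The helpers listed above could look roughly like the following sketch. It is not MAHA's own API: the function names are hypothetical, `LinearRegression` stands in for whatever model MAHA actually fits, and the regression step assumes every column is already numeric (for example, after label encoding).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fill_with_mean_or_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NA values with the column mean (numeric) or mode (everything else)."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # nothing to compute a mean or mode from
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].mean()
        else:
            fill = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill)
    return out


def impute_with_regression(df: pd.DataFrame) -> pd.DataFrame:
    """For each column with gaps, fit a model on all other columns and predict the gaps.

    Assumes an all-numeric DataFrame. Predictors are first filled with the
    mean so the model never sees NaN (an assumption of this sketch).
    """
    out = df.copy()
    baseline = fill_with_mean_or_mode(df)
    for target in df.columns:
        missing = df[target].isna()
        if not missing.any() or missing.all():
            continue
        X = baseline.drop(columns=[target])
        model = LinearRegression().fit(X[~missing], df.loc[~missing, target])
        out.loc[missing, target] = model.predict(X[missing])
    return out


data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, 11.0],  # y = 2x + 1 with one gap
})
print(impute_with_regression(data))
```

Mean or mode filling is cheap but ignores relationships between columns. The regression variant uses those relationships, at the cost of fitting one model per incomplete column.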
MAHA uses a number of open source projects to work properly:
- NumPy - A Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- Pandas - A Python library for data manipulation and analysis.
- scikit-learn (imported as sklearn) - A machine learning library that includes various classification, regression and clustering algorithms.
MAHA requires pandas, NumPy and scikit-learn.
Use pip to install the packages:
$ pip3 install pandas
$ pip3 install numpy
$ pip3 install scikit-learn
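To confirm that the dependencies are importable after installation, a small check like the one below can help. The helper name `installed_versions` is made up for this example. Note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib


def installed_versions(modules=("pandas", "numpy", "sklearn")):
    """Return {module name: version string, or None if the import fails}."""
    versions = {}
    for name in modules:
        try:
            versions[name] = importlib.import_module(name).__version__
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```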
If pip is not installed, first download its installer:
$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Then run the following command in the directory where you downloaded get-pip.py:
$ python get-pip.py
Developed by: Mithesh R, Arth Akhouri, Heetansh Jhaveri, Ayaan Khan
Want to contribute? See the MAHA GitHub repository for more information.
License: MIT