As can be seen in the code, the first version includes no original work; instead, I simply followed the logistic regression implementation in this Kaggle Kernel provided by Baligh Mnassri. The model is deployed to a free hosted Heroku application, which can be found here. The API provides a prediction for the parameters supplied in the URL query string (by default, missing values are imputed for any parameters that are not provided). Some simple examples include:
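For illustration only, a request could look like the following sketch; the app name and the feature names below are placeholders, not the actual deployed endpoint:

```python
# Hypothetical example of calling the prediction API. The Heroku app name
# and the parameter names are placeholders, not the real ones.
import requests

# Features are passed as URL query-string parameters; any parameter that
# is omitted is imputed server-side with the stored replacement values.
resp = requests.get(
    "https://<app-name>.herokuapp.com/predict",
    params={"Pclass": 3, "Sex": "male", "Age": 22},
)
print(resp.json())
```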
The following list includes the required methods and steps; ideally, each should be independent and coded in a separate method. The list is sorted neither by importance nor by strict implementation order, although I tend to follow it roughly as written. Personally, I think this is an iterative process that requires going back and forth between the steps many times:
- Documentation of all: It is done in a `pmd` file that dynamically produces the HTML documentation.
- Reading data: This is simply done by `pandas.read_csv` in this repo, but a method that can read data from different kinds of sources (CSV files or otherwise) is likely required (see the read/split sketch after this list).
- Data splitting (training vs. test set): Here we only work with training data that is provided offline and only once.
- Data exploration: Baby steps, such as seeing the column names, data types, etc. (a short exploration sketch follows this list).
- Data exploration: Plots of each variable, the correlations between columns, etc. These should be included in the documentation.
- Calculating replacements for missing values: This should likely be done for all variables, even if nothing is missing in the training data.
- Store the replacement values in a JSON file (see the imputation sketch after this list).
- Preprocessing data (All the following should ideally be independent, even though they are included in the same method at the moment):
- Imputing missing values (using the JSON file produced before)
- Feature engineering.
- Dropping unnecessary variables.
- Feature selection: Done manually for now; a method to automate the process is required (a possible automation sketch follows this list).
- Training model:
- Creation of the model
- Storing the model in a pickle file (in an independent script/process; see the sketch after this list)
- Evaluation of model: Choosing performance criteria
- Calculate performance metrics
- Store performance metrics for each model iteration in an independent script/process (see the metrics-logging sketch after this list)
- An API endpoint to get new (raw) data points and provide a prediction (a minimal endpoint sketch follows this list)
- Make every part of the project as independent as possible (then combine them all in a shell script)
- Store some data points at this point, so that after each iteration we know what is happening
- Generalize this approach with a very simple initial model that will work on any training data, so that even the first version of the implementation predicts some results (needless to say, nothing perfect). This also means that from the first time we deploy the algorithm, it will be available even if the results are not good at all.
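The sketches below expand on some of the steps above. All of them are illustrative only: file names, column names, and library choices are my assumptions, not necessarily the repo's actual code. First, reading and splitting the data:

```python
# Sketch of an independent read/split step. "train.csv" and "Survived"
# are assumed names, not necessarily the ones used in this repo.
import pandas as pd
from sklearn.model_selection import train_test_split

def read_data(path: str) -> pd.DataFrame:
    """Read a dataset; CSV only for now, other sources could be added here."""
    return pd.read_csv(path)

def split_data(df: pd.DataFrame, target: str = "Survived"):
    """Split into training and test sets, stratified on the target."""
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_test, y_train, y_test = split_data(read_data("train.csv"))
```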
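The first exploration baby steps could be as simple as:

```python
# Sketch of initial data exploration; nothing project-specific assumed.
import pandas as pd

df = pd.read_csv("train.csv")          # assumed file name
print(df.dtypes)                       # column names and data types
print(df.describe(include="all"))      # summary statistics per column
print(df.isna().sum())                 # missing values per column
print(df.corr(numeric_only=True))      # correlations between numeric columns
```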
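For missing values, one plausible implementation (median for numeric columns, mode otherwise, which is my assumption of a reasonable default) computes the replacements once, stores them in JSON, and reuses them for imputation later:

```python
# Sketch: compute a replacement for every column, persist it as JSON,
# and impute with the stored values later (including in the API).
import json
import pandas as pd

def compute_replacements(df: pd.DataFrame) -> dict:
    replacements = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            replacements[col] = float(df[col].median())
        else:
            replacements[col] = df[col].mode().iloc[0]
    return replacements

def store_replacements(replacements: dict, path: str = "replacements.json"):
    with open(path, "w") as f:
        json.dump(replacements, f)

def impute(df: pd.DataFrame, path: str = "replacements.json") -> pd.DataFrame:
    with open(path) as f:
        replacements = json.load(f)
    return df.fillna(value=replacements)
```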
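Automating feature selection could be done, for example, with scikit-learn's univariate selectors; SelectKBest is just one possible approach, not necessarily what this project will end up using:

```python
# Sketch of automated feature selection. Assumes X is a DataFrame with
# only numeric (already encoded) features; k is an arbitrary default.
from sklearn.feature_selection import SelectKBest, f_classif

def select_features(X, y, k: int = 5) -> list:
    """Return the names of the k features scoring highest against y."""
    selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])
```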
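Training and storing the model as an independent step might look like this; logistic regression mirrors the baseline from the Kaggle kernel, and the pickle file name is an assumption:

```python
# Sketch of the train-and-store step, meant to run as its own script.
import pickle
from sklearn.linear_model import LogisticRegression

def train_model(X_train, y_train) -> LogisticRegression:
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model

def store_model(model, path: str = "model.pkl"):
    with open(path, "wb") as f:
        pickle.dump(model, f)
```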
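Performance metrics for each iteration could be appended to a log file so every model version stays comparable; the metric choice and the file format here are assumptions:

```python
# Sketch: compute a few classification metrics and append one JSON line
# per model iteration to a log file.
import json
from datetime import datetime
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(model, X_test, y_test) -> dict:
    preds = model.predict(X_test)
    return {
        "timestamp": datetime.now().isoformat(),
        "accuracy": float(accuracy_score(y_test, preds)),
        "f1": float(f1_score(y_test, preds)),
        "roc_auc": float(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])),
    }

def log_metrics(metrics: dict, path: str = "metrics.jsonl"):
    # One JSON object per line, one line per model iteration.
    with open(path, "a") as f:
        f.write(json.dumps(metrics) + "\n")
```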
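Finally, a minimal endpoint sketch in Flask; the route, the feature list, and the artifact names are hypothetical, and type conversion/encoding of the raw query-string values is omitted for brevity:

```python
# Sketch of a prediction endpoint that reads raw features from the URL
# query string. FEATURES and the file names are hypothetical.
import json
import pickle
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model and the stored replacement values once at startup.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
with open("replacements.json") as f:
    REPLACEMENTS = json.load(f)

FEATURES = ["Pclass", "Sex", "Age", "Fare"]  # hypothetical feature list

@app.route("/predict")
def predict():
    # Absent query parameters become None, so fillna can replace them
    # with the stored imputation values.
    row = {name: request.args.get(name) for name in FEATURES}
    df = pd.DataFrame([row]).fillna(value=REPLACEMENTS)
    return jsonify({"prediction": int(model.predict(df)[0])})
```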