This tutorial will help you get started with the data science production container. First, the framework and its components are explained. Second, a hands-on exercise walks you through getting a predictive model into a running Docker container.
The framework helps you create a simple HTTP API that exposes predictive models. It has a single endpoint, /predict, and is built on top of Flask. It defines a number of abstract classes which need to be implemented and passed to an API object:
- Abstract class Model: handles the predictive model. The idea is to load your model from some source (in the example below we load a pickled file from disk) and use it to do predictions. Training the model is therefore out of scope for this framework and should be done beforehand.
- Abstract class FeatureExtractor: transforms the HTTP request body into a feature vector usable by the model. An implementation could simply pass the body through as a whole, perform a database lookup for additional features, etc.
- The API class: the actual Flask wrapper, constructed from a Model and a FeatureExtractor implementation. On a POST call to the /predict endpoint, the message body is parsed to a dict and sent to the feature extractor; the resulting features are then passed to the model to do the actual prediction, which is finally returned to the client.
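The exact definitions live in ds_prod_api.abstracts, but conceptually the two abstract classes look roughly like this (a sketch; the method names follow the exercise below):
from abc import ABC, abstractmethod

class Model(ABC):
    @abstractmethod
    def load(self):
        # Initialize the model, e.g. unpickle it from disk.
        ...

    @abstractmethod
    def predict(self, features):
        # Return a prediction for the given feature vector.
        ...

    @abstractmethod
    def default(self):
        # Return a fallback value when handling the request fails.
        ...

class FeatureExtractor(ABC):
    @abstractmethod
    def get_features(self, request_body_dict):
        # Turn the parsed request body into features for the model.
        ...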
This exercise will help you get started using the framework. The goal is to bring a house price prediction model into production.
Building the model is beyond the scope of this exercise. Therefore a simple regression model was created based on a Kaggle dataset. It was built with the scikit-learn package and can be found in the Exercise\models directory as a pickle file.
The model predicts the price of a house based on two features: the lot area and the year the house was built. This is obviously not the most accurate model, but it is sufficient to demonstrate how the framework works.
This exercise is based on the Exercise directory, in which a number of files have been prepared. All steps will be explained during the exercise.
The main.py file will be used for the implementation of the framework. To implement it, we first need to import the framework definitions. At the top of the file, add the import statements:
from ds_prod_api.apis.FlaskApi import FlaskApi
from ds_prod_api.abstracts import FeatureExtractor
from ds_prod_api.abstracts import Model
Before creating the API we need to implement a Model class and a FeatureExtractor; start with the Model.
In the main.py file, add a new class definition which implements Model, and implement its three abstract functions:
- load(self): initializes the model. In our implementation we need to unpickle the model from the models directory.
- predict(self, features) -> prediction: called on incoming API calls. Implement it such that the feature vector is passed to the predict function of the unpickled model, and return the result.
- default(self) -> default_value: called when an error occurs while handling the API call; it should return a default value (e.g. the average house price).
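A minimal sketch of such a Model implementation could look like this (the pickle filename and the default value are assumptions; adjust them to the actual file in Exercise\models):
import pickle

class BaselineLmModel(Model):
    def load(self):
        # Unpickle the pre-trained scikit-learn regression model from disk.
        # The filename is an assumption; use the actual file in the models directory.
        with open("models/house_price_model.pkl", "rb") as f:
            self._model = pickle.load(f)

    def predict(self, features):
        # Pass the feature vector to the scikit-learn model and return its prediction.
        return self._model.predict(features)

    def default(self):
        # Fallback when the request cannot be handled,
        # e.g. an average house price (hypothetical value).
        return 180000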
Next, implement the FeatureExtractor abstract class. Also implement its one abstract function:
- get_features(self, request_body_dict) -> features: called on every API call to create a feature vector; the parsed request body is passed to this function. Implement it such that a pandas DataFrame is created from the request body (hint: from pandas import DataFrame as df and use the df.from_dict() function).
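A minimal sketch of such a FeatureExtractor implementation (the class name matches the one used in the snippet below):
from pandas import DataFrame as df

class BaselineFeatureExtractor(FeatureExtractor):
    def get_features(self, request_body_dict):
        # Turn the parsed request body, e.g. [{"LotArea": 200, "YearBuilt": 1978}],
        # into a pandas DataFrame the model can consume.
        return df.from_dict(request_body_dict)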
Now that all necessary components are implemented, we can create and run the API. Create an object for the implemented model and feature extractor and pass these to the constructor of the API:
model = BaselineLmModel()
extractor = BaselineFeatureExtractor()
model.load()
api = FlaskApi(model, extractor)
api.run()
Add the dependencies of your project to the environment.yaml file. This file is used when building the Docker image to create a virtual environment. It will look something like this:
name: ds_prod
dependencies:
  - flask
  - pip:
    - sklearn
    - dill
    - numpy
    - pandas
    - scipy
    - git+https://github.com/BigDataRepublic/bdr-engineering-stack.git@Feature/model_docker#subdirectory=data-science-production-container/ds_prod_api
We have now implemented all components and can build the Docker image. On the command line, run:
docker build . -t my-first-ds-prod-container
During the Docker build a number of steps are taken. The base image for this container is the Miniconda Docker image, which has conda preinstalled. Then all files from the directory are added to the container, and a conda environment is created based on the environment.yaml file.
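Conceptually, the Dockerfile performs something like the following (a sketch only; the actual Dockerfile in the Exercise directory may differ in its base image tag and run command):
# Base image with conda preinstalled.
FROM continuumio/miniconda3

# Add all files from the directory to the container.
COPY . /app
WORKDIR /app

# Create the conda environment from environment.yaml.
RUN conda env create -f environment.yaml

# Expose the API port and start the application inside the environment.
EXPOSE 5000
CMD ["conda", "run", "-n", "ds_prod", "python", "main.py"]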
Run the Docker image and expose port 5000:
docker run -p 5000:5000 my-first-ds-prod-container
Send a POST call to the API and check the response. The model takes two features: LotArea and YearBuilt.
curl \
-H "Content-Type: application/json" \
-X POST \
-d '[{"LotArea": 200,"YearBuilt":1978}]' \
http://localhost:5000/predict
Alternatively, you can use a GUI client like Postman.