xgboost_evaluator

SIG TFX-Addons

Project Proposal

Your name: Daniel Kim

Your email: [email protected]

Your company/organization: Twitter

Project name: XGBoost Evaluator Component

Project Description

Add support for evaluating XGBoost models by extending the standard Evaluator component. Add an example pipeline that trains, evaluates, and pushes an XGBoost model to CAIP.

Project Category

Component + Example

Project Use-Case(s)

This project can be used whenever customers wish to evaluate XGBoost models within a TFX pipeline, in order to obtain the various benefits and functionalities that TFX supports.

Project Implementation

To make the Evaluator work with XGBoost models, we can customize it by providing a Python module that handles model loading and prediction extraction, as described in the sections below.

From Trainer to Evaluator: save and load XGBoost model

Option 1 (chosen): working with XGBoost library directly

The XGBoost library provides a few different ways to save a model (an xgb.Booster or xgb.sklearn.XGBModel object). Backward compatibility is guaranteed in most cases. Currently, the 2 main supported formats are:

  • XGBoost internal binary format. Note that auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using the binary format.
  • JSON: newer format aiming to replace the binary format

For maximum compatibility, we want the Trainer component to write both formats to the expected output directory. We can provide a helper function that takes a Booster object and writes model.bin and model.json to that directory. The Evaluator uses the latest version of the xgboost library to read model.json; this will be implemented in the UDF custom_eval_shared_model(). This way, we can expect the loaded model object to retain most of the necessary information.
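The helper described above could look roughly like the sketch below. The function name save_booster_both_formats is hypothetical. The only xgboost call it relies on is Booster.save_model, which in the 1.x releases this proposal targets writes JSON when the file name ends in .json and the internal binary format otherwise. Newer releases may treat the .bin extension differently, so the binary branch should be read as an assumption.

```python
import os


def save_booster_both_formats(booster, output_dir):
    """Write `booster` to model.json and model.bin under `output_dir`.

    `booster` is expected to be an xgb.Booster (or any object exposing
    save_model(path)). The function name is hypothetical, not part of TFX.
    """
    os.makedirs(output_dir, exist_ok=True)
    paths = {
        "json": os.path.join(output_dir, "model.json"),
        "binary": os.path.join(output_dir, "model.bin"),
    }
    for path in paths.values():
        # In xgboost 1.x the file extension selects the output format.
        booster.save_model(path)
    return paths
```

In a Trainer run_fn this might be called as save_booster_both_formats(booster, fn_args.serving_model_dir), and the Evaluator side could read the JSON file back with xgb.Booster(model_file=...).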

Option 2: using sklearn Pipeline

import xgboost as xgb
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import StandardScaler

# params, x_train and y_train are assumed to be defined by the caller.
classifier = xgb.XGBClassifier(**params)
model = SkPipeline([
    ('scaler', StandardScaler()),
    ('classifier', classifier),
])
model.fit(x_train, y_train)

# You can choose to save just the fitted XGBClassifier:
model.named_steps['classifier'].save_model(...)

However, it’s more likely that you’ll need the whole sklearn Pipeline in downstream evaluation. There are two ways to save and load an sklearn Pipeline object:

Using joblib:

import joblib
# model is the fitted sklearn Pipeline from above.
joblib.dump(model, 'model.joblib')

Note that CAIP asks users to use sklearn.externals.joblib rather than the bare joblib, but newer versions of sklearn have deprecated sklearn.externals.

Using pickle:

import pickle
# model is the fitted sklearn Pipeline from above.
with open('model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

The main downside of working with an sklearn Pipeline is the potential loss of portability; this is discussed further in the next section.
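For completeness, loading the artifact back mirrors the save step. The sketch below uses only the standard-library pickle module; the helper names are hypothetical, and joblib.load('model.joblib') would play the same role for the joblib variant. Unpickling can execute code embedded in the file, so it should only be done on artifacts produced by a trusted Trainer.

```python
import pickle


def save_pickled(obj, path):
    """Serialize any picklable object (e.g. a fitted sklearn Pipeline)."""
    with open(path, "wb") as model_file:
        pickle.dump(obj, model_file)


def load_pickled(path):
    """Load the object back; only use on files from a trusted source."""
    with open(path, "rb") as model_file:
        return pickle.load(model_file)
```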

Custom Extractor for the Evaluator

The Evaluator component will also use a custom prediction extractor, which loads and runs our EvalSharedModel(s) on the given examples. XGBoost models cannot accept tf.Examples as input, so the examples will have to be converted within the extractor.
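The conversion step can be illustrated without TensorFlow, assuming each tf.Example has already been decoded into a dict that maps feature names to single numeric values. The function name and the dict layout are assumptions made for this sketch; the real extractor would work on whatever representation TFMA hands it. Missing features become NaN, which xgboost treats as a missing value by default.

```python
import math


def examples_to_rows(examples, feature_names):
    """Turn decoded examples into dense rows in a fixed column order.

    examples: iterable of dicts mapping feature name -> numeric value
      (an assumed, already-decoded stand-in for tf.Example).
    feature_names: column order the model was trained with.
    """
    rows = []
    for example in examples:
        row = []
        for name in feature_names:
            value = example.get(name)
            # A missing feature is encoded as NaN.
            row.append(math.nan if value is None else float(value))
        rows.append(row)
    return rows
```

The resulting rows can be wrapped in a NumPy array and passed to xgb.DMatrix before calling predict.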

Our custom prediction extractor covers converting data into formats that xgboost can accept, extracting the necessary features, running the actual prediction, and the framework code that supports these operations. It will be passed into the tfma.default_extractors function (https://tensorflow.google.cn/tfx/model_analysis/api_docs/python/tfma/default_extractors) for use in the Evaluator.

Currently, we plan to support running the Evaluator with Apache Beam through the use of a customized prediction DoFn to load, process, and run predictions on models, and a simple pipeline wrapper that calls this function on extracts.
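A rough sketch of what such a prediction DoFn might look like follows. The class name, the "features" and "predictions" extract keys, and the injected load_model_fn are all assumptions for illustration, not the proposed implementation. The loaded object is assumed to expose predict(rows); a real xgb.Booster expects a DMatrix, so a thin adapter would be needed. The import guard only exists so the sketch also runs where Apache Beam is not installed.

```python
try:
    import apache_beam as beam
    _DoFnBase = beam.DoFn
except ImportError:  # Lets the sketch run where Beam is not installed.
    _DoFnBase = object


class XGBoostPredictDoFn(_DoFnBase):
    """Hypothetical DoFn: load the model once per instance, then predict."""

    def __init__(self, load_model_fn, feature_names):
        super().__init__()
        self._load_model_fn = load_model_fn  # e.g. a function reading model.json
        self._feature_names = feature_names
        self._model = None

    def setup(self):
        # Beam calls setup() once per DoFn instance before processing.
        self._model = self._load_model_fn()

    def process(self, extract):
        # "features" and "predictions" are assumed extract keys.
        features = extract["features"]
        row = [float(features[name]) for name in self._feature_names]
        result = dict(extract)  # Copy so the input extract is not mutated.
        result["predictions"] = self._model.predict([row])[0]
        yield result
```

In a pipeline this would be applied with beam.ParDo(XGBoostPredictDoFn(...)) inside the transform that the extractor wraps.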

The actual implementation of the custom prediction extractor depends on whether it receives a native XGBoost serialized model (option 1 from above), or a pickled sklearn Pipeline (option 2 from above). Here are pros and cons of each option.

Option 1 (chosen):

  • Pros:
    • Universal among the various XGBoost interfaces (Python, JVM, C++, etc.)
    • Some level of backward compatibility is guaranteed
    • Still retains attributes such as feature_names, feature_types, etc. (in newer xgboost versions)
  • Cons:
    • Lacks the ability to combine with, or substitute in, other sklearn models

Option 2:

  • Pros:

    • Another wrapping layer means more flexibility: you can add pre-processing and post-processing steps to the sklearn Pipeline, try out other types of models, etc.
    • Most of the code needed for sklearn-compatible Trainer and Evaluator in the penguin sklearn pipeline can be reused
  • Cons:

    • Extra dependency on sklearn
    • Relies on the Python pickle standard library or joblib, both of which are specific to Python
    • No guarantee of backward compatibility

Summary: we will go with option 1 for simplicity and broader compatibility across different XGBoost interfaces.

Open question: from a training-performance viewpoint, is there a difference between using native xgboost and using an sklearn Pipeline?

Testing and Example Pipeline

In the same spirit as https://github.com/tensorflow/tfx-addons/blob/main/proposals/20210404-sklearn_example.md, we will add an example pipeline that runs locally, and another version that runs on GCP using Vertex AI Pipelines.

The example pipeline will have its own end-to-end local unit test.

The model can be pushed to CAIP. The current CAIP runtime version 2.5 runs XGBoost 1.4.0. Training, serializing, and deserializing XGBoost models with different versions of the library is allowed; in other words, the XGBoost library guarantees some level of backward compatibility.

This example pipeline will not be packaged; instead, users only need to clone the source code to run it.

Project directories

  • tfx_addons/xgboost_evaluator (evaluator code and tests)
  • examples/xgboost_penguins (example pipelines and tests)

Project Dependencies

  • xgboost>=1.4.0

References

Project Team

Daniel Kim, kindalime, [email protected]

Vincent Nguyen, cent5, [email protected]