Building towards preprocessing mushrooms.csv and scoring an XGBoost model

letsql/xgboost-udf-example

[WIP] DATAFUSION XGBOOST PREDICTION EXAMPLE

This example uses DataFusion to register two UDFs: onehot, which one-hot encodes a dictionary-encoded column, and predict, which scores the encoded columns with an XGBoost model.

SQL

SELECT predict(cap_shape,cap_surface,cap_color,bruises) as predictions FROM 
  (SELECT onehot(arrow_cast(cap_shape, 'Dictionary(Int32, Utf8)')) as cap_shape, 
          onehot(arrow_cast(cap_surface, 'Dictionary(Int32, Utf8)')) as cap_surface, 
          onehot(arrow_cast(cap_color, 'Dictionary(Int32, Utf8)')) as cap_color, 
          onehot(arrow_cast(bruises, 'Dictionary(Int32, Utf8)')) as bruises 
  FROM mushrooms);
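
To make the inner SELECT concrete, here is a minimal pure-Python sketch of what dictionary encoding followed by one-hot encoding does to a single column. It is an illustration only: the helper names `dictionary_encode` and `onehot_rows` are hypothetical, the sample values are made up, and the actual UDF's output layout is not described here, so the row-of-indicators layout is an assumption.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Split a column into (dictionary, keys).

    dictionary holds each distinct value once, in first-seen order;
    keys holds one integer per row pointing into dictionary.
    This mirrors the idea behind an Arrow Dictionary(Int32, Utf8) column.
    """
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    keys: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        keys.append(positions[value])
    return dictionary, keys


def onehot_rows(dictionary: List[str], keys: List[int]) -> List[List[int]]:
    """Turn each key into an indicator vector as wide as the dictionary."""
    width = len(dictionary)
    rows: List[List[int]] = []
    for key in keys:
        row = [0] * width
        row[key] = 1  # exactly one slot is set per row
        rows.append(row)
    return rows


# Made-up sample values standing in for a cap_shape column.
cap_shape = ["x", "b", "x", "s"]
dictionary, keys = dictionary_encode(cap_shape)
encoded = onehot_rows(dictionary, keys)
```

Because the keys are already integers, the one-hot step needs no string comparisons, which is one plausible reason for casting to a dictionary type before calling onehot.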

The predict UDF loads an already trained XGBoost model from disk.

BENCHMARKS

This benchmark one-hot encodes 4 columns into 22, scores 8124 rows from the Mushrooms dataset, and outputs a RecordBatch.
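
The jump from 4 columns to 22 follows from one-hot encoding producing one output column per distinct category. The short sketch below does that arithmetic. The per-column category counts are an assumption taken from the UCI Mushroom dataset description (6 cap shapes, 4 cap surfaces, 10 cap colors, 2 bruises values), and `onehot_width` is a hypothetical helper, not part of this repository.

```python
from typing import Dict

# Assumed number of distinct categories per column,
# as listed in the UCI Mushroom dataset description.
cardinalities: Dict[str, int] = {
    "cap_shape": 6,
    "cap_surface": 4,
    "cap_color": 10,
    "bruises": 2,
}


def onehot_width(counts: Dict[str, int]) -> int:
    """One-hot encoding emits one column per distinct category."""
    return sum(counts.values())


total_columns = onehot_width(cardinalities)
print(f"{len(cardinalities)} input columns -> {total_columns} one-hot columns")
```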

onehot UDF
Time 48.59 µs

As a naive comparison, get_dummies in Python/pandas takes ~1.97 ms, around 40x slower.
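
A pandas baseline of this kind could be timed roughly as follows. This is a sketch under stated assumptions: it uses synthetic data with the same shape (8124 rows, 4 categorical columns with assumed cardinalities 6, 4, 10, and 2) rather than mushrooms.csv, and it is not the script that produced the number above, so absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumed cardinalities; synthetic values stand in for the real dataset.
cardinalities = {"cap_shape": 6, "cap_surface": 4, "cap_color": 10, "bruises": 2}
n_rows = 8124

# Cycle through the categories so every category appears at least once.
data = {
    name: [f"{name}_{i % count}" for i in range(n_rows)]
    for name, count in cardinalities.items()
}
df = pd.DataFrame(data)

# One-hot encode every column; each distinct value becomes its own column.
encoded = pd.get_dummies(df, columns=list(df.columns))

# Take the best of several repeats to reduce noise from other processes.
runs = timeit.repeat(
    lambda: pd.get_dummies(df, columns=list(df.columns)),
    number=10,
    repeat=5,
)
best_seconds = min(runs) / 10
print(f"get_dummies: {best_seconds * 1e3:.2f} ms per call, shape {encoded.shape}")
```

Note that this measures string columns going into pandas, while the SQL above casts to a dictionary type first, so the two paths are not doing identical work; that is part of why the comparison is described as naive.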
