SortingHatInf is a library that implements ML-based feature type inference, as described in the SortingHat paper. Feature type inference is the task of predicting the feature type of each column in a given dataset. The library predicts one of the following SortingHat feature types for each column:
numeric
categorical
datetime
sentence
url
embedded-number
list
not-generalizable
context-specific
The expanded feature types are the same as the SortingHat types, except:
- numeric is mapped to integer or floating
- categorical is mapped to boolean if the column is Boolean
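As a rough illustration of the mapping above, the following sketch expands a single SortingHat type using the column's values. The function name `expand_type`, the lowercase label strings, and the rules used to tell integer from floating and to detect Boolean columns are all assumptions made for this example; they are not taken from the library's implementation.

```python
from typing import Any, Iterable


def expand_type(sh_type: str, values: Iterable[Any]) -> str:
    """Hypothetical mapping from a SortingHat type to an expanded type."""
    # Ignore missing entries (None and NaN; NaN is the only value != itself).
    present = [v for v in values if v is not None and v == v]

    if sh_type == "numeric":
        # Assumption: a numeric column is "integer" when every value is whole.
        all_whole = all(float(v).is_integer() for v in present)
        return "integer" if all_whole else "floating"

    if sh_type == "categorical":
        # Assumption: a categorical column is "boolean" when it holds only bools.
        only_bools = bool(present) and all(isinstance(v, bool) for v in present)
        return "boolean" if only_bools else "categorical"

    # Every other SortingHat type is passed through unchanged.
    return sh_type
```

The sketch works on plain Python values. For real data, `get_expanded_feature_types()` performs the mapping, so a helper like this is only useful for understanding the idea.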
Loose ARFF types:
Nominal-specification (Categorical)
INTEGER
REAL (Float)
STRING
IGNORE (Not-Generalizable)
get_sortinghat_types(df: pd.DataFrame) -> List[str]
Returns a list of the predicted SortingHat feature types for the columns of the specified Pandas dataframe.
Ex. infer_sh = sortinghatinf.get_sortinghat_types(df)
> infer_sh
> [
> 'COL_TYPE_1',
> 'COL_TYPE_2',
> ...
> ]
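Because the result is a plain list, it can help to pair each prediction with its column name. The sketch below assumes the returned list follows the column order of the dataframe; the helper name `label_columns` and the sample values are hypothetical.

```python
from typing import Dict, List


def label_columns(column_names: List[str], predicted_types: List[str]) -> Dict[str, str]:
    """Pair each column name with its predicted type (assumes matching order)."""
    if len(column_names) != len(predicted_types):
        raise ValueError("expected exactly one predicted type per column")
    return dict(zip(column_names, predicted_types))


# Placeholder inputs standing in for list(df.columns) and the returned list.
columns = ["COL_NAME_1", "COL_NAME_2"]
predicted = ["COL_TYPE_1", "COL_TYPE_2"]
labels = label_columns(columns, predicted)
```

With a real dataframe, the call would be `label_columns(list(df.columns), infer_sh)`.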
get_expanded_feature_types(df: pd.DataFrame) -> List[str]
Returns a list of the predicted SortingHat feature types for the columns of the specified Pandas dataframe, mapped to the expanded types.
Ex. infer_exp = sortinghatinf.get_expanded_feature_types(df)
> infer_exp
> [
> 'COL_TYPE_1',
> 'COL_TYPE_2',
> ...
> ]
get_feature_types_as_arff(df: pd.DataFrame) -> Tuple[List[Tuple[str, Union[str, List[str]]]], List[str]]
Returns the predicted SortingHat feature types mapped to the loose ARFF types, together with the original predicted SortingHat feature types.
Ex. infer_arff, infer_sh = sortinghatinf.get_feature_types_as_arff(df)
> infer_arff
> [
> ('COL_NAME_1', ['POSSIBLE_VALUE_1', 'POSSIBLE_VALUE_2', ...]), # NOMINAL
> ('COL_NAME_2', 'INTEGER'), # INTEGER
> ('COL_NAME_3', 'FLOAT'), # REAL
> ('COL_NAME_4', 'STRING'), # STRING
> ('COL_NAME_5', 'IGNORE'), # IGNORE
> ...
> ]
Note: Because ARFF expects a string list for categorical features, columns discovered to be categorical should be converted to string. This function will report these columns with an error.
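One way to prepare a dataframe for this is to cast the columns predicted as categorical to strings first. The sketch below is a minimal example under stated assumptions: the helper name `stringify_categorical` is hypothetical, and the label string used for categorical predictions is passed as a parameter because its exact spelling in the library's output is not confirmed here.

```python
from typing import List

import pandas as pd


def stringify_categorical(
    df: pd.DataFrame,
    sh_types: List[str],
    categorical_label: str = "categorical",  # assumed spelling of the label
) -> pd.DataFrame:
    """Return a copy of df with the predicted-categorical columns cast to str."""
    out = df.copy()  # leave the caller's dataframe untouched
    for col, sh_type in zip(out.columns, sh_types):
        if sh_type == categorical_label:
            out[col] = out[col].astype(str)
    return out
```

A typical flow would call `get_sortinghat_types(df)` first, pass the result to this helper, and then call `get_feature_types_as_arff()` on the converted dataframe. Depending on the pandas version, `astype(str)` may turn missing values into the literal string `nan`, so those may need separate handling.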
Here, we run feature type inference on a dataset obtained from OpenML. Note: this can be done with any dataset loaded as a Pandas dataframe, but we use OpenML here as an example.
- First, ensure pip, wheel, and setuptools are up-to-date.
python -m pip install --upgrade pip setuptools wheel
- Install the package using python-pip.
pip install sortinghatinf
- Import the library.
import sortinghatinf
- Install the OpenML python API.
pip install openml
- Import the OpenML python library.
import openml
- Load the 'Blood Transfusion Service Center' dataset from OpenML (dataset_id=31). Note: This requires an OpenML account, which you can set up by following this link.
data = openml.datasets.get_dataset(dataset_id=31)
X, _, _, _ = data.get_data() # Loaded as Pandas dataframe
- Infer the feature types for the data columns.
# Infer the SortingHat feature types.
infer_sh = sortinghatinf.get_sortinghat_types(X)
# Infer the expanded feature types.
infer_exp = sortinghatinf.get_expanded_feature_types(X)
# Infer the ARFF feature types.
# The function `get_feature_types_as_arff()` also returns the SortingHat feature types.
infer_arff, infer_sh = sortinghatinf.get_feature_types_as_arff(X)
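Once the types are inferred, a common next step is to group the columns by predicted type, for example to route each group to a different preprocessing step. The following is a minimal sketch; the helper name `group_columns_by_type` is hypothetical, and it assumes the predictions are in the same order as the dataframe's columns.

```python
from collections import defaultdict
from typing import Dict, List


def group_columns_by_type(
    column_names: List[str], predicted_types: List[str]
) -> Dict[str, List[str]]:
    """Collect column names under each predicted feature type."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, feature_type in zip(column_names, predicted_types):
        groups[feature_type].append(name)
    return dict(groups)
```

With the variables from the steps above, this would be called as `group_columns_by_type(list(X.columns), infer_sh)`.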