Support Table objects for single-row inputs. (Is this efficient?)
Support sparse categorical data in Table objects.
Generalize format.py to optionally create sparse categorical data.
Support sparse categorical data in training.
Support sparse categorical data in serving.
Why?
Categorical features with many categories are useful for modeling random effects, e.g. modeling the zip code of voters, or the clinician id of medical diagnoses.
Also, supporting two feature types (initially two very similar ones) will make it much easier to add a third type, such as real/normal data in #22.
How?
The main bottleneck in TreeCat is space and time used to process internal ragged data:
categoricals are represented as one-hot vectors in a ragged array, hence take O(#cats) space
these one-hot vectors are processed cell-by-cell in both training and serving
To reduce these costs, the internal format can be split from one datatype (multinomial) into two datatypes (multinomial and categorical), where categorical data is restricted to zero or one observation per cell. Some plumbing already exists to pass a feature_types vector to the trainer, and to represent internal data as a Table object with heterogeneous data.
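The space difference described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the `-1` sentinel for "unobserved", and the example feature sizes are all hypothetical and are not taken from TreeCat's code. The dense encoding stores one cell per category in a ragged layout, while the sparse encoding stores a single integer per feature, which is only possible because categorical data has at most one observation.

```python
# Hypothetical sketch; none of these names come from TreeCat itself.


def make_ragged_index(num_categories):
    """Return offsets so that feature v owns cells [index[v], index[v + 1])."""
    index = [0]
    for count in num_categories:
        index.append(index[-1] + count)
    return index


def encode_one_hot(row, ragged_index):
    """Dense encoding: one cell per category, O(#cats) space per feature."""
    cells = [0] * ragged_index[-1]
    for v, value in enumerate(row):
        if value is not None:  # None marks an unobserved feature
            cells[ragged_index[v] + value] = 1
    return cells


def encode_sparse(row):
    """Sparse encoding: one integer per feature; -1 marks unobserved."""
    return [-1 if value is None else value for value in row]


def decode_one_hot(cells, ragged_index):
    """Recover category indices by scanning each feature's block of cells."""
    row = []
    for beg, end in zip(ragged_index, ragged_index[1:]):
        block = cells[beg:end]
        row.append(block.index(1) if 1 in block else None)
    return row


# A small feature, a zip-code-like feature with many categories, and a binary one.
num_categories = [3, 40000, 2]
row = [2, 12345, None]

ragged_index = make_ragged_index(num_categories)
dense = encode_one_hot(row, ragged_index)
sparse = encode_sparse(row)

print(len(dense), len(sparse))
```

Both encodings carry the same information for this row, but the dense one grows with the total number of categories and must be scanned cell by cell, whereas the sparse one grows only with the number of features.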
training.py to use Table objects.
serving.py to use Table objects.
Table objects for multi-row inputs.