Support categorical data with many categories #29

Open
1 of 8 tasks
fritzo opened this issue Sep 14, 2017 · 0 comments

fritzo commented Sep 14, 2017

  • Refactor training.py to use Table objects.
  • Refactor serving.py to use Table objects.
    • Support Table objects for multi-row inputs.
    • Support Table objects for single-row inputs. (Is this efficient?)
  • Support sparse categorical data in Table objects.
  • Generalize format.py to optionally create sparse categorical data.
  • Support sparse categorical data in training.
  • Support sparse categorical data in serving.

Why?

Categorical features with many categories are useful for modeling random effects, e.g. the zip code of voters or the clinician ID of medical diagnoses.

Also, supporting two feature types (initially two very similar ones) will make it much easier to add a third type, such as real/normal data in #22.

How?

The main bottleneck in TreeCat is the space and time used to process its internal ragged data:

  • categoricals are represented as one-hot vectors in a ragged array, hence take O(#cats) space
  • these one-hot vectors are processed cell-by-cell in both training and serving
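
To make the cost concrete, here is a minimal sketch comparing a one-hot ragged row with a one-code-per-feature row. All names (`make_ragged_index`, `one_hot_row`, `sparse_row`) and the `-1` sentinel are assumptions made for illustration and are not taken from TreeCat's code.

```python
# Hypothetical illustration; not TreeCat's actual internal layout.

def make_ragged_index(num_categories):
    """Return the start offset of each feature's block, plus the total width."""
    ragged_index = [0]
    for n in num_categories:
        ragged_index.append(ragged_index[-1] + n)
    return ragged_index


def one_hot_row(values, num_categories):
    """Encode one row as concatenated one-hot blocks (None means unobserved)."""
    ragged_index = make_ragged_index(num_categories)
    row = [0] * ragged_index[-1]
    for feature, value in enumerate(values):
        if value is not None:
            row[ragged_index[feature] + value] = 1
    return row


def sparse_row(values):
    """Encode one row as a single integer code per feature (-1 means unobserved)."""
    return [-1 if value is None else value for value in values]


num_categories = [3, 2, 1000]  # e.g. the third feature is a zip code
values = [2, None, 417]

dense = one_hot_row(values, num_categories)
sparse = sparse_row(values)
print(len(dense), len(sparse))  # 1005 3
```

The one-hot row grows with the total number of categories, while the coded row grows only with the number of features, which is the saving the sparse format is after.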

To reduce these costs, the internal format can be split from one datatype (multinomial) into two datatypes (multinomial and categorical), where categorical data is restricted to zero or one observation. Some plumbing already exists to pass a feature_types vector to the trainer, and to represent internal data as a Table object with heterogeneous data.
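
One way the two datatypes could sit side by side is sketched below. This `Table` class, its fields, and the `MULTINOMIAL` / `CATEGORICAL` tags are hypothetical stand-ins, not the existing TreeCat `Table` API. The sketch assumes a multinomial cell holds a count per category and a categorical cell holds a single code or `None`.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical type tags; the real feature_types encoding may differ.
MULTINOMIAL = "multinomial"
CATEGORICAL = "categorical"


@dataclass
class Table:
    """Hypothetical heterogeneous table, for illustration only."""
    feature_types: List[str]
    num_categories: List[int]
    rows: List[list]

    def observation_count(self, row, feature):
        """Number of observations stored in one cell."""
        cell = self.rows[row][feature]
        if self.feature_types[feature] == MULTINOMIAL:
            return sum(cell)  # any number of observations
        return 0 if cell is None else 1  # zero or one observation

    def add_to_counts(self, row, feature, counts):
        """Accumulate one cell into a per-category count vector."""
        cell = self.rows[row][feature]
        if self.feature_types[feature] == MULTINOMIAL:
            for category, count in enumerate(cell):  # O(#cats) per cell
                counts[category] += count
        elif cell is not None:
            counts[cell] += 1  # O(1) per cell


table = Table(
    feature_types=[MULTINOMIAL, CATEGORICAL],
    num_categories=[3, 1000],
    rows=[[[0, 2, 1], 417], [[1, 0, 0], None]],
)
zip_counts = [0] * table.num_categories[1]
for r in range(len(table.rows)):
    table.add_to_counts(r, 1, zip_counts)
```

Dispatching on the feature type keeps the existing multinomial path intact while the categorical path touches one entry per cell instead of scanning every category.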
