Partial learn #15

Closed
Slavenin opened this issue Jan 19, 2021 · 10 comments
Labels
question Further information is requested

Comments

@Slavenin

Hi!
I have a dataset of 900k records with 800 categories, but I can't train my model because 16 GB of RAM is not enough.
How can I train my model in parts?

@angrymeir
Collaborator

angrymeir commented Jan 19, 2021

Hi @Slavenin,

you can split up your training set and train on the parts sequentially.
I created a small Gist (which you can run in Colab) that shows it doesn't make a difference if you just use the plain train function.

However, I'm pretty sure that for more advanced training, such as hyperparameter search, this approach might not be applicable. Maybe @sergioburdisso could elaborate a bit on that 😇?
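For reference, a minimal sketch of that idea (x_train and y_train are assumed to be plain Python lists of documents and labels, and chunk_size is a hypothetical size you'd pick to fit your RAM):

from pyss3 import SS3

clf = SS3()

chunk_size = 100_000  # hypothetical; pick whatever fits in memory
for i in range(0, len(x_train), chunk_size):
    # each call keeps accumulating counts on top of the previous ones
    clf.train(x_train[i:i + chunk_size], y_train[i:i + chunk_size])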

@angrymeir angrymeir added the question Further information is requested label Jan 19, 2021
@Slavenin
Author

Thanks! It works.
But I get an error when printing the categories:
[screenshot of the error]

@sergioburdisso
Owner

Hi @Slavenin! That's weird, what type of labels are you working with? It would be nice if we could replicate this error locally so that we can fix it. Is it just a problem with the print_category_info() function? Did the rest of the code work well?

You can also use the learn function to train the model "incrementally". By using "learn" instead of "train" you can speed up the process by disabling the update of the model that is performed automatically after each call to train, as illustrated below:

# create a single "huge" document for each category by concatenating each of its documents
# then call the learn function for each one of the categories using "update=False"
clf.learn(huge_doc_cat_1, label_cat_1, update=False)
clf.learn(huge_doc_cat_2, label_cat_2, update=False)
....
clf.learn(huge_doc_cat_n, label_cat_n)  # <-- note that the last category shouldn't use "update=False" so that the model is finally updated

Of course, you can use a loop to implement the above code, I wrote it that way just to make the explanation simpler.
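A hedged loop version of the above (docs_by_cat is an assumed dict mapping each label to the concatenation of all its documents):

labels = list(docs_by_cat)
for i, label in enumerate(labels):
    # update the model only after the last category has been learned
    clf.learn(docs_by_cat[label], label, update=(i == len(labels) - 1))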

As pointed out by @angrymeir, when working with a big dataset, it is better to perform hyperparameter optimization on a sub-sample of the dataset. For instance, you can use the stratified k-fold function of sklearn and then work with just a single fold (subset) to optimize the model. (Note we're using "stratified" here to make sure at least one sample of each category is included in each split; in fact, it will try to fit the same number of samples for each category in each training subset/split/fold.)
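A sketch of that sub-sampling with sklearn's StratifiedKFold (x_train/y_train are assumed to be lists; note the stratified split requires every category to have at least n_splits samples):

from sklearn.model_selection import StratifiedKFold

# keep a single stratified fold (~10% of the data) for
# hyperparameter optimization
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
_, subset_idx = next(skf.split(x_train, y_train))

x_sub = [x_train[i] for i in subset_idx]
y_sub = [y_train[i] for i in subset_idx]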

Nevertheless, optimizing the current source code to be robust with respect to the size of the dataset is on the TODO list, especially in relation to the number of categories, for instance by using NumPy data structures (I have some work done in this regard but there's still work left to do).

(Thanks, @angrymeir for your valuable help, you rock buddy! 💪).

@angrymeir
Collaborator

Ah @sergioburdisso that makes sense!

@Slavenin If you can't provide the dataset, it might give some insight to see the output of the Vocab. Size per Category.
For that you could just replace the corresponding code block with: print(category[NAME], len(category[VOCAB]))

I could imagine that you have a sample s = (x, y) in your train set, where x = "" and s is the only sample in your dataset with label y, thus vocab.size(y) = 0 - @sergioburdisso, is that possible?
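A quick way to check that hypothesis (a sketch, assuming x_train/y_train are lists):

from collections import Counter

label_counts = Counter(y_train)
for x, y in zip(x_train, y_train):
    # a label whose only sample is an empty string would end up
    # with an empty vocabulary
    if label_counts[y] == 1 and not x.strip():
        print("label", repr(y), "has a single, empty sample")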

@Slavenin
Author

Slavenin commented Jan 21, 2021

> Hi @Slavenin! That's weird, what type of labels are you working with? It would be nice if we could replicate this error locally so that we can fix it. Is it just a problem with the print_category_info() function? Did the rest of the code work well?

The file names are the category IDs, simply numbers.
[screenshot of the dataset folders]
There are 798 objects in the test and train folders.

> You can also use the learn function to train the model "incrementally". By using "learn" instead of "train" you can speed up the process by disabling the update of the model that is performed automatically after each call to train, as illustrated below:

learn works fine!

@Slavenin
Author

> Ah @sergioburdisso that makes sense!
>
> @Slavenin If you can't provide the dataset, it might give some insight to see the output of the Vocab. Size per Category.
> For that you could just replace the corresponding code block with: print(category[NAME], len(category[VOCAB]))
>
> I could imagine that you have a sample s = (x, y) in your train set, where x = "" and s is the only sample in your dataset with label y, thus vocab.size(y) = 0 - @sergioburdisso, is that possible?

You're right.
The output of the Vocab. Size per Category:
[screenshot of the vocabulary size per category]

But I do not understand how to fix that. I have a category with only one record.

@angrymeir
Collaborator

As far as I understand it, there are two issues.

  1. You have a sample with the label 219 for which no n-grams have been learned. As pointed out above, one reason could be that this sample is empty. So if the sample is empty, why even keep it in the dataset? Another option could be that the sample only contains characters that are not learned, e.g. punctuation (. , ! ?) (see the sketch after this list).

  2. One of the categories has only one record. This could cause imbalance problems. But also here, without looking at your use case (e.g. hyperparameter optimisation or just using the default parameters) and the Vocab. Size distribution, it's hard to tell whether this will actually be a problem.
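A sketch of the check hinted at in point 1 (illustrative only; it mirrors the a-zA-Z default preprocessing mentioned in the comment below, not the library's actual code):

import re

for i, (x, y) in enumerate(zip(x_train, y_train)):
    # a sample with no a-zA-Z characters (e.g. punctuation only)
    # contributes no n-grams under the default preprocessing
    if not re.search(r"[a-zA-Z]", x):
        print("sample", i, "with label", repr(y), "has no learnable characters")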

@Slavenin
Author

  1. No, this sample is not empty.
    [screenshot of the sample]
  2. I want to exclude samples with fewer than n entries.

Does your lib work with any language?

@angrymeir
Collaborator

I think @sergioburdisso can answer that much more competently :)

sergioburdisso added a commit that referenced this issue Jan 30, 2021
-Quick fix of default compatibility with foreign languages (#15).
@sergioburdisso
Owner

sergioburdisso commented Jan 30, 2021

Hi @Slavenin!

Does your lib work with any language?

Yes, the model works independently of the language being used. However, the default preprocessing function ignores characters outside the "standard" ones (a-zA-Z), so to prevent this behavior you should simply disable the default preprocessing using the prep=False argument with the train and predict functions, as follows:

clf.train(x_train, y_train, prep=False)
...
clf.predict(x_test, prep=False)

I've also just made a tiny update to the source code of the preprocessing function to consider all valid "word" characters (\w) instead of just the range a-zA-Z, and I've already released a new version (0.6.4) with the patch fixing this issue, so updating the package (pip install -U pyss3) should also solve this problem (basically, by default, the library should now work with any language).
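A quick illustration of the difference (in Python 3, \w is Unicode-aware by default for str patterns; this is illustrative, not the library's actual preprocessing code):

import re

print(re.findall(r"[a-zA-Z]+", "привет мир"))  # [] -- nothing matches
print(re.findall(r"\w+", "привет мир"))        # ['привет', 'мир']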

Let us know if this solved your problem, and do not hesitate to re-open this issue in case it is needed.


Regarding the size of the dataset, I would like to point out two things:

  1. When working with big datasets, it is also better to perform hyperparameter optimization without n-grams, because the code is optimized (with NumPy) for the case when n-grams=1; when n-grams>1 the library runs much slower.

  2. As mentioned by @angrymeir, it is recommended to perform some "data cleaning" before feeding the model, such as removing documents with very few words or categories with very few documents; the more "balanced" your data is across all categories, the better. Categories with very few words will probably add noise to your final predictive model (see the sketch below).
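A minimal cleaning sketch along those lines (min_docs is a hypothetical threshold):

from collections import Counter

min_docs = 5  # hypothetical threshold, tune for your data
label_counts = Counter(y_train)

pairs = [(x, y) for x, y in zip(x_train, y_train)
         if x.strip() and label_counts[y] >= min_docs]
x_clean, y_clean = map(list, zip(*pairs))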

PS: I'm really sorry for the delay, I'm currently on vacation 😎 in the countryside 🐔, with very limited Internet access (and more importantly, very limited electrical power xD). Take care guys! 💪
