Partial learn #15

Closed
Slavenin opened this issue Jan 19, 2021 · 10 comments
Labels
question Further information is requested

Comments

@Slavenin

Hi!
I have a dataset of 900k records with 800 categories, but I can't train my model because 16 GB of RAM is not enough.
How can I train my model in parts?

@angrymeir
Collaborator

angrymeir commented Jan 19, 2021

Hi @Slavenin,

you can split up your training set and train on the parts sequentially.
I created a small Gist (which you can run in Colab) that shows it doesn't make a difference if you just use the plain train function.

However, I'm pretty sure that for more advanced training, such as hyperparameter search, this approach might not be applicable. Maybe @sergioburdisso could elaborate a bit on that 😇?
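For reference, a minimal sketch of that idea (x_train and y_train are assumed to be plain Python lists of documents and labels, and chunk_size is a hypothetical size you'd pick to fit your RAM):

from pyss3 import SS3

clf = SS3()

chunk_size = 100_000  # hypothetical; pick whatever fits in memory
for i in range(0, len(x_train), chunk_size):
    # each call keeps accumulating counts on top of the previous ones
    clf.train(x_train[i:i + chunk_size], y_train[i:i + chunk_size])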

@angrymeir angrymeir added the question Further information is requested label Jan 19, 2021
@Slavenin
Author

Thanks! It works.
But I get an error when printing the categories:
[screenshot of the error]

@sergioburdisso
Owner

Hi @Slavenin! That's weird, what type of labels are you working with? It would be nice if we could replicate this error locally so that we can fix it. Is it just a problem with the print_category_info() function? Did the rest of the code work well?

You can also use the learn function to train the model "incrementally". By using "learn" instead of "train" you can speed up the process by disabling the update of the model that is performed automatically after each call to train, as illustrated below:

# create a single "huge" document for each category by concatenating each of its documents
# then call the learn function for each one of the categories using "update=False"
clf.learn(huge_doc_cat_1, label_cat_1, update=False)
clf.learn(huge_doc_cat_2, label_cat_2, update=False)
....
clf.learn(huge_doc_cat_n, label_cat_n)  # <-- note that the last category shouldn't use "update=False" so that the model is finally updated

Of course, you can use a loop to implement the above code, I wrote it that way just to make the explanation simpler.
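A hedged loop version of the above (docs_by_cat is an assumed dict mapping each label to the concatenation of all its documents):

labels = list(docs_by_cat)
for i, label in enumerate(labels):
    # update the model only after the last category has been learned
    clf.learn(docs_by_cat[label], label, update=(i == len(labels) - 1))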

As pointed out by @angrymeir, when working with a big dataset, it is better to perform hyperparameter optimization on a sub-sample of the dataset. For instance, you can use the stratified k-fold function of sklearn and then work with just a single fold (subset) to optimize the model. (Note we're using "stratified" here to make sure at least one sample of each category is included in each split; in fact, it will try to fit the same number of samples for each category in each training subset/split/fold.)
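A sketch of that sub-sampling with sklearn's StratifiedKFold (x_train/y_train are assumed to be lists; note the stratified split requires every category to have at least n_splits samples):

from sklearn.model_selection import StratifiedKFold

# keep a single stratified fold (~10% of the data) for
# hyperparameter optimization
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
_, subset_idx = next(skf.split(x_train, y_train))

x_sub = [x_train[i] for i in subset_idx]
y_sub = [y_train[i] for i in subset_idx]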

Nevertheless, optimizing the current source code to be robust with respect to the size of the dataset is on the TODO list, especially in relation to the number of categories, for instance by using NumPy data structures (I have some work done in this regard but there's still work left to do).

(Thanks, @angrymeir for your valuable help, you rock buddy! 💪).

@angrymeir
Collaborator

Ah @sergioburdisso that makes sense!

@Slavenin If you can't provide the dataset, it might give some insight to see the output of the Vocab. Size per Category.
For that you could just replace the corresponding code block with: print(category[NAME], len(category[VOCAB]))

I could imagine that you have a sample s = (x, y) in your train set, where x = "" and s is the only sample in your dataset with label y, thus vocab.size(y) = 0 - @sergioburdisso, is that possible?
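A quick way to check that hypothesis (a sketch, assuming x_train/y_train are lists):

from collections import Counter

label_counts = Counter(y_train)
for x, y in zip(x_train, y_train):
    # a label whose only sample is an empty string would end up
    # with an empty vocabulary
    if label_counts[y] == 1 and not x.strip():
        print("label", repr(y), "has a single, empty sample")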

@Slavenin
Author

Slavenin commented Jan 21, 2021

> Hi @Slavenin! That's weird, what type of labels are you working with? It would be nice if we could replicate this error locally so that we can fix it. Is it just a problem with the print_category_info() function? Did the rest of the code work well?

The file names are the category IDs, simply numbers.
[screenshot of the dataset folders]
There are 798 objects in the test and train folders.

> You can also use the learn function to train the model "incrementally". By using "learn" instead of "train" you can speed up the process by disabling the update of the model that is performed automatically after each call to train, as illustrated below:

learn works fine!

@Slavenin
Author

> Ah @sergioburdisso that makes sense!
>
> @Slavenin If you can't provide the dataset, it might give some insight to see the output of the Vocab. Size per Category.
> For that you could just replace the corresponding code block with: print(category[NAME], len(category[VOCAB]))
>
> I could imagine that you have a sample s = (x, y) in your train set, where x = "" and s is the only sample in your dataset with label y, thus vocab.size(y) = 0 - @sergioburdisso, is that possible?

You're right.
The output of the Vocab. Size per Category:
[screenshot of the vocabulary size per category]

But I do not understand how to fix that. I have a category with only one record.

@angrymeir
Collaborator

As far as I understand it, there are two issues.

  1. You have a sample with the label 219 for which no n-grams have been learned. As pointed out above, one reason could be that this sample is empty. So if the sample is empty, why even keep it in the dataset? Another option could be that the sample only contains characters that are not learned, e.g. punctuation (. , ! ?) (see the sketch after this list).

  2. One of the categories has only one record. This could cause imbalance problems. But also here, without looking at your use case (e.g. hyperparameter optimisation or just using the default parameters) and the Vocab. Size distribution, it's hard to tell whether this will actually be a problem.
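A sketch of the check hinted at in point 1 (illustrative only; it mirrors the a-zA-Z default preprocessing mentioned in the comment below, not the library's actual code):

import re

for i, (x, y) in enumerate(zip(x_train, y_train)):
    # a sample with no a-zA-Z characters (e.g. punctuation only)
    # contributes no n-grams under the default preprocessing
    if not re.search(r"[a-zA-Z]", x):
        print("sample", i, "with label", repr(y), "has no learnable characters")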

@Slavenin
Author

  1. No, this sample is not empty.
    [screenshot of the sample]
  2. I want to exclude samples with fewer than n entries.

Does your lib work with any language?

@angrymeir
Collaborator

I think @sergioburdisso can answer that much more competently :)

sergioburdisso added a commit that referenced this issue Jan 30, 2021
-Quick fix of default compatibility with foreign languages (#15).
@sergioburdisso
Owner

sergioburdisso commented Jan 30, 2021

Hi @Slavenin!

Does your lib work with any language?

Yes, the model works independently of the language being used. However, the default preprocessing function ignores characters outside the "standard" ones (a-zA-Z), so to prevent this behavior you should simply disable the default preprocessing using the prep=False argument with the train and predict functions, as follows:

clf.train(x_train, y_train, prep=False)
...
clf.predict(x_test, prep=False)

I've also just made a tiny update to the source code of the preprocessing function to consider all valid "word" characters (\w) instead of just the range a-zA-Z, and I've already released a new version (0.6.4) with the patch fixing this issue, so updating the package (pip install -U pyss3) should also solve this problem (basically, by default, the library should now work with any language).
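A quick illustration of the difference (in Python 3, \w is Unicode-aware by default for str patterns; this is illustrative, not the library's actual preprocessing code):

import re

print(re.findall(r"[a-zA-Z]+", "привет мир"))  # [] -- nothing matches
print(re.findall(r"\w+", "привет мир"))        # ['привет', 'мир']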

Let us know if this solved your problem, and do not hesitate to re-open this issue in case it is needed.


Regarding the size of the dataset, I would like to point out two things:

  1. When working with big datasets, it is also better to perform hyperparameter optimization without n-grams, because the code is optimized (with NumPy) for the case when n-grams=1; when n-grams>1 the library runs much slower.

  2. As mentioned by @angrymeir, it is recommended to perform some "data cleaning" before feeding the model, such as removing documents with very few words or categories with very few documents; the more "balanced" your data is across all categories, the better. Categories with very few words will probably add noise to your final predictive model (see the sketch below).
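A minimal cleaning sketch along those lines (min_docs is a hypothetical threshold):

from collections import Counter

min_docs = 5  # hypothetical threshold, tune for your data
label_counts = Counter(y_train)

pairs = [(x, y) for x, y in zip(x_train, y_train)
         if x.strip() and label_counts[y] >= min_docs]
x_clean, y_clean = map(list, zip(*pairs))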

PS: I'm really sorry for the delay, I'm currently on vacation 😎 in the countryside 🐔, with very limited Internet access (and more importantly, very limited electrical power xD). Take care guys! 💪
