"UK" doesn't seem to be registered as input #456
Comments
The minimum token length is set to 3 in the Analyzer base class, from which the other analyzers inherit. Of course it's possible to change this, but it could add a lot of noise and increase the size of the models. Any ideas?
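As a rough sketch of the setup described here, a base class can hold the minimum-length check that all analyzers share. The class and method names below are illustrative only, not Annif's actual API:

```python
# Illustrative sketch of a shared token-length filter in an analyzer
# base class. Names and structure are assumptions, not Annif's real code.

TOKEN_MIN_LENGTH = 3  # the default discussed in this thread


class Analyzer:
    """Base class: subclasses inherit the same validity check."""

    def is_valid_token(self, token: str) -> bool:
        # Drop any token shorter than the minimum length.
        return len(token) >= TOKEN_MIN_LENGTH

    def tokenize_words(self, text: str):
        # Naive whitespace tokenization, just to show the filtering step.
        return [t for t in text.split() if self.is_valid_token(t)]


analyzer = Analyzer()
print(analyzer.tokenize_words("visit to the UK"))  # → ['visit', 'the']
```

With this filter, both "to" and "UK" are silently dropped, which is exactly the effect reported in the issue.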
Makes sense.
It would also be possible to tweak the is_valid_token method, for example so that all caps words are treated differently (e.g. minimum length 2 for all caps). Heuristics like this are bound to be imperfect - even very short words may have important meaning. For example in Swedish "ö" means island and "å" means river :)
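The suggested heuristic could look something like the following sketch. The constant names and the relaxed limit of 2 are assumptions for illustration, not actual Annif code:

```python
# Sketch of the proposed heuristic: allow shorter tokens when they are
# all caps (e.g. "UK"). Names and thresholds are illustrative assumptions.

TOKEN_MIN_LENGTH = 3
TOKEN_MIN_LENGTH_UPPER = 2  # assumed relaxed limit for all-caps tokens


def is_valid_token(token: str) -> bool:
    if token.isupper():
        return len(token) >= TOKEN_MIN_LENGTH_UPPER
    return len(token) >= TOKEN_MIN_LENGTH


print(is_valid_token("UK"))  # True: all caps, length 2
print(is_valid_token("to"))  # False: lowercase and too short
print(is_valid_token("ö"))   # False: still filtered, despite meaning "island"
```

Note that, as pointed out above, the Swedish single-letter words would still be filtered out, so this heuristic only helps the all-caps abbreviation case.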
I feared that making exceptions would open the floodgates for even more exceptions.
I think that the best way to approach this is to try to make an adjustment, then benchmark the results before and after the change, for example with an Omikuji model, on a couple of different data sets. In this case important metrics could be model size, training time and RAM, precision, recall, F1 score etc.
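For the before/after comparison, a minimal sketch of the per-document quality metrics mentioned above (precision, recall, F1 over predicted vs. gold subject sets) could be:

```python
# Minimal sketch of precision/recall/F1 for one document, computed over
# the gold and predicted subject sets. Purely illustrative helper code.

def prf1(gold: set, predicted: set):
    tp = len(gold & predicted)  # true positives: correctly predicted subjects
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: one subject correct out of two predicted / two gold.
p, r, f = prf1({"UK", "economy"}, {"UK", "politics"})
print(p, r, f)  # → 0.5 0.5 0.5
```

In practice these would be averaged over a test collection, alongside the resource metrics (model size, training time, RAM) measured separately.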
I made some experiments regarding timing and memory use for minimum token sizes two and three. I ran them with no other programs running and with turbo boost disabled, to reduce the impact of thermal throttling. The project configuration:

```ini
[omikuji-yso-en]
name=omikujiy YSO english
language=en
backend=omikuji
vocab=yso
analyzer=snowball(english)
cluster_k=100
max_depth=3
```

I used the YSO-Finna title data set from the tutorial. Here are the major stats.
I need to retrain for three as I did not save the files
More details (collapsed sections):

- Min Token Size 3: Training, Evaluation Metrics, Evaluation Resources
- Min Token Size 2: Training (User time (seconds): 15162.35), Evaluation Resources
It looks like there was a small improvement in precision/recall/F1 and a small increase in training and evaluation times with the minimum token length set to 2 instead of 3. Somewhat surprisingly, memory use (both training and evaluation) decreased, but this can vary between runs and the difference was very small anyway. Based on this, do you think it would be a good idea to set the token size to 2 globally @mo-fu? Should we do more tests first? It would also be possible to
According to the numbers for omikuji, I would say the token size could be set to two globally. But maybe TFIDF should also be checked, for two reasons:
Yes, it's a good idea to test with tfidf as well.
Here are the training results for tfidf:

[Table: most important metrics, Min Size 2 vs. Min Size 3]
Evaluation results for tfidf report mostly the same:

[Table: NDCG scores, Min Size 2 vs. Min Size 3]
[Table: timing and memory per metric, Min Token Size 2 vs. Min Token Size 3]
Looks promising! Care to make a PR, @mo-fu? It should be pretty trivial...
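The kind of one-line change being discussed might look like the sketch below. The constant and function names are assumptions for illustration; this is not the actual diff from the PR:

```python
# Hypothetical one-line change: lower the global minimum token length
# from 3 to 2 in the analyzer base module. Names are illustrative.

TOKEN_MIN_LENGTH = 2  # was 3


def is_valid_token(token: str) -> bool:
    return len(token) >= TOKEN_MIN_LENGTH


print(is_valid_token("UK"))  # True with the new limit
print(is_valid_token("å"))   # single letters are still filtered out
```

With the limit at 2, "UK" passes the filter while one-letter tokens remain excluded.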
Closed by #468 (already merged).
During our 2020 evaluation we discovered that the STW concept for United Kingdom was often not assigned, even though "UK" was present in the input. Changing it to "U.K." helped. The same is true for the web UI on https://ai.finto.fi/ with the YSO English project; just try these two short strings. Longer sentences show the same effect.

I suppose something happens during preprocessing that removes the two-letter word.
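The suspected preprocessing effect can be illustrated with a simplified stand-in for the analyzer: with a minimum token length of 3, "UK" is filtered out, while "U.K." can survive if the tokenizer keeps the periods as part of the token. This does not show Annif's actual tokenization, only the length-filter mechanism:

```python
# Simplified illustration of the suspected cause (not Annif's real code):
# a minimum-length filter drops the two-letter token "UK", but "U.K."
# survives when the tokenizer keeps the periods in the token.

MIN_LEN = 3


def filter_tokens(tokens):
    return [t for t in tokens if len(t) >= MIN_LEN]


print(filter_tokens(["UK"]))    # → [] (the concept gets no evidence)
print(filter_tokens(["U.K."]))  # → ['U.K.'] (token survives the filter)
```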