Support Spanish language #1

Closed
wants to merge 6 commits into from

pr3ssh
Contributor

@pr3ssh pr3ssh commented Jun 1, 2020

I added es.tar.gz as an optional language and also added unit test strings, but for some reason the Speller does not work properly.
To create es.tar.gz, I followed the steps in the README file.
Any idea what could be wrong?

@filyp
Owner

filyp commented Jun 2, 2020

I have one guess. Make sure you have es.tar.gz BOTH in optional_languages and autocorrect/data. Speller first looks for it in autocorrect/data, and if it's not there, it tries to download from optional_languages on master. It's not merged to master yet so it will fail.
If that doesn't help, paste any output you have and a way to reproduce the problem.
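
The lookup order described above can be sketched roughly as follows. This is an illustration only, not autocorrect's actual code: `resolve_language_archive`, `data_dir`, and the `download` callable are hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Callable


def resolve_language_archive(
    lang: str,
    data_dir: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Return the path to <lang>.tar.gz, downloading it only if it is missing locally."""
    archive = data_dir / f"{lang}.tar.gz"
    if archive.exists():
        # A local copy wins: no network access is attempted.
        return archive
    # Hypothetical fallback: fetch the archive from a remote location.
    # If the file has not been published there yet, this step fails.
    download(lang, archive)
    if not archive.exists():
        raise FileNotFoundError(f"could not obtain {archive.name}")
    return archive
```

Under this assumption, an archive that exists only on an unmerged branch has to be placed in the local data directory, because the remote fallback cannot find it.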

@pr3ssh
Contributor Author

pr3ssh commented Jun 2, 2020

Before this PR, my first (local) attempt was to put es.tar.gz into the data folder and test:

spell = Speller(lang='es')
spell('hloa')

but the output was hloa instead of hola (the Spanish word for hello).


The README.md file suggests using

count_words('eswiki-latest-pages-articles.xml', 'ru')

to get the Spanish Wikipedia words. I think that's incorrect, since I'm adding Spanish. I changed it to

count_words('eswiki-latest-pages-articles.xml', 'es')

That's the only change I made in the process of adding the new language.
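
For context, the kind of word counting this step performs can be sketched as below. This is a simplified stand-in, not the library's `count_words` implementation: the function name `count_spanish_words` and the alphabet in the regular expression are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed Spanish alphabet: ASCII letters plus accented vowels, ü and ñ.
SPANISH_WORD = re.compile(r"[a-záéíóúüñ]+")


def count_spanish_words(text: str) -> Counter:
    """Count lowercase Spanish word occurrences in a piece of text."""
    return Counter(SPANISH_WORD.findall(text.lower()))


counts = count_spanish_words("Hola mundo. ¡Hola, España!")
print(counts.most_common(2))
```

The point of passing the right language code is that words are matched against that language's alphabet; with the wrong alphabet, Spanish words would be split or dropped.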

@filyp
Owner

filyp commented Jun 2, 2020

Ah, ok. The issue is probably that the word 'hloa' exists in Wikipedia, so the Speller doesn't try to correct it. The way I fixed this for other languages was to cut out rarely used words. You can do that by calling, for example:

spell = Speller(lang='es', threshold=4)

This uses only words that appeared at least 4 times in Wikipedia. You'll have to find the right threshold value empirically. After that, you can manually delete all the rarer words from the file in es.tar.gz (it's already sorted, so it should be easy).
Later, I will update the README section about adding new languages, because this step is important.
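
The effect of the threshold can be illustrated with a small sketch. The helper name `apply_threshold` and the counts are made up for the example; it only assumes, as stated above, that a word is kept when it appears at least `threshold` times.

```python
from typing import Dict


def apply_threshold(word_counts: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Keep only words seen at least `threshold` times."""
    return {word: n for word, n in word_counts.items() if n >= threshold}


# Made-up counts: a typo such as "hloa" can occur a few times in a large corpus.
word_counts = {"hola": 5321, "mundo": 2810, "hloa": 2}
kept = apply_threshold(word_counts, threshold=4)
print(sorted(kept))
```

Once the rare typo is dropped from the vocabulary, it is no longer treated as a known word, so the Speller can try to correct it.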

@pr3ssh
Contributor Author

pr3ssh commented Jun 2, 2020

With the new threshold...

Original number of words: 12196114
After applying threshold: 288623

I'm not really sure if that's a lot, but I tested some words and the Speller does not work properly with lower threshold values.

@filyp
Owner

filyp commented Jun 2, 2020

For other languages I set it lower, like 4, but I think Spanish has fewer variants of the same word, and the Spanish wiki is probably larger too. So as long as it passes the unit tests, it's fine.

@filyp
Owner

filyp commented Jun 2, 2020

I noticed es.tar.gz isn't stored in LFS, and I'd like to avoid bloating repo size. It probably happened because you forked before I set it up. You should be able to migrate it to LFS by running:

git lfs migrate import --include="*.tar.gz" --include-ref=refs/heads/master

And then force push.

@filyp
Owner

filyp commented Jun 4, 2020

It turned out LFS has a 1GB limit; after that it's paid, and I've used up almost all of it. Also, there is no way to delete old, unnecessary files! :c I'll have to find some other way to store those tar.gz files. Storing them as regular files, without LFS, is even worse, because there is a 500MB limit. I'll probably just put them on Google Drive. If you know of a better way, let me know :)

@pr3ssh
Contributor Author

pr3ssh commented Jun 5, 2020

🤔 Google Drive or any other server you have (HTTP or FTP). Good luck with that 🤞

@filyp
Owner

filyp commented Jun 11, 2020

Hi, I can't download es.tar.gz anymore, so could you mail it to me at [email protected]? I will add it to my Google Drive.

@pr3ssh
Contributor Author

pr3ssh commented Jun 11, 2020 via email

@filyp
Owner

filyp commented Jun 11, 2020

I can't. When I follow the link, it only gives an LFS reference:

version https://git-lfs.github.com/spec/v1
oid sha256:cad1ce706de6f7f84e420ece653af8d0ade59774c9bab12cdb0350e8f3b1a32a
size 1757679
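
A downloaded file can be checked for this situation before trying to unpack it. The sketch below is an illustration, with `parse_lfs_pointer` as a hypothetical helper name: it treats data as a pointer when it starts with the `version` line shown above and returns the key/value fields.

```python
from typing import Dict, Optional

LFS_SPEC_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(data: bytes) -> Optional[Dict[str, str]]:
    """Return the pointer's key/value fields, or None if data is not an LFS pointer."""
    # Pointer files are small text files; a real tar.gz starts with gzip magic bytes.
    if not data.startswith(LFS_SPEC_LINE.encode("ascii")):
        return None
    fields: Dict[str, str] = {}
    for line in data.decode("utf-8", errors="replace").splitlines():
        key, _, value = line.partition(" ")
        if key:
            fields[key] = value
    return fields


pointer = (
    b"version https://git-lfs.github.com/spec/v1\n"
    b"oid sha256:cad1ce706de6f7f84e420ece653af8d0ade59774c9bab12cdb0350e8f3b1a32a\n"
    b"size 1757679\n"
)
info = parse_lfs_pointer(pointer)
print(info["size"] if info else "not a pointer")
```

The `size` field is the size in bytes of the real archive, so the pointer above stands in for a file of about 1.7 MB.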

@pr3ssh
Contributor Author

pr3ssh commented Jun 11, 2020 via email

@filyp
Owner

filyp commented Jun 12, 2020

OK, no hurry. Sorry for the lost data. Did you lose this tar.gz too?

@pr3ssh
Contributor Author

pr3ssh commented Jun 12, 2020 via email

@filyp
Owner

filyp commented Jun 12, 2020

:< I know this pain, happened to me last month too

@filyp
Owner

filyp commented Jun 21, 2020

I merged your changes in ec15a64 instead of merging this pull request, to avoid adding es.tar.gz to the repo. I added the es.tar.gz you sent me to Google Drive.

Thank you for contributing :)

@filyp filyp closed this Jun 21, 2020
@pr3ssh
Contributor Author

pr3ssh commented Jun 21, 2020

@fsondej it was a pleasure ;)

@filyp filyp mentioned this pull request Jul 6, 2020
filyp pushed a commit that referenced this pull request Dec 4, 2021