BrokenPipeError: [Errno 32] Broken pipe #1

Open
YasineNifa opened this issue May 20, 2019 · 6 comments

Comments

@YasineNifa

Hello,
I am following this tutorial to create my French language model: https://github.com/kmario23/KenLM-training
But when I run this command:
bzcat ./data_final/vocabulary.txt.bz2 | python preprocess.py | /home/innovation/kenlm/bin/lmplz -o 3 > myvocabulary.arpa

I get the following error :

print(' '.join(nltk.word_tokenize(sentence)).lower())
BrokenPipeError: [Errno 32] Broken pipe
Segmentation fault (core dumped)
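
With three stages in the pipe, the traceback alone does not say which process crashed. A `BrokenPipeError` is raised when a process writes to a pipe whose reader has already exited, so the Python error may only be a symptom of a crash further down the pipe. The following is a minimal sketch, assuming a POSIX system, that runs the stages from Python and reports each stage's exit code. `run_pipeline` is a hypothetical helper, not part of the tutorial or of KenLM.

```python
import subprocess


def run_pipeline(commands):
    """Run commands as a shell-style pipeline and return each stage's exit code.

    On POSIX, a negative code -N means the stage was killed by signal N
    (for example -11 for SIGSEGV on Linux).
    """
    procs = []
    prev_out = None
    for index, cmd in enumerate(commands):
        is_last = index == len(commands) - 1
        proc = subprocess.Popen(
            cmd,
            stdin=prev_out,
            stdout=None if is_last else subprocess.PIPE,
        )
        if prev_out is not None:
            # Drop the parent's copy of the read end so the upstream stage
            # sees a broken pipe if this stage exits early.
            prev_out.close()
        prev_out = proc.stdout
        procs.append(proc)
    return [proc.wait() for proc in procs]
```

Called with the three commands from the pipeline above as argument lists, the stage with a non-zero (or negative) exit code is the one to investigate first. Note that the last stage's output goes to the terminal here rather than to a file.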
@kmario23
Owner

Hi @YasineNifa ,
I haven't encountered such issues with English text. Have you followed the guide exactly? I'd suggest paying particular attention to creating a virtual environment. Maybe this discussion will also be helpful: ioerror-errno-32-broken-pipe-python

Please note that the file bible.en.txt.bz2 should be raw text with a single sentence per line. I see that you're using a vocabulary file instead.
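
For reference, a preprocessing script can catch the broken pipe itself, which makes it clearer that the reader went away. This is a minimal sketch, not the tutorial's preprocess.py: `emit_lines` is a hypothetical helper, and plain whitespace splitting stands in for `nltk.word_tokenize` to keep the example self-contained.

```python
import sys


def emit_lines(lines, out=sys.stdout):
    """Write lowercased, whitespace-normalized lines to `out`.

    Returns the number of lines written. Stops quietly if the reader on
    the other end of the pipe has closed it.
    """
    written = 0
    try:
        for line in lines:
            out.write(" ".join(line.split()).lower() + "\n")
            written += 1
        out.flush()
    except BrokenPipeError:
        # The downstream process exited before we finished writing.
        print(
            f"downstream closed the pipe after {written} lines",
            file=sys.stderr,
        )
    return written
```

In a script, this would be called as `emit_lines(sys.stdin)`. Catching the exception only hides the Python traceback; if the downstream program crashed, that crash still needs to be diagnosed separately.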

@YasineNifa
Author

Yes, I followed the guide, but I did not execute this command: bzcat vocabulary.txt.bz2 | python process.py | wc, because I could not find the process.py file.
And yes, the vocabulary file has the same structure as the bible file (raw text with a single sentence per line).
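
The skipped command pipes the text into wc as a sanity check before any model is trained. A rough Python equivalent for checking the compressed file directly is sketched below, assuming the file is UTF-8 text with one sentence per line. `count_lines_and_tokens` is a hypothetical helper, not part of the tutorial.

```python
import bz2


def count_lines_and_tokens(path):
    """Return (non-empty line count, whitespace token count) for a .bz2 text file."""
    n_lines = 0
    n_tokens = 0
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                n_lines += 1
                n_tokens += len(tokens)
    return n_lines, n_tokens
```

A token count close to the line count would suggest a word list (one word per line) rather than running sentences.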

@kmario23
Owner

> but I did not execute this cmd : bzcat vocabulary.txt.bz2 | python process.py | wc because I did not find the process.py file

Oh sorry, that was a typo. Fixed it! Do you have the data publicly available? I can try to replicate the error.

@YasineNifa
Author

Here is the data I am using: https://voice.mozilla.org/fr/datasets
Thanks for your time :)

@YasineNifa
Author

If you want the vocabulary.txt, here is a link where you can find it:
https://drive.google.com/open?id=1TJH1O5nQsXXO0tLFPRi2zmWUQK_F4wmc

@LiqiangJing

Hi, did you fix this problem? I am struggling with it now.

3 participants