How to translate Chinese using StartSentencePieceMtPipe #8

Closed
xulihang opened this issue Jan 22, 2021 · 5 comments

@xulihang

I am using the Tatoeba-MT-models to test English-Chinese translation with StartSentencePieceMtPipe.bat.

It works for English to Chinese but not for Chinese to English.

Is there some preprocessing needed for Chinese as the source text?

Chinese to English:

PS E:\Download\FiskmoMTEngine> .\StartSentencePieceMtPipe.bat C:\Users\xulihang\AppData\Local\fiskmo\models\zh-en\opus-2020-07-17-zh2en
测试
▁P s st .
实验
▁P s st .

In addition, I found that the translation is repeated when the source text is very short.

English to Chinese:

PS E:\Download\FiskmoMTEngine> .\StartSentencePieceMtPipe.bat C:\Users\xulihang\AppData\Local\fiskmo\models\en-zh\opus-2020-07-17
Test
▁ 测试 测试 测试
Experiment
▁ 实验 实验
@xulihang
Author

I think this is because I am using PowerShell. I switched to Cygwin and it works fine.

@TommiNieminen
Collaborator

PowerShell probably doesn't send UTF-8 when you type into it, and the script expects UTF-8, so that is likely the problem. When you run the script from cmd.exe, the chcp 65001 line in the script switches the console to UTF-8, but that doesn't work in PowerShell.
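
To make the encoding point concrete, here is a minimal Python sketch of why non-UTF-8 input cannot be read correctly by a consumer that expects UTF-8. The thread does not say which encoding PowerShell actually used; cp936 is only an assumed example of a legacy code page, chosen for illustration.

```python
# Illustration only: "cp936" is an ASSUMED legacy code page, not taken from the thread.
text = "测试"

utf8_bytes = text.encode("utf-8")     # what the script expects to receive
legacy_bytes = text.encode("cp936")   # what a non-UTF-8 console might send instead

print("UTF-8 bytes:", utf8_bytes.hex(" "))
print("cp936 bytes:", legacy_bytes.hex(" "))

# A consumer that decodes as UTF-8 cannot recover the text from the legacy bytes.
misread = legacy_bytes.decode("utf-8", errors="replace")
print("cp936 bytes read as UTF-8:", ascii(misread))
```

The same characters become different byte sequences, so the model receives something other than the intended sentence.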

The English to Chinese model seems to be a multilingual model with many possible target languages. With those kinds of models, you need to add a token for the target language as the first token of the source sentence. So you should type e.g. >>cmn<< Test as the input (this is mentioned in the README.md of the model, but it can be hard to notice). The possible target codes are: cjy_Hans cjy_Hant cmn cmn_Hans cmn_Hant gan lzh lzh_Hans nan wuu yue yue_Hans yue_Hant. I'm not sure if that will resolve the issue of repeated output, it might have some other cause.
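
As a rough sketch of the input format described above, the following Python helper builds a line of the form ">>code<< sentence" and rejects codes that are not in the list quoted in this comment. The names TARGET_CODES and with_target_token are hypothetical and not part of any Fiskmö or OPUS tool; the helper only formats a string and does not run the model.

```python
# Target codes copied from the comment above; the helper itself is hypothetical.
TARGET_CODES = {
    "cjy_Hans", "cjy_Hant", "cmn", "cmn_Hans", "cmn_Hant", "gan",
    "lzh", "lzh_Hans", "nan", "wuu", "yue", "yue_Hans", "yue_Hant",
}


def with_target_token(sentence: str, target: str) -> str:
    """Return the sentence prefixed with a >>target<< language token."""
    if target not in TARGET_CODES:
        raise ValueError(f"unknown target language code: {target!r}")
    return f">>{target}<< {sentence}"


print(with_target_token("Test", "cmn"))
```

Checking the code before building the line catches typos early, since a code the model does not know is not guaranteed to be treated as a language token.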

@xulihang
Author

The target ID seems to work, but the output text is mixed with the target ID.

PS E:\Download\FiskmoMTEngine> .\StartSentencePieceMtPipe.bat C:\Users\xulihang\AppData\Local\fiskmo\models\en-zh\opus-2020-07-17
>>cjy_Hans<< Test
▁@ c j y _ H ans _ 测试
>>cmn<< Test
▁* ▁ 立 方 公 尺 测试 ▁*
>>yue<< Test
▁@ ▁ y ue ▁@ ▁ 测试
>>zho<< Test
▁@ ▁ z ho ▁@ ▁ 测试 ▁@ ▁ 测试
>>wuu<< Test
▁W u u ▁ 測 試

@TommiNieminen
Collaborator

OK, this is actually a problem: the multilingual model expects the language ID token to be added after SentencePiece processing, but the StartSentencePieceMtPipe.bat script does not allow that. The script would need to add the language ID after the SentencePiece step. That would be trivial in any other scripting language, but I use batch scripts because they run reliably on any system without admin rights. I can't find an easy way to manipulate piped strings in batch scripts, so I'll have to rework the processing to implement it. If you use Cygwin, though, you could use sed to add the language ID (you would also need to adapt the paths for Linux):

Preprocessing\spm_encode.exe --model %modeldir%\source.spm | sed -e "s/^/>>cmn<< /" | Marian\marian.exe decode --log-level=warn -c %modeldir%\decoder.yml --max-length=200 --max-length-crop
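
For readers without sed, the middle step of that pipeline can be sketched in Python. This is an assumption-laden illustration, not part of the Fiskmö scripts: the function names add_language_token and run_filter are hypothetical, and the demo uses in-memory streams in place of the real spm_encode output.

```python
import io
from typing import Iterable, Iterator, TextIO


def add_language_token(lines: Iterable[str], target: str) -> Iterator[str]:
    """Prepend '>>target<< ' to every line, like sed -e "s/^/>>target<< /"."""
    token = f">>{target}<< "
    for line in lines:
        yield token + line


def run_filter(source: TextIO, sink: TextIO, target: str) -> None:
    """Copy already-segmented lines from source to sink with the token added."""
    for line in add_language_token(source, target):
        sink.write(line)


# Demo: these strings stand in for SentencePiece output; they are made up.
segmented = io.StringIO("▁Test\n▁Experiment\n")
out = io.StringIO()
run_filter(segmented, out, "cmn")
print(ascii(out.getvalue()))
```

In a real pipeline the two streams would be standard input and standard output, both opened as UTF-8, with this filter placed between the SentencePiece encoder and the Marian decoder.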

@TommiNieminen
Collaborator

The new release of OPUS-CAT supports multilingual Tatoeba-MT models, so I'm closing this.
