How to translate Chinese using StartSentencePieceMtPipe #8

xulihang · 2021-01-22T03:34:35Z

I am using the Tatoeba-MT-models to test English-Chinese translation with StartSentencePieceMtPipe.bat.

It works for English to Chinese but not for Chinese to English.

Is there some preprocessing needed for Chinese as the source text?

Chinese to English:

PS E:\Download\FiskmoMTEngine> .\StartSentencePieceMtPipe.bat C:\Users\xulihang\AppData\Local\fiskmo\models\zh-en\opus-2020-07-17-zh2en
测试
▁P s st .
实验
▁P s st .

In addition, I found that the translation is repeated when the source text is very short.

English to Chinese:

PS E:\Download\FiskmoMTEngine> .\StartSentencePieceMtPipe.bat C:\Users\xulihang\AppData\Local\fiskmo\models\en-zh\opus-2020-07-17
Test
▁ 测试 测试 测试
Experiment
▁ 实验 实验

xulihang · 2021-01-22T05:26:34Z

I think this is because I am using PowerShell. I switched to Cygwin and it works fine.

TommiNieminen · 2021-01-22T08:56:01Z

PowerShell probably doesn't send UTF-8 when you type into it, and that is likely the problem, because the script expects UTF-8. When you run the script from cmd.exe, the chcp 65001 line in the script switches the console to UTF-8, but that won't work with PowerShell.
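To make the encoding mismatch concrete, here is a small Python illustration. It assumes the console was using code page 936 (GBK), which is only a guess for illustration; the thread does not say which code page PowerShell was actually using.

```python
# Illustrative only: code page 936 (GBK) is an assumed console encoding,
# not something confirmed in this thread.
text = "测试"

utf8_bytes = text.encode("utf-8")    # the byte sequence the script expects
cp936_bytes = text.encode("cp936")   # what a non-UTF-8 console might send

print(utf8_bytes)
print(cp936_bytes)

# Reading the cp936 bytes as if they were UTF-8 does not recover the text,
# so the tokenizer and the model would see something other than "测试".
misread = cp936_bytes.decode("utf-8", errors="replace")
print(misread == text)
```

The two byte sequences differ, so a program that assumes UTF-8 input cannot reconstruct the original characters from the other encoding.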

The English to Chinese model seems to be a multilingual model with many possible target languages. With those kinds of models, you need to add a token for the target language as the first token of the source sentence. So you should type e.g. >>cmn<< Test as the input (this is mentioned in the README.md of the model, but it can be hard to notice). The possible target codes are: cjy_Hans cjy_Hant cmn cmn_Hans cmn_Hant gan lzh lzh_Hans nan wuu yue yue_Hans yue_Hant. I'm not sure if that will resolve the issue of repeated output, it might have some other cause.
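As a rough sketch of the input format described here, the following Python helper builds a line with the target token in front and rejects codes that are not in the list above. The function and variable names are hypothetical; only the `>>code<<` format and the list of codes come from this comment.

```python
# Target codes copied from the comment above.
TARGET_CODES = {
    "cjy_Hans", "cjy_Hant", "cmn", "cmn_Hans", "cmn_Hant", "gan",
    "lzh", "lzh_Hans", "nan", "wuu", "yue", "yue_Hans", "yue_Hant",
}


def with_target_token(sentence: str, code: str) -> str:
    """Return the sentence prefixed with a >>code<< target-language token."""
    if code not in TARGET_CODES:
        raise ValueError(f"unsupported target code: {code!r}")
    return f">>{code}<< {sentence}"


line = with_target_token("Test", "cmn")  # ">>cmn<< Test"
```

Checking the code first catches typos and codes the model was not trained with before anything is sent to the decoder.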

xulihang · 2021-01-22T09:54:37Z

The target ID seems to work, but the output text is mixed with the target ID.

PS E:\Download\FiskmoMTEngine> .\StartSentencePieceMtPipe.bat C:\Users\xulihang\AppData\Local\fiskmo\models\en-zh\opus-2020-07-17
>>cjy_Hans<< Test
▁@ c j y _ H ans _ 测试
>>cmn<< Test
▁* ▁ 立 方 公 尺 测试 ▁*
>>yue<< Test
▁@ ▁ y ue ▁@ ▁ 测试
>>zho<< Test
▁@ ▁ z ho ▁@ ▁ 测试 ▁@ ▁ 测试
>>wuu<< Test
▁W u u ▁ 測 試

TommiNieminen · 2021-01-22T11:45:39Z

OK, this is actually problematic: the multilingual model expects the language ID token to be added after SentencePiece processing, but the StartSentencePieceMtPipe.bat script does not really allow that. The script would need to add the language ID after the SentencePiece step. This would be trivial in any other scripting language, but I use batch scripts because they run reliably on any system without admin rights. I can't find an easy way to manipulate piped strings in batch scripts, so I'll have to redo the processing to implement it. If you use Cygwin, though, you can use sed to add the language ID (you will also need to adapt the paths for Linux):

Preprocessing\spm_encode.exe --model %modeldir%\source.spm | sed -e "s/^/>>cmn<< /" | Marian\marian.exe decode --log-level=warn -c %modeldir%\decoder.yml --max-length=200 --max-length-crop
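For readers without sed, the same middle stage can be sketched in Python. This is only an illustration of the idea (prefix every SentencePiece-encoded line with the language token), not part of the project's scripts; the function names and the sample encoded lines are made up for the example.

```python
import io


def add_language_token(encoded_line: str, lang: str) -> str:
    """Prefix one SentencePiece-encoded line with the >>lang<< token."""
    return f">>{lang}<< {encoded_line}"


def tag_stream(src, dst, lang: str) -> None:
    """Copy lines from src to dst, adding the language token to each one."""
    for line in src:
        dst.write(add_language_token(line.rstrip("\r\n"), lang) + "\n")


# Demo with in-memory streams; the encoded pieces below are invented samples.
src = io.StringIO("▁Test\n▁Experiment\n")
dst = io.StringIO()
tag_stream(src, dst, "cmn")
tagged = dst.getvalue()  # two lines, each starting with ">>cmn<< "
```

To use this as a real pipe stage between spm_encode and marian, one would call tag_stream on UTF-8 text wrappers around standard input and output, for example io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8"), so that the console code page does not affect the data.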

TommiNieminen · 2021-06-11T15:00:13Z

The new release of OPUS-CAT supports multilingual Tatoeba-MT models, so I'm closing this.

TommiNieminen closed this as completed Jun 11, 2021
