How to translate Chinese using StartSentencePieceMtPipe #8
Comments
I think this is because I was using PowerShell. I changed to Cygwin and it works fine.
PowerShell probably doesn't send UTF-8 when you type into it, and the script expects UTF-8, so that is likely the problem. When you run the script from cmd.exe, the chcp 65001 line in the script sets the code page to UTF-8, but that won't work with PowerShell. The English-to-Chinese model seems to be a multilingual model with many possible target languages. With those kinds of models, you need to add a token for the target language as the first token of the source sentence. So you should type e.g. >>cmn<< Test as the input (this is mentioned in the README.md of the model, but it can be hard to notice). The possible target codes are: cjy_Hans cjy_Hant cmn cmn_Hans cmn_Hant gan lzh lzh_Hans nan wuu yue yue_Hans yue_Hant. I'm not sure whether that will resolve the issue of repeated output; it might have some other cause.
The target id seems to work. But the target text is mixed with the target id.
OK, this is actually problematic: the multilingual model expects the language id token to be added after SentencePiece processing, but the StartSentencePieceMtPipe.bat script does not really allow that. The script would need to add the language id after the SentencePiece processing. This would be trivial in any other scripting language, but I use batch scripts because they run reliably on any system without admin rights. I can't find an easy way to manipulate piped strings in batch scripts, so I'll have to redo the processing to implement it. But if you use Cygwin, you could use sed to add the language id (you need to adapt the paths for Linux as well):
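As a separate illustration (this is not the sed command from the comment above, and not part of OPUS-CAT), here is a minimal Python sketch of the same idea: take lines that have already been segmented by SentencePiece and prepend the target-language token to each one before they reach the translator. The function name `add_language_token` and the sample segmented line are hypothetical; only the `>>cmn<<` token format and the target codes come from the thread.

```python
from typing import Iterable, Iterator


def add_language_token(lines: Iterable[str], target_code: str = "cmn") -> Iterator[str]:
    """Prepend the >>code<< token to each already-segmented line.

    Assumes the input lines are the output of the SentencePiece step,
    so the token is added after segmentation and is never split into pieces.
    """
    token = f">>{target_code}<<"
    for line in lines:
        pieces = line.rstrip("\r\n")
        if not pieces.strip():
            # Pass empty lines through unchanged so line counts stay aligned.
            yield pieces
            continue
        yield f"{token} {pieces}"


# Hypothetical segmented input; real pieces depend on the model's vocabulary.
segmented = ["▁This ▁is ▁a ▁test\n"]
for out in add_language_token(segmented, "cmn"):
    print(out)  # prints ">>cmn<< ▁This ▁is ▁a ▁test"
```

In a pipe, the same function could be applied to standard input line by line, placed between the SentencePiece step and the translation step. Because the token is added after segmentation, it should not end up mixed into the translated text the way it does when typed as part of the raw source sentence.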
The new release of OPUS-CAT supports multilingual Tatoeba-MT models, so I'm closing this.
I am using the Tatoeba-MT-models to test English-Chinese translation with StartSentencePieceMtPipe.bat.
It works for English to Chinese but not for Chinese to English.
Is there some preprocessing needed for Chinese as the source text?
Chinese to English:
In addition, I found that the translation is repeated if the source text is very short.
English to Chinese: