Upper v lowercase MT quality issue #47

SafeTex opened this issue Sep 1, 2022 · 5 comments
SafeTex commented Sep 1, 2022

Hello Tommi and all

A translator on the memoQ IO groups forum mentioned that segments in uppercase are not translated nearly as well by Opus as those in lowercase.
I decided to check this out for myself, and he is completely right.
I took a couple of Swedish sentences off an easy website, pasted them all twice into Word, then converted one copy of the identical text into uppercase and ran it through the OPUS-CAT tool.
The results are in the attached file:
upper v lower case
What do you make of this, Tommi?
Thanks in advance

Dave Neve

@TommiNieminen (Collaborator)

The reason for the quality drop with UPPER CASE TEXT is that OPUS-MT and Tatoeba models consider upper- and lower-case characters to be entirely different symbols, so they are translated in completely different ways. That might seem weird, but the motivation is that words and texts written in upper case often have a different function than the same words and texts written in lower case, e.g. upper case can signal headings, warnings etc., which are translated differently than when the same words occur in lower case. So that's the theory, but in practice it seems there isn't enough upper-case text in the training data for the NMT model to learn how to deal with it. I'll make a note of this problem, in case there's an opportunity to change the current preprocessing to handle casing in a more rational way (but that change would only affect future models).
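To make the "different symbols" point concrete: an upper-cased sentence is split into different subword units than its lower-cased counterpart, so the model effectively sees different input. A minimal sketch, assuming a SentencePiece subword model (the model file path below is hypothetical, not an actual OPUS-MT artefact):

```python
# Minimal sketch: upper- and lower-case variants of the same sentence are
# segmented into different subword units, i.e. different symbols for the model.
# "source.spm" is a hypothetical SentencePiece model path, not an OPUS-MT file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="source.spm")

for text in ["Varning: halt vägbana", "VARNING: HALT VÄGBANA"]:
    print(sp.encode(text, out_type=str))
```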

For the current models, you can circumvent the problem by using the edit rules, since they support changing character case. There's a pre-edit rule in the documentation which relates directly to this case (i.e. it lower-cases input to the MT engine): https://helsinki-nlp.github.io/OPUS-CAT/editrules#case_conversion. If you want to restore upper case in the machine translation output, you would need a post-edit rule that performs the reverse operation, i.e. upper-cases everything, like this:

[screenshot: post-edit rule that upper-cases the MT output]
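(To spell out what the rule pair accomplishes, here's an illustrative Python sketch of the equivalent logic. This is not OPUS-CAT rule syntax, and `translate` is a placeholder for the MT engine call.)

```python
# Illustrative sketch of the pre-edit + post-edit rule pair, not OPUS-CAT syntax.
def translate(segment: str) -> str:
    """Placeholder for the OPUS-CAT MT engine call."""
    raise NotImplementedError

def translate_preserving_all_caps(segment: str) -> str:
    if segment.isupper():
        mt_output = translate(segment.lower())  # pre-edit: lower-case the input
        return mt_output.upper()                # post-edit: upper-case the output
    return translate(segment)
```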


SafeTex commented Sep 1, 2022

Hello Tommi
Thanks for the detailed explanation. I've entered the regex you gave us for converting uppercase to lowercase in the pre-editing phase. Who would ever have thought that uppercase would be translated differently from lowercase (not me, in any case).
Thanks for all your help
Dave Neve


SafeTex commented Sep 26, 2022

Hello Tommi

I've been keeping an eye on this issue since I was made aware of it.
I think that all-uppercase text is often found in titles and headings, where we try to be more concise.
But having now seen in my last job how poor the Opus translations of it are, I would personally be in favour of OPUS translating all-uppercase text as if it were lowercase.

Just a bit of input for you (not a criticism of any sort).

Regards

@TommiNieminen (Collaborator)

That's been my experience with upper-case text as well, and I agree that it should be handled differently. The current models can't be changed, but the case handling could be modified in future models, so I'll copy this thread to Jörg Tiedemann, who runs the OPUS model training.

@jorgtied Currently the OPUS models don't handle ALL CAPS text well, probably because there isn't enough of it in the training corpora. The motivation for training models on text with its original casing is that it avoids the need for truecasing/recasing and that casing occasionally has semantic significance (e.g. ALL CAPS text being mostly headings, non-translatables etc.). However, the scarcity of ALL CAPS training data means that in practice the models don't learn to handle ALL CAPS text properly. So it would probably be best to start using truecasing or recasing (or possibly even casing factors) in the models.
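To illustrate what I mean by casing factors: each token could be lower-cased and paired with an explicit case tag, so casing becomes a feature rather than a set of separate symbols. A rough sketch of the idea only, not the actual OPUS/Marian preprocessing:

```python
# Rough sketch of the casing-factor idea: lower-case every token and attach an
# explicit case tag. This illustrates the concept, not the OPUS preprocessing.
def case_factor(token: str) -> str:
    if len(token) > 1 and token.isupper():
        return "UC"  # ALL CAPS
    if token[:1].isupper():
        return "TC"  # Title Case
    return "LC"      # lower case

def factorize(sentence: str):
    return [(tok.lower(), case_factor(tok)) for tok in sentence.split()]

print(factorize("VARNING: HALT VÄGBANA"))
# [('varning:', 'UC'), ('halt', 'UC'), ('vägbana', 'UC')]
```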


SafeTex commented Sep 26, 2022

That's exactly what I would have said, but without the technical jargon, of course (as I don't know it).

Thanks for this
