Can we add a Termbase/glossary to the MT engine #68

Open
SafeTex opened this issue Feb 26, 2023 · 3 comments

Comments

SafeTex commented Feb 26, 2023

Hello Tommi and all

We've touched on this indirectly before.
For example, I have a term "Beställaren" that must always be translated as "the Client", but I get lots of variation such as "the Client", "the Customer", "the Mandator", "the Undertaker", "the Contracting Party", etc.
I'm reluctant to write a pre-editing rule as I don't want "the Client" in the source segments of the TM(X).
I could write a post-editing rule, but some of the above terms appear legitimately in other segments, in particular "the Customer" (the Client has Customers!).
So unless I've misunderstood or missed something, I think I need to be able to add a Termbase/Glossary that simply tells Opus to translate "Beställaren" as "the Client"
On paper, this looks easy - Beställaren = the Client - and rather necessary.
But this feature does not exist at present?
Do you think such a feature will be added at some point?

@TommiNieminen
Collaborator

This is something I have been working on for quite some time. I have experimented with glossary support in a development version of OPUS-CAT, but the problem is that it requires models that have been specifically trained to use glossaries. Training those models takes time, and before starting it, I want to make sure the glossary functionality works as it should and that it doesn't degrade translation quality in other ways.

Btw., in the scenario you mention, "the Client" would not pass into the translation memory: the modified source segment is only used internally in OPUS-CAT, and the CAT tool will store the original source segment.
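
As a rough illustration of that data flow (not OPUS-CAT's actual code), here is a minimal Python sketch. The names `translate_for_cat`, `pre_edit` and `fake_mt` are hypothetical stand-ins; the only point is that the pre-edited text is a temporary copy passed to the MT engine, while the original source is what gets returned for storage.

```python
from typing import Callable


def translate_for_cat(
    source: str,
    pre_edit: Callable[[str], str],
    translate: Callable[[str], str],
) -> tuple[str, str]:
    """Return the (source, target) pair that a CAT tool would store."""
    mt_input = pre_edit(source)   # modified copy, seen only by the MT engine
    target = translate(mt_input)  # the MT engine never sees the original text
    return source, target         # the original source is returned unchanged


# Stand-ins for illustration only (assumptions, not real OPUS-CAT components).
def pre_edit(segment: str) -> str:
    return segment.replace("Beställaren", "the Client")


def fake_mt(segment: str) -> str:
    return f"<MT output for: {segment}>"


stored_source, stored_target = translate_for_cat(
    "Beställaren ska betala.", pre_edit, fake_mt
)
print(stored_source)  # the untouched Swedish segment
```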

@SafeTex
Author

SafeTex commented Feb 27, 2023

Hello Tommi

So does that mean that I can use pre-editing as a substitute for a TB???
I guess the only difference then is that I have to set the pre-edit rules for every single TB entry rather than just give OPUS a TB in the right format with perhaps hundreds or thousands of entries.

Regards

Dave Neve (SafeTex)

@TommiNieminen
Collaborator

Pre-edit rules can be used to inject term translations in the source text (in which case they work as a kind of TB substitute), and often the NMT will carry over the term translation to the target text. But this is not the behavior the MT models have been trained to replicate, so it's going to be hit and miss. The planned terminology support in OPUS-CAT will use a similar method of injecting term translations into the source text, but in that case the models will be directly trained to transfer the term translation from source to target.
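
To make the injection idea concrete, here is a minimal Python sketch of what such a pre-edit step amounts to. It is an assumption-laden illustration, not how OPUS-CAT implements its rules: `GLOSSARY` and `inject_terms` are hypothetical names, and the replacement is a plain whole-word regex substitution.

```python
import re

# Hypothetical glossary: source term -> required target term.
GLOSSARY = {"Beställaren": "the Client"}


def inject_terms(source: str, glossary: dict[str, str]) -> str:
    """Replace each glossary source term with its target term before MT."""
    if not glossary:
        return source
    # Longest terms first, so a longer entry wins over a shorter one it contains.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: glossary[m.group(1)], source)


print(inject_terms("Beställaren ska godkänna ritningarna.", GLOSSARY))
```

Because the match is on whole words, an inflected form such as "Beställarens" is left untouched, so a real rule set would need one entry (or a broader pattern) per form. Whether the MT model then carries "the Client" over into the target is exactly the hit-and-miss part described above.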

Also, I suspect that using terms with MT is always going to require some amount of manual work beyond just selecting a TB: one thing I've noticed when working on the term support is that TBs provided to translators are not well suited for MT as such, since they contain too many terms, many of which are overlapping. So term support for MT generally seems to require more carefully managed TBs.
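
One simple way to start curating a TB for MT is to flag entries that overlap. The sketch below is only one possible reading of "overlapping" (one source term contained inside another, ignoring case); `find_overlaps` and the sample terms are hypothetical.

```python
def find_overlaps(terms: list[str]) -> list[tuple[str, str]]:
    """Return (shorter, longer) pairs where one term occurs inside another."""
    # Deduplicate case-insensitively, then sort by length for a stable order.
    lowered = sorted({t.lower() for t in terms}, key=lambda t: (len(t), t))
    return [
        (short, long)
        for i, short in enumerate(lowered)
        for long in lowered[i + 1:]
        if short != long and short in long
    ]


# Hypothetical TB source terms.
sample = ["Beställare", "Beställaren", "Entreprenör"]
overlaps = find_overlaps(sample)
print(overlaps)
```

Each flagged pair is a candidate for manual review: either drop one entry or decide which one should take precedence when both match the same segment.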
