Tag positions in Opus and "\tag" in NET Regular Expressions #72

Open
SafeTex opened this issue Mar 17, 2023 · 2 comments

Comments

@SafeTex

SafeTex commented Mar 17, 2023

Hello Tommi and all

Opus seems to put all the tags from a source segment at the end of the target segment.

I can understand that where words or phrases are tagged, it must be very hard for any MT engine to reposition the tags correctly in the target.

But I'd like to look at the case where a source segment opens and closes with a tag, and Opus still puts both of these tags at the end of the target segment. Can this be improved?

Also, I could not find any way to fix this today in Phrase (formerly Memsource) other than moving the tags manually.

However, in memoQ I can deal with these simpler cases, because memoQ has added "\tag" to its .NET regular expression engine. So:

Find in target: ^(.+)(\tag)(\tag)$
Replace with: $2$1$3

This worked, and in semi-automatic mode I was able to deal with the majority of cases.
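
For readers without memoQ, here is a minimal sketch of the same find-and-replace in Python. It rests on assumptions: `\tag` is specific to memoQ and has no equivalent in Python's `re` module, so the sketch pretends that inline tags are serialized as numbered placeholders such as `{1}`. The `TAG` pattern and the function name `move_first_tag_to_start` are hypothetical and only serve the illustration.

```python
import re

# Assumption: inline tags appear in the text as numbered placeholders like {1}.
# This stands in for memoQ's \tag token, which Python's re module does not have.
TAG = r"\{\d+\}"

# Same shape as the memoQ pattern ^(.+)(\tag)(\tag)$
PATTERN = re.compile(rf"^(.+)({TAG})({TAG})$")


def move_first_tag_to_start(target: str) -> str:
    """Move the first of two trailing tags to the start of the segment.

    Segments that do not end in exactly this two-tag shape are returned unchanged.
    """
    # \2\1\3 mirrors the memoQ replacement $2$1$3
    return PATTERN.sub(r"\2\1\3", target)


if __name__ == "__main__":
    print(move_first_tag_to_start("Hello world{1}{2}"))
```

As with the memoQ version, this only handles the simple case of a segment that is wrapped in one opening and one closing tag. A segment with a single trailing tag, or with no tags, is left as it is.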

All that to ask whether Opus could perhaps protect tags at the very start and end of segments in the future. I also wanted to let you know, in case you did not already, about "\tag" in memoQ, which you might find useful for Opus.

Regards

SafeTex

@TommiNieminen
Collaborator

Thanks, I'll keep the \tag convention in mind; it seems pretty useful. The tag functionality in OPUS-CAT should currently position tags according to the word alignments it generates. I haven't checked, but placing tags at the end is probably the fallback behavior, so something seems to be interfering with the tag restoration. What model are you using when this happens?
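
To illustrate the idea of alignment-based tag placement with an end-of-segment fallback, here is a rough sketch in Python. It is not OPUS-CAT's actual code: the function `restore_tags`, the data layout (tags attached to source token indices, alignments as a source-to-target index mapping), and the `{1}`-style placeholders are all assumptions made for this example. It also simplifies by always placing a tag before its aligned target token.

```python
def restore_tags(target_tokens, tags, alignment):
    """Place tags in the target according to word alignments.

    target_tokens: list of target-side tokens.
    tags: list of (source_index, tag_text) pairs; each tag is attached
          to the source token at source_index (hypothetical layout).
    alignment: dict mapping source token index -> target token index.
    """
    placed = {}    # target index -> tags to emit before that token
    leftover = []  # tags whose source token has no aligned target token

    for src_idx, tag in tags:
        tgt_idx = alignment.get(src_idx)
        if tgt_idx is None:
            leftover.append(tag)
        else:
            placed.setdefault(tgt_idx, []).append(tag)

    out = []
    for i, token in enumerate(target_tokens):
        out.extend(placed.get(i, []))
        out.append(token)

    # Fallback: anything that could not be aligned goes at the end.
    out.extend(leftover)
    return out


if __name__ == "__main__":
    tokens = ["Hello", "world"]
    tags = [(0, "{1}"), (1, "{2}")]
    print(restore_tags(tokens, tags, {0: 0, 1: 1}))  # alignments available
    print(restore_tags(tokens, tags, {}))            # no alignments at all
```

In a scheme like this, an empty or missing alignment sends every tag through the fallback path, which would match the symptom of all tags ending up at the end of the segment.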

@SafeTex
Author

SafeTex commented Mar 23, 2023

Hello Tommi

I'm using a trained Swedish-to-English model, and all tags are always put at the end.
I even had a job where commas and full stops were tagged, because an OCR scan perceived a difference in font size, and even these tags ended up at the end of the segments.
How can I overcome this?

Thanks
