Missing minus signs and post-edit regex rule #67

Open
SafeTex opened this issue Feb 22, 2023 · 6 comments
Comments

@SafeTex

SafeTex commented Feb 22, 2023

Hello

I had a job today where minus signs (hyphens) before numbers were missing if the string occurred inside brackets, e.g.
(- 2 500 TSEK)
translated as
(TSEK 2 500)
instead of
(TSEK -2 500)

So I want to write my first regex rule (post editing) for OPUS but it does not work.
POST EDIT RULE FOR MISSING MINUS SIGN

Can anyone tell me why?
Thanks

@TommiNieminen
Collaborator

Hi,

The problem is that $1 copies the first capture group in the post-edit pattern, but there are no capture groups defined in the post-edit pattern. The capturing group that you want to use is in the source pattern, and in post-edit replacement that group can be copied by using $<1>. So changing the Post-edit replacement value to -$<1> should resolve the issue.

Btw., it might be the space after the minus that is causing the minus to be dropped, so it might be easier to solve this using a pre-editing rule that removes the space after the minus sign (this might work better if there are multiple problematic minus signs in the same sentence).
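
To illustrate the pre-editing idea, here is a minimal Python sketch of a pattern that removes whitespace between a minus sign and a following digit. The pattern, the constant name and the `pre_edit` helper are assumptions made for illustration; they are not taken from OPUS-CAT itself, and the rule would still need to be entered in the tool's own rule editor with its own replacement syntax.

```python
import re

# Hypothetical pre-edit pattern: a minus sign, then whitespace, then a digit.
# The lookahead (?=\d) keeps the digit out of the match, so only the minus
# sign and the whitespace are replaced.
PRE_EDIT_PATTERN = re.compile(r"-\s+(?=\d)")

def pre_edit(source: str) -> str:
    """Remove whitespace between a minus sign and the number that follows."""
    return PRE_EDIT_PATTERN.sub("-", source)

print(pre_edit("(- 2 500 TSEK)"))  # (-2 500 TSEK)
print(pre_edit("profit - loss"))   # unchanged: no digit after the minus
```

Note that a spaced subtraction such as "10 - 5" would also lose its space with this pattern, so it suits texts where a spaced minus before a digit always means a negative number.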

@SafeTex
Author

SafeTex commented Feb 23, 2023

Hello Tommi

I decided to simplify my rule as the brackets don't really matter. What matters is that if I have a minus followed by a digit, the minus sign is not lost.

So I tried a simplified rule which works when there is one digit like "-5", but as you can see below it is invasive when there are several digits like 500

POST EDIT RULE FOR MISSING MINUS SIGN

While I understand what's happening, I still don't see why, as in the source pattern I've clearly stated that this rule should only be applied if the digit is preceded by a minus sign.

And in "-500", only the first digit (5) is preceded by a minus sign, so why is the rule also applied to the other digits (0s) that follow?

Sorry to be a pain but regex is not easy and I only learned the basics a few years ago. I'm not a programmer so it's a struggle

@TommiNieminen
Collaborator

In post-edit rules, the source pattern is only considered a condition for applying the rule, so if the source text matches the pattern, the rule is applied to the MT output as many times as possible. Since the post-edit pattern specified (\d) applies to every single digit, the replacement is also performed for every digit in the MT output, which is why a minus sign is added before each digit.
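
The behaviour described above can be reproduced with a simplified Python model. This is only a sketch of the logic as explained here, not OPUS-CAT's actual implementation: `apply_post_edit` and its parameters are hypothetical, the source pattern `-\d` is assumed, and Python writes the backreference as `\1` where the rule in the issue uses `$1`.

```python
import re

def apply_post_edit(source, mt_output, source_pattern, post_edit_pattern, replacement):
    """Simplified model: the source pattern only decides whether the rule runs."""
    if re.search(source_pattern, source) is None:
        return mt_output
    # Once the rule runs, every match in the MT output is replaced.
    return re.sub(post_edit_pattern, replacement, mt_output)

result = apply_post_edit(
    source="(-500 TSEK)",
    mt_output="(TSEK 500)",
    source_pattern=r"-\d",      # assumed: a minus sign followed by a digit
    post_edit_pattern=r"(\d)",  # matches each digit separately
    replacement=r"-\1",
)
print(result)  # (TSEK -5-0-0)
```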

You can solve this problem by using a post-edit pattern that matches all adjacent digits at once instead of matching each digit separately. The regex operator for achieving that is the plus sign. To match all strings of consecutive digits, use \d+. Since numbers may have spaces, commas or periods in addition to digits, it may be necessary to use a pattern that matches all of those characters, like this: [\d,. ]+
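
As a rough check of these patterns, the Python sketch below applies each one to a sample MT output. The sample string and the third, digit-anchored pattern are assumptions added for illustration. One thing it shows is that a character class containing a space can start its match on the space before the number, so anchoring the match on digits may be worth considering.

```python
import re

mt_output = "(TSEK 2 500)"

# \d+ treats each run of adjacent digits as one match, so a number written
# with a space as thousands separator is still split in two.
digits_only = re.sub(r"\d+", r"-\g<0>", mt_output)
print(digits_only)  # (TSEK -2 -500)

# [\d,. ]+ also covers separators, but the match can begin with the space
# before the number, which puts the minus sign in the wrong place.
with_separators = re.sub(r"[\d,. ]+", r"-\g<0>", mt_output)
print(with_separators)  # (TSEK- 2 500)

# Hypothetical refinement: require the match to start and end on a digit.
anchored = re.sub(r"\d(?:[\d,. ]*\d)?", r"-\g<0>", mt_output)
print(anchored)  # (TSEK -2 500)
```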

@SafeTex
Author

SafeTex commented Feb 23, 2023

Hello Tommi

I don't think that works either cos if I have
-500 000
mistranslated as
500 000
and I change it with a post editing rule to
-500 000
the same rule will also change
600 000
to
-600 000
where it should not change it.
    I've tested this to confirm my doubt and the result was:

POST EDIT RULE FOR MISSING MINUS SIGN

or am I missing something?
Thanks in advance

@TommiNieminen
Collaborator

Yes, that's a limitation of regular expressions: they are difficult to target exactly. That's why in your case it might be better to use a pre-edit rule that formats the source number in such a way as to nudge the MT to use the minus sign correctly. The post-edit minus sign correction is more useful in cases where you expect there to be just one number in the segment.
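
The "just one number in the segment" condition can be sketched in Python as follows. Everything here is an assumption for illustration (the `NUMBER` pattern, the `restore_minus` helper and the sample segments); it models the idea outside OPUS-CAT rather than showing how a rule would be configured in the tool.

```python
import re

# Hypothetical number pattern: starts and ends on a digit, and may contain
# spaces, commas or periods in between.
NUMBER = re.compile(r"\d(?:[\d,. ]*\d)?")

def restore_minus(source: str, mt_output: str) -> str:
    """Add a minus sign only when the MT output holds exactly one number."""
    if not re.search(r"-\s?\d", source):
        return mt_output  # the source has no negative number
    numbers = NUMBER.findall(mt_output)
    if len(numbers) != 1 or re.search(r"-\d", mt_output):
        return mt_output  # ambiguous segment, or the minus sign survived
    return NUMBER.sub(r"-\g<0>", mt_output)

print(restore_minus("(-500 000 TSEK)", "(TSEK 500 000)"))
# (TSEK -500 000)
print(restore_minus("-500 000 and 600 000", "500 000 and 600 000"))
# 500 000 and 600 000 (left alone: two numbers in the segment)
```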

@SafeTex
Author

SafeTex commented Feb 24, 2023

Hello Tommi

I did manage to write a regex with an atomic group in it that works for numbers like
5
50
500
5 000
5 000 000
with the minus sign being inserted multiple times in the last two examples and I can easily make that apply to segments with just one number.

It may be inelegant, but it's:

(\d{1,3})((?>\s\d{3}))?((?>\s\d{3}))?
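
For anyone wanting to try this pattern outside OPUS-CAT, here is a small Python check against the numbers listed above. Two assumptions apply: the replacement simply prefixes the whole match with a minus sign, and the atomic groups `(?>...)` are written as plain groups, because Python's `re` module only accepts atomic groups from version 3.11 onward. Since `\s\d{3}` can only match in one way, the plain groups match the same strings here.

```python
import re

# Same shape as the pattern above, with plain groups instead of atomic ones.
PATTERN = re.compile(r"(\d{1,3})(\s\d{3})?(\s\d{3})?")

examples = ["5", "50", "500", "5 000", "5 000 000"]

# \g<0> refers to the whole match, so each number gets a single minus sign.
results = [PATTERN.sub(r"-\g<0>", text) for text in examples]
print(results)  # ['-5', '-50', '-500', '-5 000', '-5 000 000']
```

A number with four or more digits and no thousands separator, such as 5000, would be matched in two pieces by this pattern, so it relies on the spaced number format used in these examples.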

I'm not sure what causes this problem in the source text so I'd have to look at that more before attempting a source rule

Thanks for all your help and I think I'll have another question soon if you don't mind
