Error message: "Not a valid .tmx file" #35

Closed
ejlehton opened this issue Mar 23, 2022 · 4 comments

Comments

@ejlehton

ejlehton commented Mar 23, 2022

Hi! I've been using the OPUS-CAT Trados plugin for a while now and it's mostly been working really well. However, today I've run into an issue when trying to fine-tune a Swedish-Finnish model directly in the OPUS-CAT MT Engine using a tmx file downloaded from Trados Live. I keep getting an error message saying "[file name] is not a valid .tmx file".

I've encountered the same error a couple of times in the past, and in those cases, I was able to solve the issue by reducing the number of segment pairs in the file to under 10,000. The file I'm using now only has around 2,800 segment pairs, though, so I don't think file size should be the problem. I've also made sure that the file name doesn't contain "ä" or "ö" or other characters that could cause issues. I also tried fine-tuning two different models and got the same error message each time, so it does seem like the tmx file itself is the problem.

Is there something obvious that I'm missing here? Can there be something wrong with the actual contents of the tmx file? I find this problem particularly strange because I've successfully fine-tuned other models with other language pairs from the same Trados Live TM.

Here's a screenshot of the error message:

[image]

ETA: My colleague has the same issue with the same tmx file, so I also don't think it's just a "me problem". :P

@TommiNieminen
Collaborator

TommiNieminen commented Mar 23, 2022

Hi,

That error message indicates an XML exception from Microsoft's XML processing library, so the file itself is probably at fault, i.e. it is not valid XML. I have changed this code in a couple of releases since 1.1.0.4, so the problem may already be fixed (although if the file really is invalid XML, it still won't work).

You could try version 1.1.0.8 (https://github.com/Helsinki-NLP/OPUS-CAT/releases/download/engine_v1.1.0.8/OpusCatMTEngine_v1.1.0.8.zip). That version also records the XML error in the log file (the newest file in %localappdata%/opuscat/logs), so you can see the exact reason for the failure.

-Tommi
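
For readers who want to check a file before loading it into the engine, here is a minimal Python sketch of a well-formedness check. It uses Python's standard-library parser rather than the Microsoft XML library mentioned above, so the exact messages will differ, but a file that is not well-formed XML should be rejected either way. The function name find_xml_error is made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_xml_error(data: bytes):
    """Return None if data is well-formed XML, else (line, column, message).

    Hypothetical helper: it only tests well-formedness with Python's
    built-in parser and says nothing about TMX-specific rules.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position  # position of the first error found
        return line, column, str(err)
    return None
```

Calling it on the bytes of the file, for example find_xml_error(open("test.tmx", "rb").read()) with test.tmx as a placeholder name, returns the line and column of the first problem, which is usually enough to find the offending segment in a text editor.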

@SafeTex

SafeTex commented Mar 25, 2022

Hello ejlehton, Tommi and all

Looking at this question and Tommi's response, I'd try a couple of things:
1. Rename the TMX to something so basic that the name cannot be the problem, e.g. test.tmx, and put it on your desktop so that the path to it is also "minimal".
2. Open the TMX in Notepad++ or similar, check what name the TMX has in the header, and simplify that in the same way.
3. Download the very useful free tool TMX Validator (https://www.maxprograms.com/products/tmxvalidator.html) to see whether the TMX is actually okay inside. If it is not, I think the validator will give you enough information to fix the problem in Notepad++ or similar.

Good luck

@SafeTex

SafeTex commented Mar 28, 2022

Hello

Since answering this question, I have had the same problem in a completely different scenario, with a TMX produced by a new aligner (Stingray). Neither Olifant nor memoQ could open the file, but Heartsome TM editor could, and TMX Validator said the TMX was okay. Here is what the creator of the aligner said:

"Try opening the TMX file with a good XML editor and you will see it’s OK from XML point of view.

TMX 1.4b DTD defines the “version” attribute as “#FIXED” with a value of “1.4”. That means that if you read the TMX file with an XML parser, it automatically gets the “version” attribute with a value of “1.4”.

So, the “version” attribute in the <tmx> element is missing in your file, but it is not required as the corresponding DTD is already declared in the second line of that file.

Anyway, I’ll add the attribute in next build for compatibility with tools that can’t properly read TMX as XML."

Could this be the problem?
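
One way to test this theory on a given file is to look at what is literally written on the root element. The Python sketch below does that. Python's standard-library parser does not load an external DTD, so a #FIXED default declared there is not applied, and the function reports only the attribute as it appears in the file. The function name explicit_tmx_version is hypothetical, and nothing in this thread establishes whether OPUS-CAT depends on the attribute being present.

```python
import xml.etree.ElementTree as ET


def explicit_tmx_version(data: bytes):
    """Return the version attribute written on the root <tmx> element, or None.

    Hypothetical helper: the external DTD is not read, so a default
    value declared only in the DTD will not show up here.
    """
    root = ET.fromstring(data)
    if root.tag != "tmx":
        raise ValueError(f"root element is <{root.tag}>, not <tmx>")
    return root.get("version")
```

A result of None means the attribute is absent from the file itself, which is the situation the aligner's author describes.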

@ejlehton
Author

Hi Tommi and SafeTex!

Thank you for looking into my problem and sorry about the late response. I followed both of your suggestions and was able to determine that the problem was caused by the invalid XML character "&#x1F". I don't know much about XML, so I have no clue what that character is supposed to do, but I used the Trados TM Maintenance editor to take a look at the segments where it appeared, and it just showed up in the middle of words as a box with a question mark:
[image: mystery character in TM]
The segments where it appeared all seemed to be from the same file, which was translated in 2010, and they were all in the Swedish source segments. Perhaps there was something strange about the source text file, or maybe an old version of Trados had broken something.

In any case, I removed all 27 instances of "&#x1F" from the tmx file, and the fine-tuning worked perfectly after that. Thank you again for your help! :)
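
For anyone who runs into the same thing with a larger file, the clean-up can be scripted. U+001F is a control character that XML 1.0 does not allow, not even as a character reference, which is why the parser rejects the file. The Python sketch below removes such references, along with raw control characters of the same kind, and counts what it removed. It assumes the reference appears in the file with its trailing semicolon (&#x1F;). The function name strip_invalid_xml_chars is made up for illustration, and this is not how the fix was done in this thread, where the instances were removed by hand.

```python
import re

# Numeric character references such as &#x1F; or &#31;
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")
# Raw characters that XML 1.0 does not allow in a document
_RAW_INVALID = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]")


def _is_valid_xml10_char(cp: int) -> bool:
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml_chars(text: str):
    """Return (cleaned_text, number_removed). Hypothetical helper."""
    removed = 0

    def drop_bad_ref(match):
        nonlocal removed
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if _is_valid_xml10_char(cp):
            return match.group(0)  # keep valid references untouched
        removed += 1
        return ""

    text = _CHAR_REF.sub(drop_bad_ref, text)
    text, raw_count = _RAW_INVALID.subn("", text)
    return text, removed + raw_count
```

Read the file as text in whatever encoding it declares, pass the contents through the function, and write the result to a new file so that the original stays available for comparison. The returned count can be checked against what a text editor search finds.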
