Training using TMX working? #10

Closed
mikethetexan opened this issue Jan 25, 2021 · 7 comments

@mikethetexan

Hi Tommi. I am not a developer, but a technology product owner for my company, looking at MT solutions. I stumbled on your OPUS-CAT solution while looking at MT plugins for Trados.

I see that in the documentation you recommend fine-tuning through a batch task in Studio, but if I understand correctly, that is specific to XLIFF documents.

What I want is to tune your model with my custom TMX (380k translation units after I filtered out a bunch of stuff), so I used that feature in the tool, but unless I'm just not being patient enough, it doesn't look like anything is happening. I see a status progress bar, but it's barely moving, and my CPU usage is very low, which suggests that nothing is being processed.

Is this feature currently working? And what customization times should I expect on a reasonably fast quad-core i7 processor (a couple of years old)?

Thanks!

[screenshot of the customization progress bar]

@mikethetexan
Author

I see at https://helsinki-nlp.github.io/OPUS-CAT/enginefinetune that TMX fine-tuning should work. I also see that you have a screenshot of OPUS-CAT v1.0.0.1, while I have 1.0.0.0. Could that explain the difference? Or is it that you can't fine-tune on a large TMX like mine (300+ MB)?

@TommiNieminen
Collaborator

Hi Mike, thanks for trying out OPUS-CAT. It looks like there might be a bug here.

There's a log file in %localappdata%/opuscat/logs which might contain an error message from the customization feature. You can also open the model directory (with the Open model directory button on the right) and see if a train.log file exists there. train.log contains the log output from the actual fine-tuning, so it may have more information about the cause of the error.
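
If it's easier, here's a quick sketch (Python, assuming a default Windows install; it just reads the .txt logs in the folder above) that prints the tail of each log file:

```python
import os
from pathlib import Path

# Default OPUS-CAT log location: %localappdata%/opuscat/logs
log_dir = Path(os.environ["LOCALAPPDATA"]) / "opuscat" / "logs"

# Print the last 20 lines of each log file, oldest first.
for log in sorted(log_dir.glob("*.txt"), key=lambda p: p.stat().st_mtime):
    print(f"--- {log.name} ---")
    print("\n".join(log.read_text(encoding="utf-8", errors="replace").splitlines()[-20:]))
```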

380k translation units is a pretty big TM, so the size alone might be breaking something. In principle, customization should work with any number of sentences, but I haven't tested it with anything that big. It's possible to customize a model with that many translation units, but it will take a long time (I'd guess at least 40 hours on default settings, less if you increase the number of CPU cores used).

Also, the model name seems to be pretty long and it contains spaces, so that might be a problem too (this is something I should have tested, but unfortunately haven't yet).

@mikethetexan
Author

opuscat_log20210125.txt

Here is the log. I will try the fine-tuning method in Studio now, and I will retry the TMX customization with a smaller TMX and no spaces in the model name.

@TommiNieminen
Collaborator

I can reproduce the error by adding a space to the model label, so that's the immediate cause of the failure. I'll fix it by replacing spaces with underscores in labels.
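
For reference, the fix amounts to a one-line normalization of the label; a rough sketch in Python (illustrative only, OPUS-CAT itself is written in C#):

```python
def sanitize_label(label: str) -> str:
    # Replace spaces before the label is used further on,
    # since a space in the label is what triggers the failure above.
    return label.replace(" ", "_")

print(sanitize_label("my custom model"))  # -> my_custom_model
```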

I'd still recommend starting with a smaller TMX, just to get some early results on the speed and the effects of customization.
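
If you want to carve out a sample for a trial run, something like this rough sketch would do it (plain Python; it assumes a well-formed TMX with no default XML namespace and enough RAM to parse the whole file, and the file names are placeholders):

```python
import random
import xml.etree.ElementTree as ET

def sample_tmx(src_path, dst_path, n=10_000, seed=42):
    """Write a copy of the TMX containing a random sample of n translation units."""
    tree = ET.parse(src_path)          # parses the whole file into memory
    body = tree.getroot().find("body")
    tus = body.findall("tu")
    random.Random(seed).shuffle(tus)   # random sample, not just the file head
    for tu in tus[n:]:                 # drop everything beyond the sample
        body.remove(tu)
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)

sample_tmx("full_tm.tmx", "sample_10k.tmx")
```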

@mikethetexan
Author

Awesome responsiveness! And you are located in Europe, right?

Anyway, I am now fine-tuning through Studio projects and it is working fine. I actually like fine-tuning with that method, as you can narrow down the domain you want to tune for, which can probably be beneficial for specific applications. I'm still interested in tuning on my whole TMX file, but I understand it will be very slow.

@TommiNieminen
Collaborator

I'm in Europe, but I keep late hours (ideal for US support :)).

Let me know if you run into more problems. If you want to train with the large TMX, it's best to increase the number of threads assigned to the training. You can do that by modifying the training config file, which you can open by selecting the Settings tab and clicking Open finetune settings in text editor (the button name may differ, it's been changed recently). The parameter for the thread count is cpu-threads. The workspace parameter (allocated memory) may also need to be increased; I think each thread requires its own chunk of memory.
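
If you'd rather script the change than edit by hand, something like this works on a Marian-style YAML config (a sketch only: the file name is a placeholder, and the values are illustrative starting points, not tested recommendations):

```python
import yaml  # pip install pyyaml

config_path = "customize.yml"  # placeholder; open the real file via the Settings tab

with open(config_path, encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

cfg["cpu-threads"] = 8   # e.g. one training thread per physical core
cfg["workspace"] = 2048  # MB of working memory; each thread needs its own chunk

with open(config_path, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, default_flow_style=False)
```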

@TommiNieminen
Collaborator

The problem with spaces in the label name stopping fine-tuning is now fixed in a new version, linked on the installation page: https://helsinki-nlp.github.io/OPUS-CAT/install
