Training using TMX working? #10

Closed
mikethetexan opened this issue Jan 25, 2021 · 7 comments

@mikethetexan

Hi Tommi. I am not a developer, but a technology product owner for my company, looking at MT solutions. I stumbled on your OPUS-CAT solution while looking at MT plugins for Trados.

I see that in the documentation you recommend fine-tuning through a batch task in Studio, but if I understand correctly, that is specific to XLIFF documents.

What I want is to tune your model with my custom TMX (380k translation units after I filtered out a bunch of stuff), so I used that feature in the tool, but unless I'm just not being patient enough, it doesn't look like anything is happening. I see a status progress bar, but it's barely moving, and my CPU usage is very low, which suggests that nothing is being processed.

Is this feature currently working? And what customization times should I expect on a reasonably fast quad-core i7 processor (a couple of years old)?

Thanks!

[screenshot of the customization progress bar]

@mikethetexan
Author

I see at https://helsinki-nlp.github.io/OPUS-CAT/enginefinetune that TMX fine-tuning should work. I also see that you have a screenshot of OPUS-CAT v1.0.0.1, while I have 1.0.0.0. Could that explain the difference? Or is it that you can't fine-tune on a large TMX like mine (300+ MB)?

@TommiNieminen
Collaborator

Hi Mike, thanks for trying out OPUS-CAT. It looks like there might be a bug here.

There's a log file in %localappdata%/opuscat/logs which might contain an error message from the customization feature. You can also open the model directory (with the Open model directory button on the right) and see if a train.log file exists there. train.log contains the log output from the actual fine-tuning, so it may have more information about the cause of the error.
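
If it's easier, here's a quick sketch (Python, assuming a default Windows install; it just reads the .txt logs in the folder above) that prints the tail of each log file:

```python
import os
from pathlib import Path

# Default OPUS-CAT log location: %localappdata%/opuscat/logs
log_dir = Path(os.environ["LOCALAPPDATA"]) / "opuscat" / "logs"

# Print the last 20 lines of each log file, oldest first.
for log in sorted(log_dir.glob("*.txt"), key=lambda p: p.stat().st_mtime):
    print(f"--- {log.name} ---")
    print("\n".join(log.read_text(encoding="utf-8", errors="replace").splitlines()[-20:]))
```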

380k translation units is a pretty big TM, so the size alone might be breaking something. In principle, customization should work with any number of sentences, but I haven't tested it with anything that big. It's possible to customize a model with that many translation units, but it will take a long time (I'd guess at least 40 hours on default settings, less if you increase the number of CPU cores used).

Also, the model name seems to be pretty long and it contains spaces, so that might be a problem too (this is something I should have tested, but unfortunately haven't yet).

@mikethetexan
Author

opuscat_log20210125.txt

Here is the log. I will try the fine-tuning method in Studio now, and I will retry the TMX customization with a smaller TMX and no spaces in the model name.

@TommiNieminen
Collaborator

I can reproduce the error by adding a space to the model label, so that's the immediate cause of the failure. I'll fix it by replacing spaces with underscores in labels.
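
For reference, the fix amounts to a one-line normalization of the label; a rough sketch in Python (illustrative only, OPUS-CAT itself is written in C#):

```python
def sanitize_label(label: str) -> str:
    # Replace spaces before the label is used further on,
    # since a space in the label is what triggers the failure above.
    return label.replace(" ", "_")

print(sanitize_label("my custom model"))  # -> my_custom_model
```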

I'd still recommend starting with a smaller TMX, just to get some early results on the speed and the effects of customization.
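
If you want to carve out a sample for a trial run, something like this rough sketch would do it (plain Python; it assumes a well-formed TMX with no default XML namespace and enough RAM to parse the whole file, and the file names are placeholders):

```python
import random
import xml.etree.ElementTree as ET

def sample_tmx(src_path, dst_path, n=10_000, seed=42):
    """Write a copy of the TMX containing a random sample of n translation units."""
    tree = ET.parse(src_path)          # parses the whole file into memory
    body = tree.getroot().find("body")
    tus = body.findall("tu")
    random.Random(seed).shuffle(tus)   # random sample, not just the file head
    for tu in tus[n:]:                 # drop everything beyond the sample
        body.remove(tu)
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)

sample_tmx("full_tm.tmx", "sample_10k.tmx")
```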

@mikethetexan
Author

Awesome responsiveness! And you are located in Europe, right?

Anyway, I am now fine-tuning through Studio projects and it is working fine. I actually like fine-tuning with that method, as you can narrow down the domain you want to tune for, which can probably be beneficial for specific applications. I'm still interested in tuning on my whole TMX file, but I understand it will be very slow.

@TommiNieminen
Collaborator

I'm in Europe, but I keep late hours (ideal for US support :)).

Let me know if you run into more problems. If you want to train with the large TMX, it's best to increase the number of threads assigned to the training. You can do that by modifying the training config file, which you can open by selecting the Settings tab and clicking Open finetune settings in text editor (the button name may differ, it's been changed recently). The parameter for the thread count is cpu-threads. The workspace parameter (allocated memory) may also need to be increased; I think each thread requires its own chunk of memory.
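
If you'd rather script the change than edit by hand, something like this works on a Marian-style YAML config (a sketch only: the file name is a placeholder, and the values are illustrative starting points, not tested recommendations):

```python
import yaml  # pip install pyyaml

config_path = "customize.yml"  # placeholder; open the real file via the Settings tab

with open(config_path, encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

cfg["cpu-threads"] = 8   # e.g. one training thread per physical core
cfg["workspace"] = 2048  # MB of working memory; each thread needs its own chunk

with open(config_path, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, default_flow_style=False)
```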

@TommiNieminen
Collaborator

The problem with spaces in the label name stopping fine-tuning is now fixed in a new version, linked on the installation page: https://helsinki-nlp.github.io/OPUS-CAT/install
