Fine-tuning is very slow and gets stuck #92
Comments
Hi, that sounds strange. It is probably due to some change in the computing environment (a virus scanner, for instance). You can check the log files to see if there's any indication there. There are the normal logs in %LOCALAPPDATA%/logs, and then there's the training log from the fine-tuning process itself, which is located in the model directory of the fine-tuned model (you can open it with the Open model directory button). The name of the training log will be something like this: opusTCv20210807+bt.spm32k-spm32k.transformer-align.train1.log.
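For anyone who prefers to inspect the training log from a script, the following is a minimal sketch of how the last lines of the newest log in a model directory could be read. It assumes only that the training log is a text file ending in `.log` inside the model directory, as the example file name above suggests; the function names `newest_training_log` and `tail` are hypothetical helpers and not part of OPUS-CAT.

```python
from pathlib import Path


def newest_training_log(model_dir):
    """Return the most recently modified *.log file in model_dir, or None.

    Assumption: the training log ends in ".log" and sits directly in the
    model directory (based on the example file name in this thread).
    """
    logs = sorted(Path(model_dir).glob("*.log"), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()[-n:]


if __name__ == "__main__":
    import sys

    # Usage: python tail_log.py <path to the fine-tuned model directory>
    log = newest_training_log(sys.argv[1]) if len(sys.argv) > 1 else None
    if log is None:
        print("No .log file found (or no model directory given).")
    else:
        print(f"Last lines of {log.name}:")
        print("\n".join(tail(log)))
```

If the last entry never changes between two runs of the script, the training process has probably stalled rather than merely slowed down.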
Hi Tommi, thanks for the quick reply! It makes sense that the problem would probably be somewhere in our systems - now I just need to figure out what it is. :) I did some fine-tuning tests and looked at the logs and other files in the model directories. I noticed that when the fine-tuning has failed, the file with a name like opusTCv20210807+bt.spm32k-spm32k.transformer-big.model1.npz.best-perplexity.npz.optimizer.npz always seems to be missing, even though the last entry in the training log is that OPUS-CAT is "saving Adam parameters" to this file. Looking at the directories for models that I have successfully fine-tuned in the past, the optimizer.npz file always seems to be present. I did notice that two of my recent fine-tuning tests had been successful even though the optimizer.npz file was missing, but those tests were conducted with very small TMs (around 1,500 segment pairs). However, with bigger TMs, a missing optimizer.npz file seems to correlate with the fine-tuning getting stuck. My training logs don't seem to contain any error messages, but my colleague did share this error message from one of her failed fine-tuning attempts:
Does any of this shed any light on what might be going wrong with the fine-tuning? I do realise that this probably isn't a lot to go on.
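The manual comparison described above (which model directories contain an optimizer.npz file and which do not) could be automated roughly as follows. This is only a sketch: it assumes that all fine-tuned models live as subdirectories of one parent folder and that the optimizer file name ends in `.optimizer.npz`, as in the file name quoted above. The names `has_optimizer_file` and `models_missing_optimizer` are hypothetical.

```python
from pathlib import Path


def has_optimizer_file(model_dir):
    """True if model_dir contains at least one file ending in .optimizer.npz."""
    return any(Path(model_dir).glob("*.optimizer.npz"))


def models_missing_optimizer(models_root):
    """Return the sorted names of subdirectories that lack an optimizer file.

    Assumption: each subdirectory of models_root is one model directory.
    """
    return sorted(
        d.name
        for d in Path(models_root).iterdir()
        if d.is_dir() and not has_optimizer_file(d)
    )


if __name__ == "__main__":
    import sys

    # Usage: python check_optimizer.py <folder containing the model directories>
    if len(sys.argv) > 1:
        for name in models_missing_optimizer(sys.argv[1]):
            print(f"optimizer file missing: {name}")
```

As noted above, a missing file does not always mean a failed run, so the output is a hint about where to look and not a verdict.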
Hi, sorry about the late reply. Was this issue resolved? One problem might be that the model you are fine-tuning is a transformer-big model; these are a lot bigger than the normal transformer models. I haven't tested fine-tuning with those, so it might fail due to memory problems, for example.
Hi! Thanks for your reply. I've run into the same issue with, for example, the opus+bt-2021-03-09 model (just tested again today), so I don't think the transformer-big models are the source of the problem, particularly as we've been able to fine-tune them without problems before. I'm about to head out on vacation, but when I'm back, I'm going to ask a few other teammates to also do some fine-tuning tests to see if it's just me and one colleague who are affected or whether everyone on the team has the same problem regardless of e.g. laptop age. I'll be back with an update! :)
Hi, finally checking in on this issue. As you suspected, this was a problem in our environment, i.e. a simple case of low disk space. :D After a bit of cleanup, fine-tuning is working normally again. Thanks again for taking the time to help with this!
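Since the cause turned out to be low disk space, a quick check before starting a fine-tuning run might save some time. The sketch below uses Python's standard `shutil.disk_usage`. The 5 GiB threshold is purely an illustrative assumption (the thread does not say how much space fine-tuning needs), and `free_gib` and `has_enough_space` are hypothetical helper names.

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the drive that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path, required_gib=5.0):
    """True if at least required_gib GiB are free on the drive containing path.

    The default threshold is an arbitrary example value, not a documented
    OPUS-CAT requirement.
    """
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    import sys

    # Usage: python check_space.py <path on the drive that holds the models>
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"Free space at {target}: {free_gib(target):.1f} GiB")
    if not has_enough_space(target):
        print("Warning: free space is below the example threshold.")
```

Pointing the script at the drive that holds the model directories matters most, because that is where the model and optimizer files are written.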
Hi! My colleague and I recently noticed that the fine-tuning process in OPUS-CAT has become very slow and tends to get stuck (the time estimate stops at a certain time and the fine-tuning doesn't seem to progress). We've been using version 1.2.0, but I downloaded the 1.2.4.0 version to test it, and the same issue persists.
We've been doing the fine-tuning directly in OPUS-CAT (not in Trados), using tmx files that typically have a maximum of 10,000 segment pairs. In the past, the fine-tuning process has typically taken between 10 and 30 minutes (depending on the size of the tmx file), but now the time estimate is typically several hours and the fine-tuning tends to fail at some point.
Any idea what might be causing this? We're using similar tmx files as before and haven't made any other changes to our systems or processes that I can think of (although I suppose it's possible that our company's IT department has made changes somewhere that I'm not aware of).
Thank you in advance for your help! :)
EDIT: Just wanted to add that OPUS-CAT is otherwise working fine, it's just the fine-tuning that is causing problems.