Fine-tuning is very slow and gets stuck #92

Closed
ejlehton opened this issue Mar 7, 2024 · 5 comments
Labels: bug


ejlehton commented Mar 7, 2024

Hi! My colleague and I recently noticed that the fine-tuning process in OPUS-CAT has become very slow and tends to get stuck (the time estimate stops at a certain time and the fine-tuning doesn't seem to progress). We've been using version 1.2.0, but I downloaded the 1.2.4.0 version to test it, and the same issue persists.

We've been doing the fine-tuning directly in OPUS-CAT (not in Trados), using tmx files that typically have a maximum of 10,000 segment pairs. In the past, the fine-tuning process has typically taken between 10 and 30 minutes (depending on the size of the tmx file), but now the time estimate is typically several hours and the fine-tuning tends to fail at some point.

Any idea what might be causing this? We're using similar tmx files as before and haven't made any other changes to our systems or processes that I can think of (although I suppose it's possible that our company's IT department has made changes somewhere that I'm not aware of).

Thank you in advance for your help! :)

EDIT: Just wanted to add that OPUS-CAT is otherwise working fine, it's just the fine-tuning that is causing problems.

TommiNieminen (Collaborator) commented

Hi, that sounds strange; it is probably due to some change in the computing environment (for instance, a virus scanner). You can check the log files to see if there's any indication there. The normal logs are in %LOCALAPPDATA%/logs, and then there's the training log from the fine-tuning process itself, which is located in the model directory of the fine-tuned model (you can open that with the Open model directory button). The name of the training log will be something like this: opusTCv20210807+bt.spm32k-spm32k.transformer-align.train1.log.
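
If it helps, here's a minimal sketch (Python; the model directory path is just a placeholder, use the one that the Open model directory button opens) for printing the tail of the newest training log:

```python
# Minimal sketch: print the last lines of the newest training log in a model directory.
# The path below is a placeholder -- use the directory opened by "Open model directory".
from pathlib import Path

model_dir = Path(r"C:\path\to\fine-tuned\model")  # hypothetical path

# Training logs end in ".log"; pick the most recently modified one.
logs = sorted(model_dir.glob("*.log"), key=lambda p: p.stat().st_mtime, reverse=True)
if not logs:
    print("No training log found in", model_dir)
else:
    latest = logs[0]
    lines = latest.read_text(encoding="utf-8", errors="replace").splitlines()
    print(f"--- last 20 lines of {latest.name} ---")
    for line in lines[-20:]:
        print(line)
```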

ejlehton (Author) commented

Hi Tommi, thanks for the quick reply! It makes sense that the problem would probably be somewhere in our systems - now I just need to figure out what it is. :)

I did some fine-tuning tests and looked at the logs and other files in the model directories. I noticed that when the fine-tuning has failed, the file with a name like opusTCv20210807+bt.spm32k-spm32k.transformer-big.model1.npz.best-perplexity.npz.optimizer.npz always seems to be missing, even though the last entry in the training log says it is "saving Adam parameters" to this file. Looking at the directories for models that I have successfully fine-tuned in the past, the optimizer.npz file always seems to be present.

I did notice that two of my recent fine-tuning tests had been successful even though the optimizer.npz file was missing, but those tests were conducted with very small TMs (around 1,500 segment pairs). However, with bigger TMs, a missing optimizer.npz file seems to correlate with the fine-tuning getting stuck.
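
In case it's useful to anyone else, here's a rough sketch of the check I've been doing by hand, i.e. flagging model directories where the optimizer file is missing (the models root path here is just a guess for illustration):

```python
# Minimal sketch: flag fine-tuned model directories that have checkpoints
# but no *.optimizer.npz file (which, for us, correlated with stuck fine-tuning).
# The models root below is hypothetical -- adjust to wherever OPUS-CAT stores your models.
from pathlib import Path

models_root = Path(r"C:\path\to\opuscat\models")  # hypothetical path

for model_dir in sorted(p for p in models_root.rglob("*") if p.is_dir()):
    has_checkpoint = any(model_dir.glob("*.npz"))
    has_optimizer = any(model_dir.glob("*.optimizer.npz"))
    if has_checkpoint and not has_optimizer:
        print("possibly failed fine-tune:", model_dir)
```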

My training logs don't seem to contain any error messages, but my colleague did share this error message from one of her failed fine-tuning attempts:

[2024-01-15 13:51:50] Error: Unhandled exception of type 'class std::bad_alloc': bad allocation
[2024-01-15 13:51:50] Error: Aborted from unhandledException in C:\Users\niemi\Documents\Code\marian\src\common\logging.cpp:113

[CALL STACK]
    > 00007FF691B3C610 (SymFromAddr() error)
    - 00007FF691B339C8 (SymFromAddr() error)
    - RtlCaptureContext2
    - 00007FF6919B6C9E (SymFromAddr() error)
    - 00007FF691B5E428 (SymFromAddr() error)
    - 00007FF691B013FB (SymFromAddr() error)
    - UnhandledExceptionFilter
    - memset
    - _C_specific_handler
    - _chkstk
    - RtlRaiseException
    - RtlRaiseException
    - RaiseException
    - 00007FF691B33A31 (SymFromAddr() error)
    - RtlCaptureContext2
    - 00007FF6916CD8C0 (SymFromAddr() error)

Does any of this shed any light on what might be going wrong with the fine-tuning? I do realise that this probably isn't a lot to go on.

TommiNieminen (Collaborator) commented

Hi, sorry about the late reply. Was this issue resolved? One problem might be that the model you are fine-tuning is a transformer-big model, which is a lot bigger than the normal transformer models. I haven't tested fine-tuning with those, so it might fail due to memory problems etc.

TommiNieminen added the bug label on Apr 30, 2024
ejlehton (Author) commented

Hi! Thanks for your reply. I've run into the same issue with, for example, the opus+bt-2021-03-09 model (just tested again today), so I don't think the transformer-big models are the source of the problem, particularly as we've been able to fine-tune them without problems before.

I'm about to head out on vacation, but when I'm back, I'll ask a few other teammates to also run some fine-tuning tests to see whether it's just me and one colleague who are affected or whether everyone on the team has the same problem regardless of e.g. laptop age. I'll be back with an update! :)

ejlehton (Author) commented

Hi, finally checking in on this issue. As you suspected, this was a problem in our environment, i.e. a simple case of low disk space. :D After a bit of cleanup, fine-tuning is working normally again. Thanks again for taking the time to help with this!
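
For anyone who runs into the same thing, a quick free-space check before starting a fine-tune would have caught this in our case (the drive letter and threshold below are just guesses, not official requirements):

```python
# Minimal sketch: warn about low disk space before starting a fine-tuning run.
# Marian needs room for checkpoints plus the *.optimizer.npz file, which can be large.
import shutil

drive = "C:\\"      # assumed drive where OPUS-CAT stores its models
min_free_gb = 10    # arbitrary safety margin, not an official requirement

free_gb = shutil.disk_usage(drive).free / (1024 ** 3)
print(f"Free space on {drive}: {free_gb:.1f} GB")
if free_gb < min_free_gb:
    print("Warning: low disk space -- fine-tuning may stall or fail.")
```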
