Fine-tuning is very slow and gets stuck #92

Closed
ejlehton opened this issue Mar 7, 2024 · 5 comments
Labels: bug


ejlehton commented Mar 7, 2024

Hi! My colleague and I recently noticed that the fine-tuning process in OPUS-CAT has become very slow and tends to get stuck (the time estimate stops at a certain time and the fine-tuning doesn't seem to progress). We've been using version 1.2.0, but I downloaded the 1.2.4.0 version to test it, and the same issue persists.

We've been doing the fine-tuning directly in OPUS-CAT (not in Trados), using tmx files that typically have a maximum of 10,000 segment pairs. In the past, the fine-tuning process has typically taken between 10 and 30 minutes (depending on the size of the tmx file), but now the time estimate is typically several hours and the fine-tuning tends to fail at some point.

Any idea what might be causing this? We're using similar tmx files as before and haven't made any other changes to our systems or processes that I can think of (although I suppose it's possible that our company's IT department has made changes somewhere that I'm not aware of).

Thank you in advance for your help! :)

EDIT: Just wanted to add that OPUS-CAT is otherwise working fine, it's just the fine-tuning that is causing problems.

TommiNieminen (Collaborator) commented

Hi, that sounds strange; it is probably due to some change in the computing environment (for instance, a virus scanner). You can check the log files to see if there's any indication there. The normal logs are in %LOCALAPPDATA%/logs, and then there's the training log from the fine-tuning process itself, which is located in the model directory of the fine-tuned model (you can open that with the Open model directory button). The name of the training log will be something like this: opusTCv20210807+bt.spm32k-spm32k.transformer-align.train1.log.
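
If it helps, here's a minimal sketch (Python; the model directory path is just a placeholder, use the one that the Open model directory button opens) for printing the tail of the newest training log:

```python
# Minimal sketch: print the last lines of the newest training log in a model directory.
# The path below is a placeholder -- use the directory opened by "Open model directory".
from pathlib import Path

model_dir = Path(r"C:\path\to\fine-tuned\model")  # hypothetical path

# Training logs end in ".log"; pick the most recently modified one.
logs = sorted(model_dir.glob("*.log"), key=lambda p: p.stat().st_mtime, reverse=True)
if not logs:
    print("No training log found in", model_dir)
else:
    latest = logs[0]
    lines = latest.read_text(encoding="utf-8", errors="replace").splitlines()
    print(f"--- last 20 lines of {latest.name} ---")
    for line in lines[-20:]:
        print(line)
```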

ejlehton (Author) commented

Hi Tommi, thanks for the quick reply! It makes sense that the problem would probably be somewhere in our systems - now I just need to figure out what it is. :)

I did some fine-tuning tests and looked at the logs and other files in the model directories. I noticed that when the fine-tuning has failed, the file with a name like opusTCv20210807+bt.spm32k-spm32k.transformer-big.model1.npz.best-perplexity.npz.optimizer.npz always seems to be missing, even though the last entry in the training log says it is "saving Adam parameters" to this file. Looking at the directories for models that I have successfully fine-tuned in the past, the optimizer.npz file always seems to be present.

I did notice that two of my recent fine-tuning tests had been successful even though the optimizer.npz file was missing, but those tests were conducted with very small TMs (around 1,500 segment pairs). However, with bigger TMs, a missing optimizer.npz file seems to correlate with the fine-tuning getting stuck.
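
In case it's useful to anyone else, here's a rough sketch of the check I've been doing by hand, i.e. flagging model directories where the optimizer file is missing (the models root path here is just a guess for illustration):

```python
# Minimal sketch: flag fine-tuned model directories that have checkpoints
# but no *.optimizer.npz file (which, for us, correlated with stuck fine-tuning).
# The models root below is hypothetical -- adjust to wherever OPUS-CAT stores your models.
from pathlib import Path

models_root = Path(r"C:\path\to\opuscat\models")  # hypothetical path

for model_dir in sorted(p for p in models_root.rglob("*") if p.is_dir()):
    has_checkpoint = any(model_dir.glob("*.npz"))
    has_optimizer = any(model_dir.glob("*.optimizer.npz"))
    if has_checkpoint and not has_optimizer:
        print("possibly failed fine-tune:", model_dir)
```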

My training logs don't seem to contain any error messages, but my colleague did share this error message from one of her failed fine-tuning attempts:

[2024-01-15 13:51:50] Error: Unhandled exception of type 'class std::bad_alloc': bad allocation
[2024-01-15 13:51:50] Error: Aborted from unhandledException in C:\Users\niemi\Documents\Code\marian\src\common\logging.cpp:113

[CALL STACK]
    > 00007FF691B3C610 (SymFromAddr() error)
    - 00007FF691B339C8 (SymFromAddr() error)
    - RtlCaptureContext2
    - 00007FF6919B6C9E (SymFromAddr() error)
    - 00007FF691B5E428 (SymFromAddr() error)
    - 00007FF691B013FB (SymFromAddr() error)
    - UnhandledExceptionFilter
    - memset
    - _C_specific_handler
    - _chkstk
    - RtlRaiseException
    - RtlRaiseException
    - RaiseException
    - 00007FF691B33A31 (SymFromAddr() error)
    - RtlCaptureContext2
    - 00007FF6916CD8C0 (SymFromAddr() error)

Does any of this shed any light on what might be going wrong with the fine-tuning? I do realise that this probably isn't a lot to go on.

TommiNieminen (Collaborator) commented

Hi, sorry about the late reply. Was this issue resolved? One problem might be that the model you are fine-tuning is a transformer-big model, which is a lot bigger than the normal transformer models. I haven't tested fine-tuning with those, so it might fail due to memory problems etc.

TommiNieminen added the bug label on Apr 30, 2024
ejlehton (Author) commented

Hi! Thanks for your reply. I've run into the same issue with, for example, the opus+bt-2021-03-09 model (just tested again today), so I don't think the transformer-big models are the source of the problem, particularly as we've been able to fine-tune them without problems before.

I'm about to head out on vacation, but when I'm back, I'll ask a few other teammates to also run some fine-tuning tests to see whether it's just me and one colleague who are affected or whether everyone on the team has the same problem regardless of e.g. laptop age. I'll be back with an update! :)

ejlehton (Author) commented

Hi, finally checking in on this issue. As you suspected, this was a problem in our environment, i.e. a simple case of low disk space. :D After a bit of cleanup, fine-tuning is working normally again. Thanks again for taking the time to help with this!
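
For anyone who runs into the same thing, a quick free-space check before starting a fine-tune would have caught this in our case (the drive letter and threshold below are just guesses, not official requirements):

```python
# Minimal sketch: warn about low disk space before starting a fine-tuning run.
# Marian needs room for checkpoints plus the *.optimizer.npz file, which can be large.
import shutil

drive = "C:\\"      # assumed drive where OPUS-CAT stores its models
min_free_gb = 10    # arbitrary safety margin, not an official requirement

free_gb = shutil.disk_usage(drive).free / (1024 ** 3)
print(f"Free space on {drive}: {free_gb:.1f} GB")
if free_gb < min_free_gb:
    print("Warning: low disk space -- fine-tuning may stall or fail.")
```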
