Compile Marian with SP integration #1

Open
AHSteel opened this issue Jul 9, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@AHSteel

AHSteel commented Jul 9, 2020

Hi,

I've been testing the Fiskmö engine + Trados plugin in the developer branch and have encountered the following issue:

The combination works on Windows 10 both with the default settings (OPUS model) and a custom model trained locally. However, output quality is pretty poor and, in the case of the custom model, far below the quality achieved when using the same custom model in conjunction with the standard Marian decoder on Linux. I found that decoder.yml accepts the standard parameters, but as soon as I raise either the mini-batch or maxi-batch settings (even just to 2) the engine waits indefinitely and doesn't return any output (final log entry: 2020-07-09 10:35:18.428 +02:00 [INF] [2020-07-09 10:35:18] [memory] Reserving 295 MB, device cpu0).

Are the mini-batch and maxi-batch settings currently limited to 1, or is there another combination of settings for Windows 10 that will allow me to raise them?

@TommiNieminen
Collaborator

Sorry about the late reply, didn't get a notice about this.

The freeze-up with mini-batch settings above 1 seems to be caused by some buffering weirdness between spm_encode.exe and marian.exe, which are part of the MT pipe: when mini-batch is set above 1, the pipe clogs up when single lines are fed to a continuously open stdin stream (even though both spm_encode.exe and marian.exe handle single lines individually). This problem only affects the translation provider implementation (i.e. the part of the plugin that serves the translations in the editor and in the standard Trados Pretranslate batch task). The custom Fiskmö fine-tune and translate task actually uses a higher mini-batch size; it works there because all segments are translated as a single file batch. (Note that you need to segment the files before running the Fiskmö custom task. You can do that by opening the files in the editor and saving them, or by running the Pretranslate or Pseudo-translate tasks.)

So technically you can use higher mini-batch and maxi-batch settings, but doing so causes a lockup in the plugin due to the buffering problems with the MT command pipe, which is why the mini-batch setting is set to 1. This should be fixed, of course, but currently the plugin's translation provider part translates segment by segment in any case, so there's no benefit to using larger batches.
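To make the lockup described above concrete, here is a minimal Python sketch of one way such a stall can arise. It is only a toy: the child process is a hypothetical stand-in for a batching decoder, not marian.exe or spm_encode.exe, and the actual cause in the plugin's pipe may well be different. The stand-in holds back output until `batch` lines have arrived, so a caller that sends one segment and waits for the reply never gets one when `batch` is above 1.

```python
import queue
import subprocess
import sys
import threading

# Hypothetical stand-in for a batching decoder (NOT Marian itself): it emits
# output only once `batch` input lines have accumulated, or when stdin closes.
CHILD = r"""
import sys
batch = int(sys.argv[1])
pending = []
for line in sys.stdin:
    pending.append(line.strip().upper())
    if len(pending) >= batch:
        print("\n".join(pending), flush=True)
        pending = []
if pending:
    print("\n".join(pending), flush=True)
"""


def translate_one(batch, segment, timeout=5.0):
    """Send one segment to a long-lived child; return its reply, or None on timeout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(batch)],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    replies = queue.Queue()
    # Read in a background thread so the caller can give up instead of hanging.
    reader = threading.Thread(
        target=lambda: replies.put(proc.stdout.readline()), daemon=True
    )
    reader.start()
    proc.stdin.write(segment + "\n")
    proc.stdin.flush()  # the line really is delivered; the child just sits on it
    try:
        return replies.get(timeout=timeout).strip()
    except queue.Empty:
        return None  # the child is still waiting for more input
    finally:
        proc.stdin.close()  # closing stdin lets the child flush and exit
        proc.wait()
        reader.join()
        proc.stdout.close()
```

With `translate_one(1, "hello")` the reply `"HELLO"` comes back right away, while `translate_one(2, "hello", timeout=1.0)` returns `None`: the single line is stuck until the stream closes or a second line arrives. That matches the observation that translating all segments as one file batch works even with a larger mini-batch.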

I'm surprised by the quality differences compared with the decoder on a Linux system, since no quality-related settings should have changed if the same decoder.yml file is used. My guess is that there's something wrong with preprocessing on the Windows side, or possibly the Linux decoder is a different version with different defaults for some parameters that affect quality (a preprocessing problem is more likely). If I understand correctly, the custom model with the quality differences was trained through the Trados custom task? What was the base model?

@AHSteel
Author

AHSteel commented Nov 14, 2020

Hi Tommi,

Thanks for the detailed reply. You were right about it being a preprocessing problem -- I needed to change my preprocessing parameters when creating my own model (created on Linux with Marian 1.9, not via the Trados custom task) to match those in preprocess.sh (downloaded with the Opus model). Having done that (and having made sure the decoder.yml files on the Linux and Windows boxes used the same parameters), I'm now getting the same results from my custom model via the Fiskmö engine+plugin as I do when I query it directly on the Linux box.
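For anyone repeating the check that both decoder.yml files use the same parameters, a small Python sketch along these lines can list the differing lines. It is an illustration only: it compares the files as text rather than parsing YAML, the `config_diff` helper is a made-up name, and the option values in the usage example are invented.

```python
import difflib


def config_diff(linux_text, windows_text):
    """Return the lines that differ between two decoder.yml texts.

    Blank lines, trailing whitespace and '#' comments are ignored.
    Assumption: no option value contains a literal '#'.
    """

    def normalize(text):
        lines = []
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].rstrip()
            if line:
                lines.append(line)
        return lines

    diff = difflib.unified_diff(
        normalize(linux_text),
        normalize(windows_text),
        "linux",
        "windows",
        lineterm="",
        n=0,  # no context lines, only the changes
    )
    # Keep only added/removed lines; drop the file headers and @@ hunk markers.
    return [
        d
        for d in diff
        if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))
    ]


if __name__ == "__main__":
    # Invented example values; read the two real files from disk in practice.
    linux = "beam-size: 6\nnormalize: 1\nmini-batch: 1\n"
    windows = "beam-size: 12  # changed by hand\nnormalize: 1\nmini-batch: 1\n"
    for line in config_diff(linux, windows):
        print(line)
```

An empty result means the two files agree apart from comments and whitespace. It says nothing about preprocessing, which has to be compared separately against preprocess.sh.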

My next question is about SentencePiece implementation in Fiskmö ... You refer to spm_encode.exe above and I see that StartSentencePieceMtPipe.bat is included with the Fiskmö engine, so I assume SP is supported, but when I try calling a custom SP model (created on Linux with Marian 1.9) from Fiskmö, I get the following error:

Error: *.spm suffix in path C:/Users/.../vocab.esen.spm reserved for SentencePiece models, but support for SentencePiece is not compiled into Marian. Try to recompile after cmake .. -DUSE_SENTENCEPIECE=on [...]
Error: Aborted from marian::createSentencePieceVocab in C:\Users\niemi...\sentencepiece_vocab.cpp:283

Is there anything I can do to get Fiskmö to support SentencePiece?

@TommiNieminen
Collaborator

All the latest OPUS models that can be used with the local MT engine are actually segmented with SentencePiece, but they don't use Marian's inbuilt SP; instead, the text is preprocessed with the SP script. The reason for this is that the models all use guided alignment, which is not directly supported by the inbuilt SP (more here).

So Fiskmö (very soon to be rebranded completely as OPUS-CAT) has been built on the assumption that the models have been trained on texts that have been preprocessed with SP. The error message indicates that I also haven't compiled the included marian executable with SP support; I probably assumed it wouldn't be used. I'll change that in the next release (assuming SP support works on Windows without problems), since I can see that people might want to use models trained with the inbuilt SP support.
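Until such a release is available, a quick pre-flight check can flag models that would hit the error quoted earlier in the thread. The Python sketch below relies only on what that error message states, namely that a vocabulary path with the .spm suffix is treated as a SentencePiece model and needs SP support compiled into Marian. The helper name and the example paths are hypothetical, and the case-sensitive suffix match is an assumption.

```python
def vocabs_needing_builtin_sp(vocab_paths):
    """Return the vocab paths that Marian would treat as SentencePiece models.

    Based only on the quoted error message: the .spm suffix is reserved for
    SentencePiece models. Assumption: the suffix match is case-sensitive.
    """
    return [path for path in vocab_paths if path.endswith(".spm")]


if __name__ == "__main__":
    # Hypothetical paths for illustration.
    vocabs = ["C:/models/custom/vocab.esen.spm", "C:/models/opus/vocab.yml"]
    for path in vocabs_needing_builtin_sp(vocabs):
        print(f"{path}: needs a marian build with SentencePiece support")
```

A model whose vocabularies all pass this check without being listed can be expected to follow the other route described above, where the text is segmented by an external SP step before it reaches the decoder.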

@AHSteel
Author

AHSteel commented Nov 16, 2020

Thanks for the explanation, Tommi. Look forward to the next release.

@TommiNieminen changed the title from "mini-batch and maxi-batch settings (Fiskmö engine + Trados plugin)" to "Compile Marian with SP integration" Apr 30, 2024
@TommiNieminen added the enhancement (New feature or request) label Apr 30, 2024

2 participants