Weird results when translating english to finnish (using EasyNMT with opus-mt) #55

Open
kauttoj opened this issue Feb 4, 2022 · 2 comments

kauttoj commented Feb 4, 2022

While translating English to Finnish using your model via EasyNMT, I noticed something weird. Check this code and the results.

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

text='''Religion and theology is the study of religious beliefs, concepts, symbols, expressions and texts of spirituality.
Programmes and qualifications with the following main content are classified here:
Religious history
Study of sacred books
Study of different religions
Theology
=== Inclusions
Included in this detailed field are programmes for children and young people.'''

print(model.translate(text,target_lang='fi'))

The output is:

'Uskonto ja teologia tutkivat uskonnollisia käsityksiä, käsitteitä, symboleja, ilmaisuja ja tekstejä hengellisyydestä.
Ohjelmat ja tutkinnot, joiden pääsisältö on seuraava:
Uskonnollinen historia
Pyhien kirjojen tutkiminen
Eri uskontojen tutkiminen
Teologia
Suomennos: Michael T. Francis Pinmontagne SUBHEAVEN.ORG
Tähän yksityiskohtaiseen kenttään kuuluvat lasten ja nuorten ohjelmat.'

So "=== Inclusions" is translated into "Suomennos: Michael T. Francis Pinmontagne SUBHEAVEN.ORG".

What is going on here? Is this a problem with the OPUS-MT model or with how EasyNMT uses it?

PS. The sample text is from the ESCO ontology.
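One way to narrow down a report like this is to translate each line on its own and flag outputs that are far longer than their source. The sketch below is only an illustration: `translate_line_by_line`, `flag_suspicious`, and the `max_ratio` threshold are hypothetical names and values, not part of EasyNMT or OPUS-MT, and the length heuristic is an assumption that happens to catch the line above.

```python
from typing import Callable, List, Tuple


def translate_line_by_line(
    text: str, translate: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Translate each non-empty line separately and pair it with its source."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines; there is nothing to translate
        pairs.append((line, translate(line)))
    return pairs


def flag_suspicious(
    pairs: List[Tuple[str, str]], max_ratio: float = 3.0
) -> List[Tuple[str, str]]:
    """Return pairs whose output is more than max_ratio times longer than the source.

    This is a crude, hypothetical heuristic for spotting hallucinated output.
    """
    return [(src, out) for src, out in pairs if len(out) > max_ratio * len(src)]


# Possible usage with the model from the snippet above (requires easynmt):
# pairs = translate_line_by_line(text, lambda s: model.translate(s, target_lang='fi'))
# for src, out in flag_suspicious(pairs):
#     print(repr(src), '->', repr(out))
```

Passing the translator in as a callable keeps the helper independent of any particular model, so the same check can be rerun against a different OPUS-MT model.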

Member

jorgtied commented Feb 7, 2022

Yes, that looks a bit weird. The model on Hugging Face does not seem to handle that kind of input well. A newer OPUS-MT model, at least, no longer does this. You can try it here: https://translate.ling.helsinki.fi/ui/memad
It should be from this model: https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+bt-2021-12-08.zip
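If switching models is not an option, a possible workaround (not suggested in this thread, so treat it as an assumption) is to keep markup such as "===" away from the model: split off a leading run of punctuation, translate only the remaining text, and reattach the marker afterwards. The helper name `translate_keeping_marker` and the regular expression are hypothetical.

```python
import re
from typing import Callable

# Group 1: optional leading punctuation plus following whitespace, e.g. "=== ".
# Group 2: the rest of the line, which is the part worth translating.
_MARKER = re.compile(r"^([^\w\s]*\s*)(.*)$")


def translate_keeping_marker(line: str, translate: Callable[[str], str]) -> str:
    """Translate a single line while passing leading markup through unchanged."""
    marker, body = _MARKER.match(line).groups()
    if not body:
        return line  # the line is only markup or whitespace; leave it as-is
    return marker + translate(body)
```

With a real translator, "=== Inclusions" would be sent to the model as just "Inclusions" and the "=== " prefix restored on the result. This only handles markup at the start of a line.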

Author

kauttoj commented Feb 8, 2022

Thanks for the reply. I was able to solve the problem by using the new Tatoeba model.

In case someone has the same problem, follow these instructions to convert Tatoeba models into Hugging Face format:
https://github.com/huggingface/transformers/tree/master/scripts/tatoeba

Then you can use the model with this code (copied from here):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_MODEL)
# Initialize the model
model = AutoModelForSeq2SeqLM.from_pretrained(PATH_TO_CONVERTED_MODEL)
# Tokenize text
text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer([text], return_tensors='pt')
# Perform translation and decode the output
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
# Print translated text
print(translated_text)

PS. Conversion worked only for the "eng-fin" model. The "fin-eng" conversion failed: the script apparently tried to report a dimension mismatch but hit a KeyError while building the message: "raise ValueError(f"Hidden size {hidden_size} and configured size {cfg['dim_emb']} mismatched or not 512") KeyError: 'dim_emb'"
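For multi-line input like the ESCO sample at the top of the thread, the single-sentence snippet can be extended to translate every line in one batch. This is a minimal sketch, assuming `tokenizer` and `model` were loaded with `AutoTokenizer` and `AutoModelForSeq2SeqLM` as shown above. `translate_lines` is a hypothetical helper name, and splitting on line breaks is an assumption about how the text should be segmented.

```python
from typing import List


def translate_lines(text: str, tokenizer, model) -> List[str]:
    """Translate every non-empty line of `text` in a single batch."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # padding=True pads all lines to a common length so they fit in one tensor
    batch = tokenizer(lines, return_tensors="pt", padding=True)
    output_ids = model.generate(**batch)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Possible usage, with tokenizer and model loaded from the converted model:
# for line in translate_lines(text, tokenizer, model):
#     print(line)
```

Blank lines are dropped before batching, so the returned list is aligned with the non-empty input lines only.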
