
How to generate the mapping matrices for the ELMo of my own language. #9

Open
scofield7419 opened this issue Sep 27, 2019 · 6 comments



scofield7419 commented Sep 27, 2019

I've trained ELMo weights for several other languages (e.g., Finnish, Chinese, ...), and I'm considering contributing them to enrich your repo. Now I want to align them (including each LSTM layer) to the English space, just as you did.

But it seems you did not release the code for generating such an alignment matrix.

P.S.: Note that you only released the code for generating the anchors, and I believe that has nothing to do with the alignment matrix itself.
Or, if I've misunderstood the approach, please give me some hints and correct me.

So, may I ask for your prompt reply on this issue?
Thanks a lot.

@TalSchuster
Owner

Hi @scofield7419
That's great! I'm sure people will find the models and alignments for more languages useful.

The supervised alignment computation was done with the MUSE repository. Their repo is not installable with pip, so I think the best way is to run it following their instructions. I can create a short bash script if it helps.

Use their provided command line, for example:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

Let me know if that works.
When you have the alignment, you are welcome to submit a PR with the new models and matrices. Please also report the word translation accuracies from the MUSE script to make sure that the alignment worked.
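Since the same command needs to be run once per ELMo layer, it can be scripted. A minimal sketch that just builds the MUSE `supervised.py` argv for each layer (the `anchors/...` paths and the `fi` target language are illustrative assumptions, not files this repo or MUSE provides):

```python
# Build MUSE supervised.py commands for each ELMo layer.
# The anchor file paths below are hypothetical; adapt them to your layout.

def muse_command(layer, src_lang="en", tgt_lang="fi"):
    """Return the argv list for aligning one ELMo layer with MUSE."""
    return [
        "python", "supervised.py",
        "--src_lang", src_lang,
        "--tgt_lang", tgt_lang,
        "--src_emb", f"anchors/{src_lang}/avg_embeds_{layer}.txt",
        "--tgt_emb", f"anchors/{tgt_lang}/avg_embeds_{layer}.txt",
        "--n_refinement", "5",
        "--dico_train", "default",
    ]

# ELMo has three output layers (0: token layer, 1 and 2: LSTM layers).
for layer in range(3):
    print(" ".join(muse_command(layer)))
```

Each command would then be run from inside a local clone of the MUSE repo (e.g., via `subprocess.run`).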


scofield7419 commented Sep 27, 2019

Hi @TalSchuster , thank you for your reply.

As for MUSE, I'm actually quite familiar with it (I use it frequently) ^_^

If I guess correctly, the first thing I should do to align one language (say Finnish) to English is to use get_anchors.py to generate avg_embeds_{%i}.txt (containing the embeddings of each anchor word over the whole vocabulary) for the i-th LSTM layer of ELMo, for both English and Finnish.
Then, with MUSE, align the corresponding embeddings of the i-th ELMo layer for English and Finnish, and output best_mapping.pth for layers [0, 1, 2] one by one.

Is all of the above correct?
Thanks again for everything.
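The anchor step described above can be illustrated with a toy sketch: an anchor embedding for a word at layer i is (roughly) the average of that word's contextual vectors over all its occurrences in a corpus. The dict-of-occurrences input below is a stand-in for real ELMo layer outputs, not the actual get_anchors.py interface:

```python
import numpy as np

def average_anchors(occurrences):
    """Average contextual vectors per word to get one anchor per word.

    occurrences: dict mapping word -> list of layer-i vectors, one per
    occurrence of that word in the corpus (stand-in for ELMo outputs).
    """
    return {w: np.mean(np.stack(vecs), axis=0) for w, vecs in occurrences.items()}

# Toy example: two occurrences of "bank" with different contextual vectors.
occ = {"bank": [np.array([1.0, 0.0]), np.array([0.0, 1.0])]}
anchors = average_anchors(occ)
print(anchors["bank"])  # [0.5 0.5]
```

The resulting per-word anchors are what get fed to MUSE as ordinary (static) word vectors.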

@TalSchuster
Owner

Yes, that sounds correct.
I've uploaded the anchors for the provided English model, so that will save you the time of extracting them yourself. There's a link now in the main README.


scofield7419 commented Sep 27, 2019

BTW, there's another thing I'd like to ask:

If I want to generate multilingual ELMo embeddings (I mean a real jointly trained one, like multilingual BERT, not obtained through alignment), can I just blend a large enough number of sentences from different languages (say, 10 languages) as the training data for ELMo?

Specifically, I could prepare a considerable number of sentences for each language (say 50M sentences per language). Would this work for producing a real multilingual ELMo, and is that amount of training data per language sufficient?
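One simple way to blend the monolingual corpora for such joint training is to interleave their sentences into one training stream. A toy sketch (in-memory lists stand in for the 50M-sentence corpora; a real pipeline would stream files and probably also shuffle globally):

```python
from itertools import chain, zip_longest

def interleave_corpora(corpora):
    """Round-robin interleave sentences from several monolingual corpora.

    corpora: list of sentence lists, one per language. Shorter corpora
    simply run out earlier.
    """
    _SKIP = object()  # sentinel marking exhausted corpora
    mixed = chain.from_iterable(zip_longest(*corpora, fillvalue=_SKIP))
    return [s for s in mixed if s is not _SKIP]

en = ["the cat sat .", "hello world ."]
fi = ["kissa istui ."]
print(interleave_corpora([en, fi]))
# ['the cat sat .', 'kissa istui .', 'hello world .']
```

How the languages are proportioned (uniform vs. corpus-size-weighted sampling) is a real design choice in joint multilingual training; this sketch just shows the uniform round-robin case.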

@TalSchuster
Owner

For joint training, you can check this paper.
In short: on average it can give better results, but the effectiveness varies across languages. Still, it is usually worth learning and applying an alignment even after joint training, since training jointly imposes no strong constraint that forces the cross-lingual representations to be aligned.
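To make that last point concrete: the alignment MUSE learns is a linear map W (saved in best_mapping.pth), and applying it to a source-space vector is just a matrix multiplication. A numpy sketch, where the 2-D rotation standing in for W is a made-up example rather than a learned mapping:

```python
import numpy as np

def align(W, x):
    """Map a source-space vector x into the target space: x_aligned = W @ x."""
    return W @ x

# Toy stand-in for a learned mapping: a 90-degree rotation in 2-D
# (orthogonal, like the refined mappings MUSE produces).
theta = np.pi / 2
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, 0.0])
print(align(W, x))  # approximately [0., 1.]
```

The same multiplication would be applied to every contextual vector of a given ELMo layer, using that layer's own mapping.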

@scofield7419
Author

Thank you for your response! : )
