Create new notebooks that do not rely on JW300 #200

Open
cdleong opened this issue Oct 21, 2021 · 7 comments
Comments

@cdleong
Contributor

cdleong commented Oct 21, 2021

Slack discussion: https://masakhane-nlp.slack.com/archives/C01JAP67HRV/p1634844082006400


https://github.com/joeynmt/joeynmt/blob/master/joey_demo.ipynb is the Tatoeba example.

@cdleong
Contributor Author

cdleong commented Oct 22, 2021

One suggestion from the Slack discussion is to split the new notebook code into two parts:

  • One notebook that takes a HuggingFace dataset as input at the top and proceeds from there to train a JoeyNMT model. This might make things a lot easier for people: if they can get their data into the HuggingFace Dataset format, we can show them how to train.
  • One notebook that shows people how to get there: it loads data from various file types or sources (.csv, paired text files, directly from the HuggingFace hub) into the HuggingFace format. See https://huggingface.co/docs/datasets/loading_datasets.html
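
As a rough illustration of the second notebook idea, here is a minimal sketch of turning paired text files into something the HuggingFace `datasets` library can load. The helper names `read_lines`, `read_parallel_files` and `to_hf_dataset` are made up for this example, and the single `translation` column holding one dict per sentence pair is an assumption (it mirrors how many translation datasets on the hub are laid out), not something decided in this thread.

```python
def read_lines(path):
    """Return the lines of a UTF-8 text file without their trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_parallel_files(src_path, trg_path, src_lang, trg_lang):
    """Pair line i of the source file with line i of the target file.

    Returns a column dict with one "translation" column (assumed layout).
    """
    src_lines = read_lines(src_path)
    trg_lines = read_lines(trg_path)
    if len(src_lines) != len(trg_lines):
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} source vs {len(trg_lines)} target"
        )
    pairs = [{src_lang: s, trg_lang: t} for s, t in zip(src_lines, trg_lines)]
    return {"translation": pairs}


def to_hf_dataset(columns):
    """Wrap the column dict in a HuggingFace Dataset."""
    # Imported lazily so the file-reading helpers work without the library.
    from datasets import Dataset

    return Dataset.from_dict(columns)


# For a .csv file, the library's own loader could be used instead, e.g.
#   from datasets import load_dataset
#   ds = load_dataset("csv", data_files="pairs.csv")  # "pairs.csv" is a placeholder
```

Something like `to_hf_dataset(read_parallel_files("train.en", "train.fr", "en", "fr"))` would then be the input to the training notebook; the file names and language codes here are placeholders.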

@cdleong
Contributor Author

cdleong commented Oct 22, 2021

See this slack discussion: https://masakhane-nlp.slack.com/archives/C01GF5XJ0TF/p1634863777007500?thread_ts=1634844471.007300&cid=C01GF5XJ0TF

@cdleong
Contributor Author

cdleong commented Oct 22, 2021

https://colab.research.google.com/drive/1RWOle7RHy_wq0uGWxmAq1ZfmEQIFsCHj#scrollTo=h1Ddy4_AOKdm could make for a starting point. I think this notebook shows how to download a HuggingFace dataset and write it out to files in the format JoeyNMT expects.
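
For readers who cannot open the Colab, the general idea might look roughly like the sketch below. It is not taken from that notebook. It assumes each row looks like `{"translation": {"en": "...", "fr": "..."}}` and that JoeyNMT reads plain parallel text files sharing a prefix and ending in the language code (for example `train.en` and `train.fr`), which matches the JoeyNMT 1.x data config as far as I know. The function name `write_parallel_files` is made up for this example.

```python
from pathlib import Path


def write_parallel_files(rows, out_prefix, src_lang, trg_lang):
    """Write rows to <out_prefix>.<src_lang> and <out_prefix>.<trg_lang>.

    `rows` is any iterable of dicts with a "translation" entry (assumed layout).
    Returns the number of sentence pairs written.
    """
    Path(out_prefix).parent.mkdir(parents=True, exist_ok=True)
    src_path = f"{out_prefix}.{src_lang}"
    trg_path = f"{out_prefix}.{trg_lang}"
    count = 0
    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(trg_path, "w", encoding="utf-8") as trg_f:
        for row in rows:
            pair = row["translation"]
            # Collapse all whitespace runs (including embedded newlines) to a
            # single space so that line i in both files stays aligned.
            src = " ".join(pair[src_lang].split())
            trg = " ".join(pair[trg_lang].split())
            if not src or not trg:
                continue  # skip pairs where either side is empty
            src_f.write(src + "\n")
            trg_f.write(trg + "\n")
            count += 1
    return count


# With the HuggingFace `datasets` library installed, rows could come from e.g.
#   from datasets import load_dataset
#   rows = load_dataset("<dataset name>", "<config>", split="train")  # placeholders
#   write_parallel_files(rows, "data/train", "en", "fr")
```

The same function could be called once per split (train, dev, test) with a different prefix each time.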

@whaiao

whaiao commented Dec 12, 2022

@cdleong if this is still relevant, I would like to work on it.

@cdleong
Contributor Author

cdleong commented Dec 12, 2022 via email

@whaiao

whaiao commented Dec 12, 2022

Alright, so I have started with the notebook and will be done by the end of next week. I have to prepare for an exam next Wednesday, but I will be wrapping up the notebook.

/self-assign

@smyja

smyja commented Sep 28, 2023

> Alright, so I have started with the notebook and will be done by the end of next week. I have to prepare for an exam next Wednesday, but I will be wrapping up the notebook.
>
> /self-assign

Any update?

3 participants