
[WIP] DataCollatorForTextInfilling #12370

Conversation

@ionicsolutions (Contributor) commented Jun 26, 2021

What does this PR do?

A DataCollator for the BART "Text Infilling" pre-training task.

The implementation borrows ideas from fairseq's more complex DenoisingDataset.

Fixes #5428
(Addresses #5096)
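
For readers skimming this thread, here is a minimal sketch of the idea (an illustration under assumptions, not this PR's actual code: the class name, defaults, and simplifications are mine). Per the BART paper, text infilling samples span lengths from a Poisson distribution (λ = 3) and replaces each sampled span with a single `<mask>` token; the model is trained to reconstruct the original sequence:

```python
# Hypothetical sketch of a text-infilling collator (NOT the code in this PR).
from dataclasses import dataclass

import numpy as np
from transformers import PreTrainedTokenizerBase


@dataclass
class TextInfillingCollatorSketch:
    tokenizer: PreTrainedTokenizerBase
    mask_ratio: float = 0.3      # target fraction of tokens to corrupt
    poisson_lambda: float = 3.0  # span-length distribution, as in the BART paper

    def __call__(self, examples):
        batch = self.tokenizer.pad(examples, return_tensors="pt")
        # Labels are the uncorrupted ids; ignore padding in the loss.
        labels = batch["input_ids"].clone()
        labels[labels == self.tokenizer.pad_token_id] = -100

        corrupted = [self._mask_spans(ids) for ids in batch["input_ids"].tolist()]
        # Re-pad: replacing a whole span with one <mask> shortens sequences.
        inputs = self.tokenizer.pad({"input_ids": corrupted}, return_tensors="pt")
        return {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"],
            "labels": labels,
        }

    def _mask_spans(self, ids):
        # Simplified: a real implementation must skip special and padding
        # tokens and avoid overlapping spans.
        num_to_mask = int(len(ids) * self.mask_ratio)
        masked = 0
        while masked < num_to_mask:
            span = max(1, int(np.random.poisson(self.poisson_lambda)))
            start = int(np.random.randint(0, max(1, len(ids) - span)))
            ids[start : start + span] = [self.tokenizer.mask_token_id]
            masked += span
        return ids
```

Handling special tokens, overlapping spans, and length-0 spans (which, per the paper, insert a `<mask>` token) is where most of the complexity in fairseq's DenoisingDataset lives.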

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ionicsolutions (Contributor, Author)

It's still on my agenda to brush this up.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this Aug 30, 2021
@salrowili commented Mar 19, 2022

This is a wonderful effort. Any update on this? Also, if you could add a TF call, that would be great.

@ionicsolutions (Contributor, Author)

@salrowili Sadly, I didn't find time for it. I'm also not sure whether this still fits the library; there may have been updates to the data collators in the meantime.

I'm still interested in working on this, but realistically I won't have time unless I need it for an ongoing project. Would you be up for a collaboration?

@salrowili commented Mar 30, 2022

@ionicsolutions Thanks for replying. What about BartForConditionalGeneration? Is it enough to train BART from scratch, as in this example: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_mlm_flax.py#L241? However, as you can see, it uses FlaxDataCollatorForLanguageModeling, which I'm not sure implements the text-infilling task.

Maybe you can also check this repo: https://github.com/cosmoquester/transformers-bart-pretrain. He has already implemented the text-infilling task, but with a TensorFlow dataset. However, that repo does not work with HF > 4.11 because of some logits issue. Maybe you could contact the author, ask his permission to use his function, and see whether he is willing to collaborate; he is probably better placed than me to push this project forward. What I can offer is to test any function you develop at scale (e.g., pre-training BART-large from scratch), see how it performs, and share a Colab example with the research community. What I like about BART over T5 is its inference time and memory usage during fine-tuning, and it can also achieve SOTA on SQuAD and GLUE in addition to generative tasks (e.g., summarization), so I think this project is much needed by the research community.
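
For context: FlaxDataCollatorForLanguageModeling in run_mlm_flax.py performs BERT-style per-token masking (MLM), not BART's span infilling. Wiring a custom infilling collator into a standard PyTorch training loop would look roughly like this (a sketch: TextInfillingCollatorSketch is the hypothetical class sketched earlier in this thread, not a transformers API, and `tokenized_dataset` is assumed to be a dataset of tokenized examples):

```python
# Usage sketch (assumptions: TextInfillingCollatorSketch from the sketch
# above; `tokenized_dataset` yields {"input_ids": [...]} examples).
from transformers import (
    BartForConditionalGeneration,
    BartTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bart-infilling", per_device_train_batch_size=8),
    train_dataset=tokenized_dataset,
    data_collator=TextInfillingCollatorSketch(tokenizer=tokenizer),
)
trainer.train()
```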

@jbmaxwell

@salrowili I'm also interested in infilling generation and was wondering if you've made any progress. I see your last post was three weeks ago, so I'm wondering whether you've found an alternative approach?

@salrowili commented Apr 22, 2022

@jbmaxwell I tried out the Flax implementation of BART, XLA with TPU, and Keras BART (https://github.com/cosmoquester/transformers-bart-pretrain). Keras BART was the best model among those, and that is why I was looking for text infilling. I also think the BART implementation in the Hugging Face library is not optimal, especially for BART-large. I am now working with fairseq and torch_xla, and I think this will be the best of all the variants I have tried. I suggest you ask Google for TPU access (https://sites.research.google/trc/) and try fairseq with XLA and BART, but fix the dynamic shape by using a pre-defined input shape as in my fork: https://github.com/salrowili/fairseq (see the latest commits for the changes I made). With a TPUv3-8, BART reaches a speed of ~100K wps, but you need to keep the log interval at 10 and num_bucket=5. I ran BART on my 3090 and it gave me a speed of 30K wps. 100K wps translates to ~20K steps/day, which is slow compared to BERT with TF (~125K steps/day) at a batch size of 256 and a max sequence length of 512; that means it will take around one month to finish 500K steps with BART (:

If you find an alternative solution, or you are willing to improve the BART implementation with text infilling and JAX/TF, it would be good if you could share your solution as I have shared mine (:
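
For readers who want to follow this route, the fairseq invocation would look roughly like the following (a hypothetical sketch assembled from the flags mentioned above; the data path, architecture choice, and remaining hyperparameters are placeholders, and the exact behavior in the linked fork is not verified):

```bash
# Hypothetical fairseq command; --num-batch-buckets fixes dynamic shapes on TPU.
fairseq-train /path/to/binarized-data \
  --task denoising --arch bart_large \
  --tpu --num-batch-buckets 5 \
  --log-interval 10 \
  --max-tokens 4096
```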

@jbmaxwell

I hadn't seen this before; thanks for the link!
I'll give it a try.
I'm working with compact, non-natural-language inputs and small datasets (for now), and I generally reduce model sizes significantly from the stock versions, so I'm not too worried about training resources. Faster is better, of course, but not a deal-breaker for me.

Successfully merging this pull request may close this issue: How to use (and preferably finetune) BART for text infilling?