Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first version of save_to_remote for HF from FarmReader #2618

Merged
merged 18 commits into from
Jul 4, 2022

Conversation

TuanaCelik
Copy link
Member

Fixes #2416

This draft PR introduces the save_to_remote() function to the FarmReader
It converts a model to the transformer models format using the convert_to_transformer method from the AdaptiveModel class.
I have tested this by fine-tuning a model and then calling save_to_remote - then loading that same model back into FarmReader and using it in an ExtractiveQuestionAnswering pipeline

(PS: sentence-transformers implements this in their own repo and I was able to reuse their method of tracking files for lfs)

To do and test

  • I am unsure how this would behave in the case the repo already exists on HF. The create_repo method from huggingface_hub has an exists_ok argument which I've set to True but this still needs testing
  • The draft/default model card
  • Testing that this structure of model is ok and can generalise to other types of models(?) that could be trained with FarmReader
  • Once we're happy with this, we would need the same for the other two classes: DensePassageRetriever and TableTextRetriever

Status (please check what you already did):

@TuanaCelik TuanaCelik requested a review from tstadel May 31, 2022 23:17
@ZanSara
Copy link
Contributor

ZanSara commented Jun 8, 2022

@TuanaCelik I recently merged a drastic change to the CI, so I recommend you merge master before proceeding with this one. Thanks 😊

@tstadel
Copy link
Member

tstadel commented Jun 8, 2022

@TuanaCelik do you have some test notebook / code to try this out?

@tstadel
Copy link
Member

tstadel commented Jun 9, 2022

So usage would be:

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
data_dir = "data"

reader.train(data_dir=data_dir, train_filename="squad_small.json", use_gpu=True, n_epochs=1, save_dir="my_model", batch_size=4)

reader.save_to_remote(model_name='model_from_haystack', private=True)

And reloading the model from HF

new_reader = FARMReader(model_name_or_path='HF_USER_NAME/model_from_haystack', use_auth_token=True)

Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks already pretty good! Let's not save FARM-related stuff and add some docstrings. Then it's good to go for a first quick-win version.

haystack/nodes/reader/farm.py Show resolved Hide resolved
haystack/nodes/reader/farm.py Outdated Show resolved Hide resolved
haystack/nodes/reader/farm.py Show resolved Hide resolved
@tstadel
Copy link
Member

tstadel commented Jun 9, 2022

If the repo already exists it will overwrite the model there. So as a next iteration, adding a version parameter would be an option.

@TuanaCelik
Copy link
Member Author

Thanks @tstadel - I'll add the docstrings and change the self.save as suggested. 2 things I would like to clarify:

  1. For this PR shall we keep it condensed to adding this functionality to just FARMReader and leave the others (DensePassageRetriever and TableTextRetriever) for a second iteration?
  2. You mentioned 'as a next iteration, adding a version parameter would be an option' -> again, in this PR or a following one?

In short, these could be smaller PRs and changes later. What would be best? Do all in one or merge this simple version with just FARMReader first (after some minor changes + docstrings)

@tstadel
Copy link
Member

tstadel commented Jun 9, 2022

Thanks @tstadel - I'll add the docstrings and change the self.save as suggested. 2 things I would like to clarify:

  1. For this PR shall we keep it condensed to adding this functionality to just FARMReader and leave the others (DensePassageRetriever and TableTextRetriever) for a second iteration?
  2. You mentioned 'as a next iteration, adding a version parameter would be an option' -> again, in this PR or a following one?

In short, these could be smaller PRs and changes later. What would be best? Do all in one or merge this simple version with just FARMReader first (after some minor changes + docstrings)

I'd say let's do all of the extensions in separate PRs. This PR would already create value if we released it tomorrow.

@tstadel
Copy link
Member

tstadel commented Jun 22, 2022

@TuanaCelik do you have time to work on this? If not I could take over from here...

@TuanaCelik
Copy link
Member Author

@tstadel - working on it this week so can be left with me :)

@TuanaCelik
Copy link
Member Author

@ZanSara @tstadel - I've made changes based on the comments. 2 things:
I've referenced sentence-transformers repo in the docstring but this can be moved into a comment.
I can imagine moving the git-lfs check and tracking section to utils to be re-used in the other save_to_remote'esque functions we'll have for the other nodes. Could be moved when needed and we're adding more of these functions

@TuanaCelik TuanaCelik marked this pull request as ready for review June 22, 2022 17:01
@TuanaCelik TuanaCelik requested a review from tstadel June 22, 2022 17:03
Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good that pylint spotted this during CI: we should avoid positional args when calling create_repo. Please switch to keyword args.

haystack/nodes/reader/farm.py Outdated Show resolved Hide resolved
@tstadel
Copy link
Member

tstadel commented Jul 1, 2022

Sorry to bother you again. I executed my test script again and I got the following warning:

FutureWarning: `name` and `organization` input arguments are deprecated and will be removed in v0.10. Pass `repo_id` instead.

Let's briefly discuss whether we want to use repo_id (would be e.g. "deepset/my_model_name") in the first place instead of model_name (would be "my_model_name") and hf_organization (would be "deepset") params. I think it would be easier to just have the repo_id param. We don't need to add any additional conversion logic when v0.10 is finally out.
@TuanaCelik what do you think?

@julian-risch
Copy link
Member

Please also have a look at this comment on storing models in FARM format on the model hub and failing loading: https://github.com/deepset-ai/haystack-hub-api/issues/977#issuecomment-1173439438

@tstadel
Copy link
Member

tstadel commented Jul 4, 2022

Please also have a look at this comment on storing models in FARM format on the model hub and failing loading: deepset-ai/haystack-hub-api#977 (comment)

@julian-risch we should be safe here as we first convert the model to transformers and then upload.

@TuanaCelik
Copy link
Member Author

We changed the huggingface-hub version requirements in setup.cfg with @tstadel to acomodate both for audio and FARMReader's save_to_remote()

Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tstadel tstadel merged commit 2a8b129 into master Jul 4, 2022
@tstadel tstadel deleted the hf-integration-tuana branch July 4, 2022 13:39
Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
* first version of save_to_remote for HF from FarmReader

* Update Documentation & Code Style

* Changes based on comments

* Update Documentation & Code Style

* imports order

* making small changes to pydoc

* indent fix

* Update Documentation & Code Style

* keyword arguments instead of positional

* Changing to repo_id

huggingface-hub package would have to be v0.5 or higher - checking how to handle with Thomas

* Update Documentation & Code Style

* adding huggingface-hub dependency 0.5 or above

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Autoupload models to huggingface hub after training
4 participants