
Added max_seq_length and batch_size params to EmbeddingRetriever #1817

Merged: 10 commits into deepset-ai:master, Nov 29, 2021
Conversation

@AhmedIdr (Contributor) commented Nov 26, 2021

Added the option to choose max_seq_length and batch_size when using EmbeddingRetriever.
Related to #1793
For RetriBERT, I am not totally sure whether I added the params in the right places.
Additionally, added a progress_bar to the FAISS write_documents method, since it is hard to keep track of the progress when working with large datasets.
Related to #1110

I don't usually make pull requests, so I hope I didn't mess anything up.
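
A minimal usage sketch of the new parameters (the import paths, store setup, and model name are assumptions for the Haystack version at the time, not part of this PR):

```python
# A sketch, assuming pre-1.0 Haystack import paths and an illustrative model;
# only the batch_size and max_seq_len keyword arguments come from this PR.
from haystack.document_store import FAISSDocumentStore
from haystack.retriever.dense import EmbeddingRetriever

document_store = FAISSDocumentStore()  # assumed setup, defaults to a local SQLite backend
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="deepset/sentence_bert",  # illustrative model choice
    model_format="farm",
    batch_size=32,    # new in this PR: how many texts are embedded per forward pass
    max_seq_len=512,  # new in this PR: maximum input length in tokens
)
```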

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

```diff
@@ -55,7 +55,7 @@ def __init__(
     retriever.embedding_model, revision=retriever.model_version, task_type="embeddings",
     extraction_strategy=retriever.pooling_strategy,
     extraction_layer=retriever.emb_extraction_layer, gpu=retriever.use_gpu,
-    batch_size=4, max_seq_len=512, num_processes=0,use_auth_token=retriever.use_auth_token
+    batch_size=retriever.batch_size, max_seq_len=retriever.max_seq_len,, num_processes=0,use_auth_token=retriever.use_auth_token
```
A Member commented on the diff:
There is a superfluous , here.
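
For reference, the fragment without the duplicated comma (a sketch of the fix only; everything else exactly as in the diff above):

```python
# Single comma after max_seq_len; the rest of the line is unchanged.
batch_size=retriever.batch_size, max_seq_len=retriever.max_seq_len, num_processes=0,use_auth_token=retriever.use_auth_token
```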

@julian-risch (Member) commented:
Hi @AhmedIdr, thanks for raising this pull request. I had a first brief look at your code and started our tests. That way, I identified a syntax error with a superfluous comma. Could you please fix that so that we can run the tests? Thank you! I will review the code more thoroughly once the tests are running. 👍

@AhmedIdr (Contributor, Author) commented Nov 26, 2021

Hey @julian-risch,
Sorry, I didn't notice the typos before creating the pull request. I tested the changes on another machine, so I didn't pay attention to some typos when copying and pasting. I hope everything is working now.

```diff
@@ -611,7 +611,7 @@ class EmbeddingRetriever(BaseRetriever)
 #### \_\_init\_\_

 ```python
-| __init__(document_store: BaseDocumentStore, embedding_model: str, model_version: Optional[str] = None, use_gpu: bool = True, model_format: str = "farm", pooling_strategy: str = "reduce_mean", emb_extraction_layer: int = -1, top_k: int = 10, progress_bar: bool = True, devices: Optional[List[Union[int, str, torch.device]]] = None, use_auth_token: Optional[Union[str,bool]] = None)
+| __init__(document_store: BaseDocumentStore, embedding_model: str, model_version: Optional[str] = None, use_gpu: bool = True, batch_size: int = 16, max_seq_len: int = 128, model_format: str = "farm", pooling_strategy: str = "reduce_mean", emb_extraction_layer: int = -1, top_k: int = 10, progress_bar: bool = True, devices: Optional[List[Union[int, str, torch.device]]] = None, use_auth_token: Optional[Union[str,bool]] = None)
```
@julian-risch (Member) commented on the diff Nov 26, 2021:
I can see that the previous default max_seq_len was 512. Is there any particular reason why you would prefer 128 as the default? If not I'd say we keep 512. (The change in the default value is probably also why some test cases fail right now but I could fix that if we decide to use 128 instead of 512.)

@AhmedIdr (Contributor, Author) replied:

No, there was no reason; we can set it back to 512 and the batch size back to 32.

AhmedIdr and others added 2 commits November 27, 2021 19:45
Changed default batch_size and max_seq_len in EmbeddingRetriever
@julian-risch (Member) commented:

Hi @AhmedIdr, the code looks good to me so far. Thanks for changing the default parameters. I just tried your changes in this colab notebook: https://colab.research.google.com/github/AhmedIdr/haystack/blob/add-params/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb, but I couldn't see the progress bar when running `write_documents()`. Have you tried that in a notebook?
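
A minimal way to reproduce that check in a notebook (a sketch reusing the document_store from the earlier example; the document field name is an assumption for the pre-1.0 Haystack API):

```python
# Write enough documents that the progress bar added in this PR is visible.
docs = [{"text": f"This is document number {i}."} for i in range(10_000)]
document_store.write_documents(docs)  # should now render a tqdm progress bar
```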

@AhmedIdr (Contributor, Author) commented Nov 29, 2021

Hey @julian-risch, no, I didn't; I only tried it using a script. But it is implemented similarly to the other methods, as suggested in #1110. I'll try to run it in a notebook and see if I can figure out what the problem is.

@AhmedIdr (Contributor, Author) commented Nov 29, 2021

Changing `from tqdm.auto import tqdm` back to `from tqdm import tqdm` seems to solve the problem.

@julian-risch (Member) commented Nov 29, 2021

My mistake, I didn't change the pip install statement in the notebook to your branch. It needs to be `!pip install git+https://github.com/AhmedIdr/haystack.git@add-params` 😅 or even `!pip install git+https://github.com/AhmedIdr/haystack.git@09f89bc` if you would like to check your previous commit from 2 days ago.

@AhmedIdr (Contributor, Author) commented:

Ah, I see, so using `tqdm.auto` also works. Do you prefer to change it back to `tqdm.auto` or keep `tqdm`?

@julian-risch (Member) commented:

> Ah, I see, so using `tqdm.auto` also works. Do you prefer to change it back to `tqdm.auto` or keep `tqdm`?

Yes, please change it back to `tqdm.auto`. Thank you.
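
For context, a minimal sketch of the difference being discussed (standard tqdm behavior, not code from this PR): `tqdm.auto` selects the notebook widget bar when running in Jupyter and falls back to the console bar elsewhere, while plain `tqdm` always renders the console bar.

```python
# The import pattern settled on here: notebook-friendly bar in Jupyter,
# console bar everywhere else.
from tqdm.auto import tqdm  # instead of: from tqdm import tqdm

for batch in tqdm(range(100), desc="Writing Documents"):
    pass  # e.g. write one batch of documents to the document store
```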

@AhmedIdr (Contributor, Author) commented:

Okay, done. I hope nothing went wrong when fetching the new commits from the main branch 😅

@julian-risch (Member) left a review:

Looks good to me and ready to merge. 👍
Thank you so much for your contribution to haystack! Having your pull request merged is a great achievement.

@julian-risch merged commit 56e4e84 into deepset-ai:master on Nov 29, 2021
@AhmedIdr (Contributor, Author) commented:

Thank you, and sorry it took a bit of time for everything to work properly 😅
