How to get embeddings from a dense retriever #226

Closed
Krak91 opened this issue Jul 14, 2020 · 5 comments

Comments

@Krak91
Contributor

Krak91 commented Jul 14, 2020

Hi, I've been trying to find out why the EmbeddingRetriever returns just an empty list. I found that the embeddings were indeed generated, but the retrieve() method tries to return documents instead. I had to change the method to look like this:
(haystack/retriever/dense lines 245-247)
def retrieve(self, query: str, filters: dict = None, top_k: int = 10, index: str = None) -> List[Document]:
    if index is None:
        index = self.document_store.index
    query_emb = self.embed(texts=[query])
    # documents = self.document_store.query_by_embedding(query_emb=query_emb[0], filters=filters,
    #                                                    top_k=top_k, index=index)
    # return documents
    return query_emb

I'm trying to understand this function's purpose - is it meant to return documents relevant to our input strings? Why is this trying to return documents?

@tholor
Member

tholor commented Jul 14, 2020

Yes, the goal of the retrieve() method is to return a list of Documents that are "similar" to our query. In the case of the DensePassageRetriever this means comparing the embedding of our query (query_emb) to the ones of the documents. Those document embeddings should have been previously created and stored in the DocumentStore as in this Tutorial:

# Important:
# Now that we have the DPR initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)
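
To make the query-time step concrete, here is a minimal sketch of comparing a query embedding against stored document embeddings. It is a toy stand-in, not Haystack's implementation: the function name `toy_query_by_embedding`, the plain dict used as a "store", and the dot-product scoring are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def toy_query_by_embedding(
    query_emb: List[float],
    doc_embs: Dict[str, List[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: rank stored documents by similarity to the query.

    doc_embs maps a document id to its precomputed embedding, i.e. what
    update_embeddings() is assumed to have written into the store.
    """
    scored = [(doc_id, dot(query_emb, emb)) for doc_id, emb in doc_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for the stored document vectors.
doc_embs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.7, 0.7],
}

results = toy_query_by_embedding([0.9, 0.1], doc_embs, top_k=2)

# With nothing stored, there is nothing to compare against.
empty_results = toy_query_by_embedding([0.9, 0.1], {}, top_k=2)
```

In this sketch, a store without any document embeddings yields an empty result list even though the query embedding itself was computed fine.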

@Krak91
Contributor Author

Krak91 commented Jul 14, 2020

In the case that we just want to generate embeddings for strings, could we go about passing the function an empty document store? (given that we return the embeddings instead of the docs as above)

@tholor
Member

tholor commented Jul 14, 2020

If you just want to create embeddings, you can use:

# queries
retriever.embed_queries(list_of_strings)
# passages
retriever.embed_passages(list_of_strings)

Note that for DPR the two methods use different encoder models, while for the EmbeddingRetriever both use the same one.
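
The difference can be illustrated with a small toy sketch. Everything here is hypothetical: `ToyRetriever` and the two hand-written "encoders" are stand-ins for real models, and only the method names `embed_queries` and `embed_passages` are taken from the snippet above.

```python
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[str], Vector]


def length_encoder(text: str) -> Vector:
    """Toy encoder: puts the text length in the first dimension."""
    return [float(len(text)), 0.0]


def vowel_encoder(text: str) -> Vector:
    """Toy encoder: puts the vowel count in the second dimension."""
    return [0.0, float(sum(ch in "aeiou" for ch in text.lower()))]


class ToyRetriever:
    """Hypothetical retriever holding one encoder per input type."""

    def __init__(self, query_encoder: Encoder, passage_encoder: Encoder):
        self.query_encoder = query_encoder
        self.passage_encoder = passage_encoder

    def embed_queries(self, texts: List[str]) -> List[Vector]:
        return [self.query_encoder(t) for t in texts]

    def embed_passages(self, texts: List[str]) -> List[Vector]:
        return [self.passage_encoder(t) for t in texts]


# DPR-like setup: separate encoders for queries and passages.
dpr_like = ToyRetriever(length_encoder, vowel_encoder)
# EmbeddingRetriever-like setup: one shared encoder for both.
shared = ToyRetriever(length_encoder, length_encoder)

texts = ["dense retrieval"]
dpr_query_vecs = dpr_like.embed_queries(texts)
dpr_passage_vecs = dpr_like.embed_passages(texts)
shared_query_vecs = shared.embed_queries(texts)
shared_passage_vecs = shared.embed_passages(texts)
```

In the DPR-like setup the same string gets a different vector depending on which method embeds it, so text that will be searched should go through the passage method and search strings through the query method. With a shared encoder the choice makes no difference.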

@Krak91
Contributor Author

Krak91 commented Jul 14, 2020

Great! thanks

@tholor tholor changed the title from "EmbeddingRetriever.retrieve()" to "How to get embeddings from a dense retriever" Jul 14, 2020
@tholor tholor closed this as completed Jul 14, 2020
@salbatarni

Hello,
it looks like retriever.embed_passages(list_of_strings) no longer works.
How can I get the embedding of a passage now?

3 participants