Skip to content

Commit

Permalink
proposal: New EmbeddingRetriever for Haystack 2.0 (#3558)
Browse files Browse the repository at this point in the history
* Add EmbeddingRetriever proposal

* Update with Sara's feedback

* Consistent naming
  • Loading branch information
vblagoje committed Dec 12, 2022
1 parent 77cea8b commit c28f668
Showing 1 changed file with 129 additions and 0 deletions.
129 changes: 129 additions & 0 deletions proposals/text/3558-embedding_retriever.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
- Start Date: 2022-11-11
- Proposal PR: https://github.com/deepset-ai/haystack/pull/3558
- Github Issue:
# Summary

- Current EmbeddingRetriever doesn't allow Haystack users to provide new embedding methods and is
currently constricted to farm, transformers, sentence transformers, OpenAI and Cohere based
embedding approaches. Any new encoding methods need to be explicitly added to Haystack
and registered with the EmbeddingRetriever.


- We should allow users to easily plug-in new embedding methods to EmbeddingRetriever. For example, a Haystack user should be able to
add custom embeddings without having to commit additional code to Haystack repository.

# Basic example
EmbeddingRetriever is instantiated with:

``` python
retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
model_format="sentence_transformers",
)
```
- The current approach doesn't provide a pluggable abstraction point of composition but
rather attempts to satisfy various embedding methodologies by having a lot of
parameters which keep ever expanding.


- The new approach allows creation of the underlying embedding mechanism (EmbeddingEncoder)
which is then in turn plugged into EmbeddingRetriever. For example:

``` python
encoder = SomeNewFancyEmbeddingEncoder(api_key="asdfklklja",
query_model="text-search-query",
doc_model="text-search-doc")
```

- EmbeddingEncoder is then used for the creation of EmbeddingRetriever. EmbeddingRetriever
init method doesn't get polluted with additional parameters as all of the peculiarities
of a particular encoder methodology are contained on in its abstraction layer.

``` python
retriever = EmbeddingRetriever(
document_store=document_store,
encoder=encoder
)
```

# Motivation

- Why are we doing this? What use cases does it support? What is the expected outcome?

We could certainly keep the current solution as is; it does implement a decent level
of composition/decoration to lower coupling between EmbeddingRetriever and the underlying
mechanism of embedding (sentence transformers, OpenAI, etc). However, the current mechanism
in place basically hard-codes available embedding implementations and prevents our users from
adding new embedding mechanism by themselves outside of Haystack repository. We also might
want to have a non-public dC embedding mechanism in the future. In the current design a non-public
dC embedding mechanism would be impractical. In addition, the more underlying implementations we
add we'll continue to "pollute" EmbeddingRetriever init method with more and more parameters.
This is certainly less than ideal long term.


- EmbeddingEncoder classes should be subclasses of BaseComponent! As subclasses of BaseComponent,
we can use them outside the EmbeddingRetriever context in indexing pipelines, generating the
embeddings. We are currently employing a kludge of using Retrievers which is quite counter-intuitive
and confusing for our users.


- EmbeddingEncoder classes might sound overly complicated, especially with a distinguishing mechanism
name pre-appended (i.e CohereEmbeddingEncoder). Therefore, we'll adopt <specific>Embedder
naming scheme, i.e. CohereEmbedder, SentenceTransformerEmbedder and so on.

# Detailed design

- Our new EmbeddingRetriever would still wrap the underlying encoding mechanism in the form of
_BaseEmbedder. _BaseEmbedder still needs to implement methods:
- embed_queries
- embed_documents


- The new design approach differs is in the creation of EmbeddingRetriever - rather than hiding the underlying encoding
mechanism one could simply create the EmbeddingRetriever with a specific encoder directly. For example:

```
retriever = EmbeddingRetriever(
document_store=document_store,
encoder=OpenAIEmbedder(api_key="asdfklklja", model="ada"),
#additional EmbeddingRetriever-abstraction-level parameters
)
```

- If the "two-step approach" of EmbeddingRetriever initialization is no longer the ideal solution (issues with current
schema generation and loading/saving via YAML pipelines) we might simply add the EmbeddingRetriever
class for every supported encoding approach. For example, we could have OpenAIEmbeddingRetriever, CohereEmbeddingRetriever,
SentenceTransformerEmbeddingRetriever and so on. Each of these retrievers will delegate the bulk of the work to an
existing EmbeddingRetriever with a per-class-specific Embedder set in the class constructor (for that custom
encoding part). We'll get the best of both worlds. Each <Specific>EmeddingRetriever will have only the relevant primitives
parameters for the **init()** constructor; the underlying EmbeddingRetriever attribute in <Specific>EmeddingRetriever
will handle most of the business logic of retrieving, yet each retriever will use an appropriate per-class-specific
Embedder for the custom encoding part.



# Drawbacks
- The main shortcoming are:
- The "two-step approach" in EmbeddingRetriever initialization
- Likely be an issue for the current schema generation and loading/saving via YAML pipelines (see solution above)
- It is a API breaking change so it'll require code update for all EmbeddingRetriever usage both in our codebase and for Haystack users
- Can only be done in major release along with other breaking changes

# Alternatives

We could certainly keep everything as is :-)

# Adoption strategy
- As it is a breaking change, we should implement it for the next major release.

# How do we teach this?
- This change would require only a minor change in documentation.
- The concept of embedding retriever remains, just the mechanics are slightly changed
- All docs and tutorials need to be updated
- Haystack users are informed about a possibility to create and use their own embedders for embedding retriever.
- # Unresolved questions

Optional, but suggested for first drafts. What parts of the design are still
TBD?

0 comments on commit c28f668

Please sign in to comment.