Add rerank and sentence-similarity tasks to text embedding module #235
Conversation
    },
    output_type=RerankPrediction,
)
class RerankTask(TaskBase):
nit: Can you please add a bit of description of what this task is supposed to do, either as a docstring or as module docs?
added
caikit_nlp/data_model/reranker.py
Outdated
class RerankScore(DataObjectBase):
    """The score for one document (one query)"""

    document: JsonDict
Why does `document` need to be `JsonDict`?
The desired API is a document that is JSON with a `text` (or alternative `_text`) field used for ranking, while the rest of the document is typically returned reranked. `JsonDict` works for me for both gRPC and REST while allowing different input/output value types, even nested ones. I'm not sure if you are recommending a preferred alternative.
If you are just wondering why not pass only `text` and return an index, then I understand (and agree), but that isn't the requested API for the rerank use case.
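To illustrate the shape being described (a hypothetical payload, not taken from the PR): only the `text`/`_text` field drives scoring, and every other field is arbitrary pass-through metadata.

```python
# A rerank "document" is an arbitrary JSON object. Only "text"
# (or the "_text" fallback) is used for ranking; everything else
# is pass-through metadata returned with the reranked result.
document = {
    "text": "Caikit is an AI toolkit for serving models.",
    "title": "About Caikit",  # arbitrary metadata, returned as-is
    "doc_id": 42,             # values may be any JSON type, even nested
}

# Fallback mirroring the module: prefer "text", else "_text", else ""
ranking_text = document.get("text") or document.get("_text", "")
```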
caikit_nlp/data_model/reranker.py
Outdated
class RerankQueryResult(DataObjectBase):
    """Result for one query in a rerank task"""

    scores: List[RerankScore]


@dataobject(package="caikit_data_model.caikit_nlp")
@dataclass
class RerankPrediction(DataObjectBase):
    """Result for a rerank task"""

    results: List[RerankQueryResult]
Is the `results` here for one query or one document, and what is the relation between one query, one document, and one result?
1 query with the top_n document results in order of relevance for that query.
Edit: I was looking at the wrong part of the code snippet. 1 query for n docs is my explanation for `RerankQueryResult`. I think the question was about `RerankPrediction`, which is a list of `RerankQueryResult` corresponding to the input list of queries.
I will expand the docstring and rename `RerankPrediction` --> `RerankPredictions`.
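As a sketch of the resulting nesting (plain dicts standing in for the data objects; all values are made up):

```python
# RerankPredictions: one RerankQueryResult per input query, in query
# order; each holds top_n RerankScore entries, most relevant first.
predictions = {
    "results": [
        {
            "query": "what is caikit?",
            "scores": [
                {"index": 2, "score": 0.91, "document": {"text": "doc about caikit"}},
                {"index": 0, "score": 0.47, "document": {"text": "unrelated doc"}},
            ],
        },
    ],
}
```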
done
    """Initialize
    This function gets called by `.load` and `.train` function
nit: indenting of "Initialize" vs the second line
nit: this function also gets called from the `bootstrap` function
Removed. It's not good practice to try to document all usages, and `init()` doesn't need "Initialize" called out.
    queries: List[str],
    documents: List[JsonDict],
    top_n: Optional[int] = None,
) -> RerankPrediction:
    """Run inference on model.
    Args:
        queries: List[str]
        documents: List[JsonDict]
        top_n: Optional[int]
A bit confused by the input and output requirements here. Can you please add in the docstring what the query is supposed to be and what the document is supposed to be? Also, in the `Returns` section, can you please add information about what the `RerankPrediction` is actually giving?
done and also renamed RerankPrediction -> RerankPredictions because this is used to provide a result for each of the queries (plural).
        top_n = len(documents)

    # Using input document dicts so get "text" else "_text" else default to ""
    doc_texts = [srd.get("text") or srd.get("_text", "") for srd in documents]
Is the requirement to handle `text` and `_text` because of `JsonDict`?
The comparison is typically with the "text" field of the JSON document. We also have cases where "_text" is used instead. So this implementation uses "text" if found, and otherwise uses "_text" as the alternate.
Currently, if neither is found, text="" is used. This almost makes sense, but if that is a real use case we might need to handle it better.
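One subtlety of the `or`-based fallback worth noting: an empty (or otherwise falsy) `"text"` value also falls through to `"_text"`, not just a missing key. A quick sketch with made-up documents:

```python
docs = [
    {"text": "primary text"},           # "text" present -> used
    {"_text": "alternate text"},        # no "text" -> "_text" used
    {"text": "", "_text": "fallback"},  # empty "text" is falsy -> "_text" used
    {"title": "no text at all"},        # neither field -> ""
]

# Same expression as in the module under review
doc_texts = [d.get("text") or d.get("_text", "") for d in docs]
# doc_texts == ["primary text", "alternate text", "fallback", ""]
```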
    doc_embeddings = self.model.encode(doc_texts, convert_to_tensor=True)
    doc_embeddings = doc_embeddings.to(self.model.device)
    doc_embeddings = normalize_embeddings(doc_embeddings)

    query_embeddings = self.model.encode(queries, convert_to_tensor=True)
    query_embeddings = query_embeddings.to(self.model.device)
    query_embeddings = normalize_embeddings(query_embeddings)
`device` is not an input to any of the entry functions (`load`, `bootstrap`), so using it to move tensors to any device seems unnecessary, since everything will be on CPU by default unless moved.
sentence-transformers automatically does the "if CUDA is available, use GPU" logic (with env var controls). So here we use the device that was set on `self.model.device` and move all the embeddings there with `.to()` as well before scoring. This follows the sentence-transformers optimization examples.
Performance evaluation for whether or not to use GPU here is not done yet, but we at least have environment control (e.g., per pod).
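For context, once both sets of embeddings are L2-normalized, relevance scoring reduces to a dot product (cosine similarity), and top_n reranking is an argsort. A minimal NumPy sketch with made-up 3-dimensional embeddings (not the module's actual code):

```python
import numpy as np

def normalize(v):
    """L2-normalize rows so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Hypothetical embeddings: 2 queries, 3 documents, dim 3
query_embeddings = normalize(np.array([[1.0, 0.0, 0.0],
                                       [0.0, 1.0, 0.0]]))
doc_embeddings = normalize(np.array([[0.9, 0.1, 0.0],
                                     [0.0, 1.0, 0.1],
                                     [0.5, 0.5, 0.0]]))

# Cosine score matrix, shape (n_queries, n_docs)
scores = query_embeddings @ doc_embeddings.T

# For each query, document indices ranked by descending score (top_n rerank)
top_n = 2
ranked = np.argsort(-scores, axis=1)[:, :top_n]
# ranked == [[0, 2], [1, 2]]
```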
        self.model = model

    @classmethod
    def load(cls, model_path: str) -> "Rerank":
Since loading of this is exactly the same as `embedding`, can we directly use that module's function and internally initialize `embedding`'s module? That way we don't have to duplicate this code between the two modules.
Now that we have multi-task, I've combined them.
The embedding PR is open, and now I'm stacking this PR on that one. I wanted to do smaller PRs, but of course there is a dependency, and I think for rerank and sentence-similarity it makes sense to see both at once for review/discussion.
    if len(queries) < 1 or len(documents) < 1:
        return RerankPrediction([])

    if top_n is None or top_n < 1:
this default behavior is undocumented. Can we please add this in docstrings?
done
Checked feedback again and added more docstring for some things I had missed. Thanks @gkumbhat
Left a few minor comments. Looks great overall. My only real concern is using sentence-transformers, since it doesn't look like it's being actively maintained.
        return cls(data=data)

    @classmethod
    def from_json(cls, json_str):
Suggested change:

    ...
    from typing import Union, Any
    ...

-   def from_json(cls, json_str):
+   def from_json(cls, json_data: Union[dict[str, Any], str]) -> "Vector1D":
Adding type hints might also be useful for the other classmethods
done, but didn't rename the arg. Want to keep it in sync with the base.
caikit_nlp/data_model/reranker.py
Outdated
from caikit.core.data_model.json_dict import JsonDict


@dataobject()
Do these `@dataobject()` declarations also need arguments, as in other parts of the code? e.g. `@dataobject(package="caikit_data_model.caikit_nlp")`
good catch. Updated them all to specify this package (fwiw).
pyproject.toml
Outdated
@@ -24,6 +24,7 @@ dependencies = [
     "pandas>=1.5.0",
     "scikit-learn>=1.1",
     "scipy>=1.8.1",
+    "sentence-transformers~=2.2.2",
I'm not familiar with sentence-transformers, but it looks like the last release was in June 2022 and work on this project has been quite slow since 2021. Are we sure this is the only/best choice?
I think the popularity still goes to sentence-transformers (unless my searches are biased), but the maintenance activity does seem slow. The workaround is to use transformers feature-extraction directly and add mean pooling and normalization (and cosine, dot_score, ...). Hugging Face seems to prefer to defer to sentence-transformers for now, as far as I can tell. Old requests to create HF pipelines to replace sentence-transformers were rejected.
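The workaround described here (mean pooling plus normalization on top of raw transformer outputs) can be sketched with NumPy alone; the arrays below are made-up stand-ins for a model's token embeddings and attention mask:

```python
import numpy as np

# Hypothetical transformer output: (batch=1, seq_len=4, hidden=3)
token_embeddings = np.array([[[1.0, 2.0, 3.0],
                              [3.0, 2.0, 1.0],
                              [0.0, 0.0, 0.0],    # padding token
                              [0.0, 0.0, 0.0]]])  # padding token
attention_mask = np.array([[1, 1, 0, 0]])          # 1 = real token

# Mean pooling: sum real-token embeddings, divide by real-token count
mask = attention_mask[..., None]                   # broadcast over hidden dim
summed = (token_embeddings * mask).sum(axis=1)
counts = mask.sum(axis=1)
sentence_embedding = summed / counts               # mean over real tokens only

# L2-normalize so that a dot product equals cosine similarity
sentence_embedding /= np.linalg.norm(sentence_embedding, axis=1, keepdims=True)
```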
    @EmbeddingTasks.taskmethod()
    def run_embeddings(
        self, texts: List[str]  # pylint: disable=redefined-builtin
    ) -> ListOfVector1D:
Suggested change:

-   @EmbeddingTasks.taskmethod()
-   def run_embeddings(
-       self, texts: List[str]  # pylint: disable=redefined-builtin
-   ) -> ListOfVector1D:
+   @EmbeddingTasks.taskmethod()
+   def run_embeddings(self, texts: List[str]) -> ListOfVector1D:
Is disabling the redefined-builtin warning actually required?
Another good catch! Previously it was "input", which is a builtin. So I removed all the obsolete pylint comments.
    @EmbeddingTask.taskmethod()
    def run_embedding(
        self, text: str
    ) -> EmbeddingResult:  # pylint: disable=redefined-builtin
Is disabling the redefined-builtin warning actually required?
Suggested change:

-   @EmbeddingTask.taskmethod()
-   def run_embedding(
-       self, text: str
-   ) -> EmbeddingResult:  # pylint: disable=redefined-builtin
+   @EmbeddingTask.taskmethod()
+   def run_embedding(self, text: str) -> EmbeddingResult:
removed
# Local
from caikit_nlp.data_model.reranker import RerankPredictions, RerankQueryResult

logger = alog.use_channel("<SMPL_BLK>")
What does `SMPL_BLK` stand for?
It was copied from a sample ("SMPL"). I've removed these copy/paste errors because they were not needed.
Thanks again for the good eyes.
@task(
    required_parameters={
        "documents": List[JsonDict],
As above, a `Document` type might make this field easier to handle.
See above; the short version is that I need the flexibility of any key, any value, and this provides that.
random_numpy_vector1d_float32 = random_number_generator.random(
    DUMMY_VECTOR_SHAPE, dtype=np.float32
)
random_numpy_vector1d_float64 = random_number_generator.random(
    DUMMY_VECTOR_SHAPE, dtype=np.float64
)
random_python_vector1d_float = random_numpy_vector1d_float32.tolist()
These could be pytest fixtures
done
tests/data_model/test_reranker.py
Outdated
input_document = {
    "text": "this is the input text",
    "_text": "alternate _text here",
    "title": "some title attribute here",
    "anything": "another string attribute",
    "str_test": "test string",
    "int_test": 1234,
    "float_test": 9876.4321,
}

key = "".join(random.choices(string.ascii_letters, k=20))
value = "".join(random.choices(string.printable, k=100))
input_random_document = {
    "text": "".join(random.choices(string.printable, k=100)),
    "random_str": "".join(random.choices(string.printable, k=100)),
    "random_int": random.randint(-99999, 99999),
    "random_float": random.uniform(-99999, 99999),
}

input_documents = [input_document, input_random_document]

input_score = {
    "document": input_document,
    "index": 1234,
    "score": 9876.54321,
    "text": "this is the input text",
}

input_random_score = {
    "document": input_random_document,
    "index": random.randint(-99999, 99999),
    "score": random.uniform(-99999, 99999),
    "text": "".join(random.choices(string.printable, k=100)),
}

input_random_score_3 = {
    "document": {"text": "random foo3"},
    "index": random.randint(-99999, 99999),
    "score": random.uniform(-99999, 99999),
    "text": "".join(random.choices(string.printable, k=100)),
}

input_scores = [dm.RerankScore(**input_score), dm.RerankScore(**input_random_score)]
input_scores2 = [
    dm.RerankScore(**input_random_score),
    dm.RerankScore(**input_random_score_3),
]

input_result_1 = {"query": "foo", "scores": input_scores}
input_result_2 = {"query": "bar", "scores": input_scores2}
input_results = [
    dm.RerankQueryResult(**input_result_1),
    dm.RerankQueryResult(**input_result_2),
]

input_sentence_similarity_scores_1 = {
    "scores": [random.uniform(-99999, 99999) for _ in range(10)]
}
input_sentence_similarity_scores_2 = {
    "scores": [random.uniform(-99999, 99999) for _ in range(10)]
}

input_sentence_similarities_scores = [
    dm.SentenceScores(**input_sentence_similarity_scores_1),
    dm.SentenceScores(**input_sentence_similarity_scores_2),
]
All of these could be pytest fixtures
done
tests/data_model/test_reranker.py
Outdated
def assert_fields_match(data_object, inputs):
    for k, v in inputs.items():
        assert getattr(data_object, k) == inputs[k]
Suggested change:

-   def assert_fields_match(data_object, inputs):
-       for k, v in inputs.items():
-           assert getattr(data_object, k) == inputs[k]
+   def assert_fields_match(data_object, inputs):
+       assert all(getattr(data_object, key) == value for key, value in inputs.items())
done
I think I addressed all the comments in code and/or in reply. Sorry for the delay. I thought I was nearly done and then found a bunch that had been collapsed. Thanks again for the advice!
README.md
Outdated
| EmbeddingTask | 1. `TextEmbedding` | 1. text/embedding from a local sentence-transformers model |
| EmbeddingTasks | 1. `TextEmbedding` | 1. Same as EmbeddingTask but multiple sentences (texts) as input and corresponding list of outputs. |
Can we combine these? Like we do for prompt tuning?
Updated README.md, please take a look.
README.md
Outdated
| SentenceSimilarityTask | 1. `TextEmbedding` | 1. text/sentence-similarity from a local sentence-transformers model (Hugging Face style API returns scores only in order of input sentences) |
| SentenceSimilarityTasks | 1. `TextEmbedding` | 1. Same as SentenceSimilarityTask but multiple source_sentences (each to be compared to same list of sentences) as input and corresponding lists of outputs. |
| RerankTask | 1. `TextEmbedding` | 1. text/rerank from a local sentence-transformers model (Cohere style API returns top_n scores in order of relevance with index to source and optionally returning inputs) |
| RerankTasks | 1. `TextEmbedding` | 1. Same as RerankTask but multiple queries as input and corresponding lists of outputs. Same list of documents for all queries. |
Same here
Updated README.md, please take a look.
@dataobject(package="caikit_data_model.caikit_nlp")
class SentenceScores(DataObjectBase):
    scores: List[float]


@dataobject(package="caikit_data_model.caikit_nlp")
class SentenceListScores(DataObjectBase):

    results: List[SentenceScores]
Just realized `producer_id` isn't present in these. Maybe you can add it in your caikit interface PR.
Looks like that is only used in text_gen and only on the task output objects. I'll try following that pattern if that is the preferred thing going forward.
Left 2 small comments. Other than that, it LGTM.
@@ -0,0 +1,7 @@
# These can be installed with --no-deps.
hm, is this to separate out dependencies before we add other mechanisms?
I did this because there was some concern about dependencies, but I'll admit this probably is not the best way to deal with it. I'd actually recommend that I remove this and add extras handling like caikit in a separate PR. Is that preferred? Or is it better to just accept the dependencies?
def run_rerank_queries(
    self,
    queries: List[str],
    documents: List[JsonDict],
Are `JsonDict` and `dict` the same? Wondering if the pure-Python experience of using these functions will be problematic, i.e. if I want to use this function in a notebook or something, will I need to first convert my documents to `JsonDict`?
Caikit won't allow `Dict[str, Any]`. I would have to provide data objects to wrap `Any`, and that is exactly what `JsonDict` does for me. So this only works with `JsonDict`, unless there is some smaller piece where I can use `dict` and hide part of the `JsonDict` usage; but even if that is allowed, we'd still have `JsonDict` output, so I think that is not helpful.
So in general, this must be the caikit way, but if you think I can back off part of that and use plain Python, please clarify and I'll do some more testing to see where that breaks down. I'd love to use plain Python.
Signed-off-by: markstur <[email protected]>
* Less data objects and more primitives * Fixes str,str limitation in the input JSON * Add tests * More ready for review changes Signed-off-by: markstur <[email protected]>
* Tests * Work on save() Signed-off-by: markstur <[email protected]>
* Error message had wrong var in f-string message * Added test to catch that mistake * Added save tests and empty queries/docs test to complete coverage Signed-off-by: markstur <[email protected]>
* rerank run() will only do one query * adding reranks run_queries() for multiple queries with multi-task (coming soon) Signed-off-by: markstur <[email protected]>
…-task * The EmbeddingModule now does all 3 tasks (same loaded model) * An additional 3 tasks allow multiple texts, source_sentences, or queries. - the documents or sentences compared to are the same for each * Added more docs Signed-off-by: markstur <[email protected]>
Signed-off-by: markstur <[email protected]>
Signed-off-by: markstur <[email protected]>
* More docstrings to help code readers (doc viewers?) * Renamed RerankPrediction -> RerankPredictions since plural is better as it is being used for multiple queries each with a RerankQueryResult with scores. Signed-off-by: markstur <[email protected]>
* Some misc clean-up based on review feedback * Use pytest fixtures in the tests Signed-off-by: markstur <[email protected]>
Signed-off-by: markstur <[email protected]>
Signed-off-by: markstur <[email protected]>
* Handling ModuleNotFound so that we can move extras to extras in the future * Testing with pip install --nodeps of only the minimum (probably to be replaced with full import of sentence-transformers in extras in the future) Signed-off-by: markstur <[email protected]>
* Moved interfaces (tasks and datamodels) to caikit * Updated code here to the new interfaces with added producer_id and related changes to the data models Signed-off-by: markstur <[email protected]>
…rfaces Signed-off-by: markstur <[email protected]>
Signed-off-by: markstur <[email protected]>
Signed-off-by: markstur <[email protected]>
LGTM
Add rerank and sentence-similarity tasks to text embedding module.
This PR is stacked on the first embedding PR #224.
Since embedding, sentence-similarity, and rerank all work with a sentence-transformer model this module will load a model and is able to run multiple tasks.
In addition to the 3 (embed, rerank, sentence-similarity), there are another 3 so that each is not limited to a single input string. In the case of sentence-similarity and rerank, this means that a list of source_sentences or queries (respectively) is each applied against the same list of sentences or documents. So there is a real benefit: many queries can be sent against a large collection of documents. In the case of embeddings, this is simple batching.
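The singular-vs-plural task shape described above can be sketched in plain Python. Both function names and the word-overlap scoring below are hypothetical stand-ins (the real module scores with embedding similarity); only the shape of the plural task is being illustrated.

```python
def run_rerank_query(query, documents, top_n=None):
    """Hypothetical singular task: rank documents for one query.

    Stand-in scoring (word overlap) replaces the real embedding
    similarity; only the task *shape* matters here."""
    scores = []
    for i, doc in enumerate(documents):
        text = doc.get("text") or doc.get("_text", "")
        overlap = len(set(query.lower().split()) & set(text.lower().split()))
        scores.append({"index": i, "score": overlap, "document": doc})
    scores.sort(key=lambda s: -s["score"])  # most relevant first
    return scores[: top_n or len(documents)]

def run_rerank_queries(queries, documents, top_n=None):
    """Plural task: every query runs against the SAME document list;
    results come back one per query, in query order."""
    return [run_rerank_query(q, documents, top_n) for q in queries]
```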
Text Embedding Module
Implements the following tasks:
a list of outputs
More details for sentence-similarity and rerank...
sentence-similarity is a common and simple concept (see Hugging Face or Sentence Transformers)
class SentenceSimilarityTask(TaskBase):
    """Compare the source_sentence to each of the sentences.
    Result contains a list of scores in the order of the input sentences.
    """

@task(
    required_parameters={"source_sentences": List[str], "sentences": List[str]},
    output_type=SentenceListScores
)
class SentenceSimilarityTasks(TaskBase):
    """Compare each of the source_sentences to each of the sentences.
    Returns a list of results in the order of the source_sentences.
    Each result contains a list of scores in the order of the input sentences.
    """
rerank is less intuitive, but is popular for RAG and chaining. One of the more popular rerank APIs is from Cohere. This implementation is similar to their API.
class RerankTask(TaskBase):
    """Returns an ordered list ranking the most relevant documents for the query"""

class RerankTasks(TaskBase):
    """Returns an ordered list for each query ranking the most relevant documents for the query"""