# Sparse retrieval: BM25

We have already introduced the simplest sparse representations for retrieval: Bag-of-words and TF-IDF.

Let's recall the idea of TF-IDF:

Build a vocabulary composed of all the words in your collection of documents. For each document, construct a vector with the same length as the vocabulary.

- Put a 0 for each vocabulary word missing from the document.
- If the word is present, put a weight that grows with how often the word is repeated in the document and shrinks with how often the term appears across the corpus.
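The steps above can be sketched in a few lines of Python. This is a minimal illustration of the idea (raw term frequency times log-scaled inverse document frequency); real libraries add smoothing and normalization, and the helper name `tfidf_vectors` is just for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one TF-IDF vector per document over the corpus vocabulary."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    n = len(docs)
    # Number of documents that contain each word.
    df = {w: sum(w in doc.split() for doc in docs) for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        # 0 for missing words; otherwise term frequency times inverse document frequency.
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vecs = tfidf_vectors(docs)
```

Note that a word like "the", which appears in every document, gets a weight of 0 everywhere: it carries no information for retrieval.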

BM25 was developed in the 70s and 80s as an evolution of TF-IDF. This algorithm represents the de facto standard for sparse retrieval, and it is commonly and effectively used in real-world search engines. One of its most common implementations is found in Apache Lucene, the engine 🚂 that powers Elasticsearch, OpenSearch and Apache Solr!

## BM25 Formula

The formula may look scary, but in practice it focuses on two aspects that TF-IDF ignores: term saturation and document length normalization.
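For reference, the commonly used form of the BM25 score for a query $Q = q_1, \ldots, q_n$ and a document $D$ is:

$$
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\operatorname{avgdl}}\right)}
$$

where $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length, and $\operatorname{avgdl}$ is the average document length in the corpus. The two free parameters, $k_1$ and $b$, correspond to the two aspects discussed next.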

### Term Saturation

Is a word that appears in the document 100 times much more relevant than another word that appears 20 times?

We can reasonably assume that once a term occurs more often than a certain saturation threshold, additional occurrences do not make the document more relevant to that term.

In BM25, the term saturation is determined by the parameter k1.
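The saturating term-frequency component of BM25 can be seen in isolation (a sketch that omits the length-normalization factor; the function name and the default `k1=1.2` are illustrative):

```python
def tf_saturation(tf, k1=1.2):
    """BM25's saturating term-frequency component, length factor omitted.

    As tf grows, the value approaches the ceiling k1 + 1, so each extra
    occurrence contributes less and less; k1 controls how fast it saturates.
    """
    return tf * (k1 + 1) / (tf + k1)

# 100 occurrences score only slightly higher than 20 occurrences:
# both are already close to the k1 + 1 = 2.2 ceiling.
print(tf_saturation(20))
print(tf_saturation(100))
```

With plain TF-IDF, by contrast, the 100-occurrence term would weigh five times as much as the 20-occurrence one.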

### Document length normalization

A certain document contains the word "cake" once and is one page long. Another document contains the same word once but is 100 pages long. Which document is more relevant to the query "cake"? 🍰

BM25 takes document length into account, penalizing longer documents through the parameter b.
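Putting both pieces together, here is a minimal BM25 scorer illustrating the "cake" example above. This is a sketch, not a production implementation: the function name is illustrative, and the IDF variant (the log-smoothed one used by Lucene) is one common choice among several.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized document against a query with BM25.

    docs is the whole corpus (lists of tokens), needed to compute the
    inverse document frequency and the average document length.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)  # documents containing the term
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        # b blends between no length penalty (b=0) and full penalty (b=1).
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

short_doc = ["cake"] + ["filler"] * 4
long_doc = ["cake"] + ["filler"] * 99
corpus = [short_doc, long_doc]
# Same single occurrence of "cake", but the shorter document scores higher.
```

Setting `b = 0` disables length normalization entirely, in which case both documents would receive the same score.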

## Resources