Information retrieval (IR) is finding material (usually documents) of an unstructured nature . . . that satisfies an information need from within large collections (usually stored on computers).
Basic assumptions of IR
- Collection: Fixed set of documents
- Goal: Retrieve documents with information that's relevant to the user's information need and helps the user to complete a task
This project has been developed on windows 10 pro, visual studio 2019 and C#
The Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most adopted one. It is used by many IR systems to this day
In the Boolean retrieval model we can pose any query in the form of a Boolean expression of term i.e., one in which terms are combined with the operators and, or, and not.
Basic Assumption of Boolean Model
- An index term is either present(1) or absent(0) in the document
- Queries are Boolean combinations of index terms.
- X AND Y: represents doc that contains both X and Y
- X OR Y: represents doc that contains either X or Y
- NOT X: represents the doc that do not contain X
Brutus AND Caesar AND NOT Calpurnia
Suppose you wanted to determine which plays of Shakespeare contain the words Brutus and Caesar and not Calpurnia. One way to do that is to start at the beginning and to read through all the text,
The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as GREPPING through text
-
The way to avoid lineraly scanning the text for each query is to
index
the documents in advance. -
Suppose we record for each document whether it contains each word out of all the words that may be used.
-
The result is a binary term-document
incidence matrix
. -
Terms are the indexed units. They are usually words.
-
Documents to be indexed.
-
Token stream
-
Modefied tokens (Stemming).
-
Indexer (incidence matrix).
You can find source code here, download and run on visual studio's default dlls