A search engine written in C++ that runs over Wikipedia data, using the pugixml XML parser.
Create a C++ project in Visual Studio or any IDE of your choice, add the source code, libraries, and Wikipedia data, and you are good to go.
pugixml
Wikipedia dump data
Parse the query.
Convert words into wordIDs
Tokenizing the text data
Seek to the start of the doclist in the short barrel for every word.
Scan through the doclists until there is a document that matches all the search terms.
Compute the rank of that document for the query.
Sort the documents that have matched by rank and return the top k.
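The query steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Index`, `Posting`, `parseQuery`, and `search` names are hypothetical, the index is assumed to live in memory rather than in barrel files, each doclist is assumed to be sorted by document ID, and the rank is a placeholder (the summed hit count of the query words).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types; the project's real structures may differ.
using WordId = int;
using DocId = int;

struct Posting {
    DocId doc;  // document containing the word
    int hits;   // number of occurrences of the word in that document
};

// Assumption: every doclist is sorted by ascending DocId.
using DocList = std::vector<Posting>;

struct Index {
    std::unordered_map<std::string, WordId> lexicon;  // word -> wordID
    std::unordered_map<WordId, DocList> doclists;     // wordID -> doclist
};

// Parse the query: split on whitespace, lowercase, and map words to wordIDs.
// Returns false if a word is unknown, since no document can then match all terms.
bool parseQuery(const Index& index, const std::string& query, std::vector<WordId>& ids) {
    std::istringstream in(query);
    std::string word;
    while (in >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        auto it = index.lexicon.find(word);
        if (it == index.lexicon.end()) return false;
        ids.push_back(it->second);
    }
    return !ids.empty();
}

// Scan the doclists in parallel, score documents present in all of them,
// then sort by score and keep the top k.
std::vector<std::pair<DocId, int>> search(const Index& index, const std::string& query,
                                          std::size_t k) {
    std::vector<std::pair<DocId, int>> results;
    std::vector<WordId> ids;
    if (!parseQuery(index, query, ids)) return results;

    std::vector<const DocList*> lists;
    for (WordId id : ids) {
        auto it = index.doclists.find(id);
        if (it == index.doclists.end()) return results;
        lists.push_back(&it->second);
    }
    std::vector<std::size_t> pos(lists.size(), 0);  // one cursor per doclist

    auto anyExhausted = [&]() {
        for (std::size_t i = 0; i < lists.size(); ++i)
            if (pos[i] >= lists[i]->size()) return true;
        return false;
    };

    while (!anyExhausted()) {
        // The candidate is the largest DocId currently under any cursor.
        DocId target = (*lists[0])[pos[0]].doc;
        for (std::size_t i = 1; i < lists.size(); ++i)
            target = std::max(target, (*lists[i])[pos[i]].doc);

        // Advance every cursor that is behind the candidate.
        bool allMatch = true;
        for (std::size_t i = 0; i < lists.size(); ++i) {
            while (pos[i] < lists[i]->size() && (*lists[i])[pos[i]].doc < target) ++pos[i];
            if (pos[i] == lists[i]->size() || (*lists[i])[pos[i]].doc != target) allMatch = false;
        }

        if (allMatch) {
            int score = 0;  // placeholder rank: total hits across the query words
            for (std::size_t i = 0; i < lists.size(); ++i) {
                score += (*lists[i])[pos[i]].hits;
                ++pos[i];
            }
            results.emplace_back(target, score);
        }
    }

    // Highest score first; ties broken by smaller DocId.
    std::sort(results.begin(), results.end(),
              [](const std::pair<DocId, int>& a, const std::pair<DocId, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    if (results.size() > k) results.resize(k);
    return results;
}
```

A call such as `search(index, "search engine", 10)` would return up to ten `(DocId, score)` pairs. A single-word query goes through the same path with one doclist.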
Creating a document object, loading the XML file, and building a tree
Tokenizing the text data
Forward indexing (a list of the terms contained in a particular document) and inverted indexing (a list of the documents containing a given term)
Processing the query
Traversing the pages, i.e. crawling
Single-word and multiword search queries
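As a rough illustration of the tokenizing and indexing features listed above, the sketch below builds a forward index and then inverts it. The function and type names (`tokenize`, `buildForwardIndex`, `invert`, `ForwardIndex`, `InvertedIndex`) are hypothetical and not taken from the project. The page text is assumed to have been extracted from the XML already, and the tokenizer only handles ASCII letters and digits, so non-ASCII bytes act as separators.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

using DocId = int;

// Split text into lowercase ASCII alphanumeric tokens.
// Every other byte (punctuation, whitespace, non-ASCII) is treated as a separator.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Forward index: document -> terms it contains, in order of appearance.
using ForwardIndex = std::map<DocId, std::vector<std::string>>;
// Inverted index: term -> documents containing it, in ascending DocId order.
using InvertedIndex = std::unordered_map<std::string, std::vector<DocId>>;

// Build the forward index from already-extracted page text (DocId -> text).
ForwardIndex buildForwardIndex(const std::map<DocId, std::string>& pages) {
    ForwardIndex forward;
    for (const auto& page : pages) forward[page.first] = tokenize(page.second);
    return forward;
}

// Invert the forward index. std::map iterates in ascending DocId order,
// so checking the last entry is enough to avoid duplicate DocIds per term.
InvertedIndex invert(const ForwardIndex& forward) {
    InvertedIndex inverted;
    for (const auto& entry : forward) {
        for (const std::string& term : entry.second) {
            std::vector<DocId>& docs = inverted[term];
            if (docs.empty() || docs.back() != entry.first) docs.push_back(entry.first);
        }
    }
    return inverted;
}
```

With pages `{1: "Apple banana apple", 2: "Banana cherry"}`, the inverted index maps `apple` to document 1, `banana` to documents 1 and 2, and `cherry` to document 2. A multiword query can then be answered by intersecting the document lists of its terms.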