An information retrieval search engine built on Wikipedia for our Information Retrieval class (2023).
This project is built on the Wikipedia corpus. The preprocessed data is available at: https://console.cloud.google.com/storage/browser/wiki_preproccess
The processed indices are available at: https://console.cloud.google.com/storage/browser/wiki_irt_data
To run the engine:
- Follow the instructions in run_frontend_in_gcp.sh to start an instance on GCP
- Upload the search engine files to the instance
- Run: python3 search_frontend.py
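Once the frontend is running, queries can be issued over HTTP. The snippet below is only a sketch: it assumes the frontend exposes a GET `/search` endpoint on port 8080 that accepts a `query` parameter and returns JSON; adjust the host, port and route to your deployment.

```python
# Illustrative only: the endpoint name, port and response format are assumptions.
import requests

EXTERNAL_IP = "YOUR_INSTANCE_EXTERNAL_IP"  # external IP of the GCP instance
resp = requests.get(
    f"http://{EXTERNAL_IP}:8080/search",       # assumed route served by search_frontend.py
    params={"query": "information retrieval"},
    timeout=30,
)
print(resp.json())                             # assumed list of retrieved documents
```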
- Data:
- Title, body and anchor inverted indices, in both stemmed and unstemmed ("clean") versions (assumed shapes sketched below)
- Document title, document length, PageRank and page view indices
- Word vector index (used for query expansion)
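The indices above are stored in the GCS buckets linked at the top. The on-disk format is not described here, so the sketch below only illustrates the in-memory shapes assumed throughout this README; all field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class EngineData:
    # Inverted indices: term -> [(doc_id, term_frequency), ...]
    title_index: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    body_index: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    anchor_index: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    # Per-document metadata
    doc_title: Dict[int, str] = field(default_factory=dict)
    doc_len: Dict[int, int] = field(default_factory=dict)
    page_rank: Dict[int, float] = field(default_factory=dict)
    page_views: Dict[int, int] = field(default_factory=dict)
```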
- Retrieval:
- Tokenize and stem query
- Retrieve posting lists
- Expand the query (using Word2Vec; see Notable features below)
- Calculate a binary score for the title index
- Calculate a BM25 score for the body text
- Retrieve the top-scoring documents from both methods (see the sketch below)
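A minimal sketch of the retrieval step, assuming posting lists shaped as above (term -> [(doc_id, tf), ...]) and per-document lengths; the stopword list, stemmer and BM25 parameters are illustrative, not the project's exact configuration.

```python
import math
import re
from collections import Counter, defaultdict

from nltk.stem.porter import PorterStemmer

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # placeholder list
stemmer = PorterStemmer()

def tokenize_and_stem(query):
    """Lowercase, split into word tokens, drop stopwords, then stem."""
    tokens = re.findall(r"\w+", query.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

def binary_title_score(query_terms, title_index):
    """Binary title score: number of distinct query terms that appear in the title."""
    scores = Counter()
    for term in set(query_terms):
        for doc_id, _tf in title_index.get(term, []):
            scores[doc_id] += 1
    return scores

def bm25_body_score(query_terms, body_index, doc_len, avg_dl, n_docs, k1=1.5, b=0.75):
    """Standard BM25 over the body index."""
    scores = defaultdict(float)
    for term in query_terms:
        posting = body_index.get(term, [])
        df = len(posting)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in posting:
            denom = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_dl)
            scores[doc_id] += idf * tf * (k1 + 1) / denom
    return scores
```

The top documents from each method can then be taken from the returned score dictionaries, e.g. with heapq.nlargest.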
- Ranking:
- Merge results of both methods
- Incorporate PageRank and page view scores
- Reorder based on the joint score
- Return the top-ranking documents (see the sketch below)
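A sketch of how the two result lists might be merged with the PageRank and page view signals into a joint score; the weights and the log scaling are assumptions, not the project's tuned values.

```python
import math

def merge_and_rank(title_scores, bm25_scores, page_rank, page_views,
                   w_title=0.3, w_body=0.5, w_pr=0.1, w_pv=0.1, top_k=100):
    """Combine per-method scores and popularity signals into one joint score."""
    candidates = set(title_scores) | set(bm25_scores)
    joint = {}
    for doc_id in candidates:
        joint[doc_id] = (
            w_title * title_scores.get(doc_id, 0.0)
            + w_body * bm25_scores.get(doc_id, 0.0)
            + w_pr * math.log1p(page_rank.get(doc_id, 0.0))
            + w_pv * math.log1p(page_views.get(doc_id, 0))
        )
    # Reorder by the joint score and return the top-ranking documents
    return sorted(joint.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```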
- Evaluation:
- Manual evaluation: Do results represent good retrieval?
- Optimize for MAP@40 (see the sketch below)
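For reference, a sketch of MAP@40 as it is commonly defined, assuming a ground-truth set of relevant doc ids per query; the exact normalisation used in the project may differ.

```python
def average_precision_at_k(retrieved, relevant, k=40):
    """Average precision over the top-k retrieved documents (one common variant)."""
    hits, precision_sum = 0, 0.0
    for i, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / i
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0

def map_at_k(results, ground_truth, k=40):
    """Mean average precision: average the per-query AP@k over all queries."""
    aps = [average_precision_at_k(results[q], set(ground_truth[q]), k)
           for q in ground_truth]
    return sum(aps) / len(aps)
```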
- Notable features:
- Query expansion using Word2Vec (sketched after this list)
- Parallelization
- User-based: the system can improve over time
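A sketch of the Word2Vec-based query expansion, assuming the word-vector index can be loaded as gensim KeyedVectors; the file name, neighbour count and similarity cutoff are illustrative.

```python
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load("word_vectors.kv")  # illustrative path to the word-vector index

def expand_query(query_terms, topn=3, min_sim=0.6):
    """Add near neighbours of each query term above a similarity cutoff."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in word_vectors:
            continue
        for neighbour, sim in word_vectors.most_similar(term, topn=topn):
            if sim >= min_sim and neighbour not in expanded:
                expanded.append(neighbour)
    return expanded
```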
The main search engine is based on the following logic: