An information retrieval search engine built on Wikipedia for our Information Retrieval class (2023).
This project is built on the Wikipedia corpus. The preprocessed data is available at: https://console.cloud.google.com/storage/browser/wiki_preproccess
The processed indices are available at: https://console.cloud.google.com/storage/browser/wiki_irt_data
To run the engine:
- Follow the instructions in run_frontend_in_gcp.sh to start an instance on GCP
- Upload the search engine files to the instance
- Run: python3 search_frontend.py
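Once the frontend is running, queries can be issued over HTTP. The snippet below is only a sketch: it assumes the frontend exposes a GET `/search` endpoint on port 8080 that accepts a `query` parameter and returns JSON; adjust the host, port and route to your deployment.

```python
# Illustrative only: the endpoint name, port and response format are assumptions.
import requests

EXTERNAL_IP = "YOUR_INSTANCE_EXTERNAL_IP"  # external IP of the GCP instance
resp = requests.get(
    f"http://{EXTERNAL_IP}:8080/search",       # assumed route served by search_frontend.py
    params={"query": "information retrieval"},
    timeout=30,
)
print(resp.json())                             # assumed list of retrieved documents
```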
- Data:
- Title, body and anchor inverted indices, in both stemmed and unstemmed ("clean") versions (assumed shapes sketched below)
- Document title, document length, PageRank and page view indices
- Word vector index (used for query expansion)
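The indices above are stored in the GCS buckets linked at the top. The on-disk format is not described here, so the sketch below only illustrates the in-memory shapes assumed throughout this README; all field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class EngineData:
    # Inverted indices: term -> [(doc_id, term_frequency), ...]
    title_index: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    body_index: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    anchor_index: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    # Per-document metadata
    doc_title: Dict[int, str] = field(default_factory=dict)
    doc_len: Dict[int, int] = field(default_factory=dict)
    page_rank: Dict[int, float] = field(default_factory=dict)
    page_views: Dict[int, int] = field(default_factory=dict)
```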
- Retrieval:
- Tokenize and stem query
- Retrieve posting lists
- Expand the query (using Word2Vec; see Notable features below)
- Calculate a binary score for the title index
- Calculate a BM25 score for the body text
- Retrieve the top-scoring documents from both methods (see the sketch below)
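A minimal sketch of the retrieval step, assuming posting lists shaped as above (term -> [(doc_id, tf), ...]) and per-document lengths; the stopword list, stemmer and BM25 parameters are illustrative, not the project's exact configuration.

```python
import math
import re
from collections import Counter, defaultdict

from nltk.stem.porter import PorterStemmer

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # placeholder list
stemmer = PorterStemmer()

def tokenize_and_stem(query):
    """Lowercase, split into word tokens, drop stopwords, then stem."""
    tokens = re.findall(r"\w+", query.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

def binary_title_score(query_terms, title_index):
    """Binary title score: number of distinct query terms that appear in the title."""
    scores = Counter()
    for term in set(query_terms):
        for doc_id, _tf in title_index.get(term, []):
            scores[doc_id] += 1
    return scores

def bm25_body_score(query_terms, body_index, doc_len, avg_dl, n_docs, k1=1.5, b=0.75):
    """Standard BM25 over the body index."""
    scores = defaultdict(float)
    for term in query_terms:
        posting = body_index.get(term, [])
        df = len(posting)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in posting:
            denom = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_dl)
            scores[doc_id] += idf * tf * (k1 + 1) / denom
    return scores
```

The top documents from each method can then be taken from the returned score dictionaries, e.g. with heapq.nlargest.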
- Ranking:
- Merge results of both methods
- Incorporate PageRank and page view scores
- Reorder based on the joint score
- Return the top-ranking documents (see the sketch below)
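A sketch of how the two result lists might be merged with the PageRank and page view signals into a joint score; the weights and the log scaling are assumptions, not the project's tuned values.

```python
import math

def merge_and_rank(title_scores, bm25_scores, page_rank, page_views,
                   w_title=0.3, w_body=0.5, w_pr=0.1, w_pv=0.1, top_k=100):
    """Combine per-method scores and popularity signals into one joint score."""
    candidates = set(title_scores) | set(bm25_scores)
    joint = {}
    for doc_id in candidates:
        joint[doc_id] = (
            w_title * title_scores.get(doc_id, 0.0)
            + w_body * bm25_scores.get(doc_id, 0.0)
            + w_pr * math.log1p(page_rank.get(doc_id, 0.0))
            + w_pv * math.log1p(page_views.get(doc_id, 0))
        )
    # Reorder by the joint score and return the top-ranking documents
    return sorted(joint.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```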
- Evaluation:
- Manual evaluation: Do results represent good retrieval?
- Optimize for MAP@40 (see the sketch below)
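For reference, a sketch of MAP@40 as it is commonly defined, assuming a ground-truth set of relevant doc ids per query; the exact normalisation used in the project may differ.

```python
def average_precision_at_k(retrieved, relevant, k=40):
    """Average precision over the top-k retrieved documents (one common variant)."""
    hits, precision_sum = 0, 0.0
    for i, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / i
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0

def map_at_k(results, ground_truth, k=40):
    """Mean average precision: average the per-query AP@k over all queries."""
    aps = [average_precision_at_k(results[q], set(ground_truth[q]), k)
           for q in ground_truth]
    return sum(aps) / len(aps)
```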
- Notable features:
- Query expansion using Word2Vec (sketched after this list)
- Parallelization
- User-based: the system can improve over time
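A sketch of the Word2Vec-based query expansion, assuming the word-vector index can be loaded as gensim KeyedVectors; the file name, neighbour count and similarity cutoff are illustrative.

```python
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load("word_vectors.kv")  # illustrative path to the word-vector index

def expand_query(query_terms, topn=3, min_sim=0.6):
    """Add near neighbours of each query term above a similarity cutoff."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in word_vectors:
            continue
        for neighbour, sim in word_vectors.most_similar(term, topn=topn):
            if sim >= min_sim and neighbour not in expanded:
                expanded.append(neighbour)
    return expanded
```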
The main search engine is based on the following logic: