GitHub - sharmilathirumalai/TF-IDF: IR implemented by using TF-IDF method

TF-IDF

Retrieves the top-ranked document from the Reuters dataset for a given query using the TF-IDF information retrieval (IR) method.

Data Extraction

The data was extracted following the ETL (Extract, Transform, Load) method.

Extract - Each article in the “SGM” files was extracted into a separate document using a custom parser. Total number of documents extracted: 19,043 articles.
Transform - Each document is converted into the form

{
ID: <NewID>
Date: <Article Date>
Title: <Article Title>
Content: <Article Body>
}

by scraping the extracted article string.

Load - Finally, the transformed data is loaded into MongoDB.
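The Extract and Transform steps can be sketched roughly as follows. This is a minimal illustration, not the repository's custom parser: the class `SgmParserSketch` and its methods are hypothetical names, and it assumes the standard Reuters-21578 layout in which each article is a `<REUTERS ... NEWID="n">` element containing `<DATE>`, `<TITLE>` and `<BODY>` tags. The Load step is left out because it depends on the MongoDB driver setup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the parser used in this repository.
class SgmParserSketch {
    // One <REUTERS ...>...</REUTERS> element per article.
    // DOTALL lets '.' match newlines so an article can span many lines.
    private static final Pattern ARTICLE = Pattern.compile(
        "<REUTERS[^>]*NEWID=\"(\\d+)\"[^>]*>(.*?)</REUTERS>", Pattern.DOTALL);

    // Returns the trimmed text inside the first <tag>...</tag> pair, or "" if the tag is absent.
    static String field(String xml, String tag) {
        Matcher m = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">", Pattern.DOTALL).matcher(xml);
        return m.find() ? m.group(1).trim() : "";
    }

    // Extract: split the file into articles. Transform: map each one to ID/Date/Title/Content.
    static List<Map<String, String>> parse(String sgm) {
        List<Map<String, String>> docs = new ArrayList<>();
        Matcher m = ARTICLE.matcher(sgm);
        while (m.find()) {
            String inner = m.group(2);
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("ID", m.group(1));
            doc.put("Date", field(inner, "DATE"));
            doc.put("Title", field(inner, "TITLE"));
            doc.put("Content", field(inner, "BODY"));
            docs.add(doc);
        }
        return docs;
    }
}
```

Calling `SgmParserSketch.parse(fileContents)` on the text of one SGM file would return one map per article, ready to be converted into whatever document type the database layer expects.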

Data Analysis - IR

The TF-IDF score of each document is computed by adding the TF-IDF scores of its title and content attributes. The cosine similarity and distance between each document and the query are then calculated. With this approach, the top-ranked document for the query canada is:

{
"ID": 11751,
"Date": "Tue Mar 31 18:38:11 AST 1987",
"Title": "COMINCO &lt;CLT> SELLS STAKE IN CANADA METAL",
"Article": "Cominco Ltd said itsold its 50 pct stake in Canada Metal Co Ltd to Canada Metalsenior management for an undisclosed sum.    Cominco said the sale was part of its previously announcedpolicy of divesting non-core businesses. Canada Metal is a Toronto-based producer of lead alloys andengineered lead products. Canada Metal production figures were not immediatelyavailable."
}
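The ranking described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: `TfIdfSketch` and its methods are hypothetical names, tokens are lowercased and split on non-alphanumeric characters, term frequency is the raw count, and idf(t) = ln(N / df(t)). Title and content are concatenated before counting, which with a shared idf is equivalent to adding their per-term TF-IDF scores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; weighting and tokenization are assumptions.
class TfIdfSketch {
    // Raw term counts after lowercasing and splitting on non-alphanumerics.
    static Map<String, Double> termCounts(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tf.merge(t, 1.0, Double::sum);
        }
        return tf;
    }

    // idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t.
    static Map<String, Double> idf(List<Map<String, Double>> docs) {
        Map<String, Double> df = new HashMap<>();
        for (Map<String, Double> d : docs)
            for (String t : d.keySet()) df.merge(t, 1.0, Double::sum);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : df.entrySet())
            out.put(e.getKey(), Math.log(docs.size() / e.getValue()));
        return out;
    }

    // Multiplies each term count by its idf; terms unseen in the corpus get weight 0.
    static Map<String, Double> weigh(Map<String, Double> tf, Map<String, Double> idfWeights) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet())
            w.put(e.getKey(), e.getValue() * idfWeights.getOrDefault(e.getKey(), 0.0));
        return w;
    }

    // Cosine similarity of two sparse vectors; returns 0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Each corpus entry is {title, content}; returns the index of the best match, or -1 if empty.
    static int topDocument(List<String[]> corpus, String query) {
        List<Map<String, Double>> tfs = new ArrayList<>();
        for (String[] d : corpus) tfs.add(termCounts(d[0] + " " + d[1]));
        Map<String, Double> idfWeights = idf(tfs);
        Map<String, Double> q = weigh(termCounts(query), idfWeights);
        int best = -1;
        double bestScore = -1;
        for (int i = 0; i < tfs.size(); i++) {
            double s = cosine(weigh(tfs.get(i), idfWeights), q);
            if (s > bestScore) { bestScore = s; best = i; }
        }
        return best;
    }
}
```

Cosine distance, if needed, is `1 - cosine(a, b)`. Note that with this idf definition a term occurring in every document receives weight 0, so some implementations add smoothing; whether the repository does so is not stated here.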
