FirstStoryDetectionTwitter

#Dataset : https://www.daviddlewis.com/resources/testcollections/reuters21578/
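The pre-processing step listed under Steps extracts stories from this collection with a regex and yields them one at a time. A minimal sketch of that idea follows, assuming the files are the collection's SGML files, where each story sits in a `<REUTERS ...>` element with `<DATE>`, `<TITLE>` and `<BODY>` children. The function name `iter_stories` and the choice of fields are illustrative and not taken from this repository.

```python
import re

# One <REUTERS ...>...</REUTERS> element per story (DOTALL lets "." span newlines).
STORY_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL)

# Fields pulled out of each story; this selection is an assumption for illustration.
FIELD_RES = {
    name: re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    for name in ("DATE", "TITLE", "BODY")
}


def iter_stories(sgml_text):
    """Yield one dict per story, with empty strings for missing fields."""
    for match in STORY_RE.finditer(sgml_text):
        block = match.group(1)
        story = {}
        for name, pattern in FIELD_RES.items():
            found = pattern.search(block)
            story[name.lower()] = found.group(1).strip() if found else ""
        yield story
```

Because `iter_stories` is a generator, stories are produced lazily, so later stages can consume them without building a full list first.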

----------------------- How to Run ------------------------------

Make sure you have python3 and the following modules installed: NumPy, scikit-learn (sklearn), Matplotlib, Pandas, and nltk (with the stopwords corpus downloaded).
Use git checkout to switch to the main branch (which has LSH and community detection) or to the clustering branch (which has the clustering method).
Open constants.py and set the values you would like.
Run main.py.
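Before running, it can help to confirm that the modules listed above are importable. The helper below is a small sketch for that purpose; `missing_modules` and `REQUIRED` are hypothetical names and not part of this repository.

```python
import importlib.util

# Import names of the modules listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "sklearn", "matplotlib", "pandas", "nltk"]


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` returns an empty list when everything is installed. The nltk stop-word corpus is a separate download, which `nltk.download("stopwords")` fetches.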

#Steps:

Pre-processing the Reuters dataset: write a regex to parse and extract stories in memory. Return an iterator (yield) over the extracted stories.
Character shingles (length 7): a function that takes a string and returns its shingle ids.
Count and calculate tf and idf scores: i) use a Python dictionary, adding keys as they are seen.
Get the tf-idf scores of those shingles.
Throw away stop words (idf > 0.9) and map each remaining key in the dictionary to an index (key -> index).
Generate 100 hash functions of the form (k * x + r) % c, with gcd(k, c) == 1.
Get the signature matrix for each hash function. i) The value of each element of the matrix is bool(tf-idf score > PARAM_threshold_tf_idf).
LSH:
i) Split the signature matrix using PARAM_number_of_bands and PARAM_number_of_rows_in_each_band.
ii) Generate K bucket hash functions: f(xor of the band per document) mod K.
iii) Confirm candidate pairs with cosine similarity: O(Summation(Candidate pairs C 2)).
iv) Key parameters: PARAM_number_bands_matched and PARAM_threshold_cosine_similarity. Use them to remove the false positives.
v) Pick the stories with the smallest timestamp -> first story and family detected.
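The steps above, from shingling through banding, can be sketched as follows. This is an illustration under stated assumptions and not the code in this repository: the modulus c is taken to be the prime 2^31 - 1, f is taken to be the identity, the boolean tf-idf matrix is treated as the input that gets min-hashed, and the stop-word filter is left out. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import combinations

C = 2_147_483_647  # assumed modulus c (prime, so gcd(k, c) == 1 for 0 < k < c)


def shingle_counts(text, n=7):
    """Count the character n-grams (shingles) of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def active_shingles(docs, threshold):
    """Per document, the set of shingle indices with tf-idf > threshold."""
    counts = [shingle_counts(d) for d in docs]
    df = Counter(s for c in counts for s in c)        # document frequency
    index = {s: i for i, s in enumerate(sorted(df))}  # key -> index
    rows = []
    for c in counts:
        total = sum(c.values()) or 1
        rows.append({index[s] for s, k in c.items()
                     if (k / total) * math.log(len(docs) / df[s]) > threshold})
    return rows


def make_hashes(num, seed=0):
    """Draw (k, r) pairs for h(x) = (k * x + r) % C with gcd(k, C) == 1."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num:
        k, r = rng.randrange(1, C), rng.randrange(C)
        if math.gcd(k, C) == 1:
            pairs.append((k, r))
    return pairs


def signature(row, hashes):
    """MinHash signature: minimum hash value over the active shingle indices."""
    return [min(((k * x + r) % C for x in row), default=C) for k, r in hashes]


def candidate_pairs(signatures, bands, rows_per_band, num_buckets):
    """Bucket each band by (XOR of its rows) mod num_buckets.

    Returns a Counter mapping (doc_i, doc_j) to the number of bands shared.
    """
    matches = Counter()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            key = 0
            for value in sig[b * rows_per_band:(b + 1) * rows_per_band]:
                key ^= value
            buckets[key % num_buckets].append(doc_id)
        for members in buckets.values():
            for pair in combinations(members, 2):
                matches[pair] += 1
    return matches
```

With 100 hash functions, a split such as 20 bands of 5 rows uses the whole signature. Two documents with identical active shingles land in the same bucket in every band, while less similar pairs share fewer bands, which is what a threshold like PARAM_number_bands_matched can then filter on.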

LSH split:
Ankit: false positive removal, candidate pair checking and final decision.
Raaghav: dynamic bucketing using (Method - ii).
Siddhartha: static bucketing + false positive removal using the similarity score of all pairs per bucket.
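The false positive removal and final decision can be sketched as below. It assumes each document has a sparse tf-idf vector stored as a dict and that candidate pairs arrive with a count of matched bands. Grouping confirmed pairs into families with union-find is an assumption made for this sketch, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def confirm_pairs(matches, vectors, min_bands, min_cosine):
    """Keep pairs that matched in enough bands and pass the cosine check."""
    return [pair for pair, n in matches.items()
            if n >= min_bands
            and cosine(vectors[pair[0]], vectors[pair[1]]) >= min_cosine]


def first_stories(pairs, timestamps):
    """Group confirmed pairs into families and pick the earliest story of each.

    Returns {first_story_id: sorted list of family member ids}.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    families = {}
    for doc in list(parent):
        families.setdefault(find(doc), []).append(doc)
    return {min(fam, key=lambda d: timestamps[d]): sorted(fam)
            for fam in families.values()}
```

Here `min_bands` and `min_cosine` play the roles of PARAM_number_bands_matched and PARAM_threshold_cosine_similarity, and `timestamps` can be any list or dict indexed by document id.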

#Extension:

Extend dynamic method without ML Abhinav:
Extend dynamic method using ML (Transfer Learning)
