CricketSemantics is an NLP project that includes:
- A cricket commentary dataset and a scraping engine that collects the data from Cricbuzz
- A semantic search engine built using SentenceTransformers and FAISS
- Doc2Vec sentence embeddings and KMeans clustering (n=5)
- Scraped data from Cricbuzz
- Used Scrapy to scrape the data
- Data here
- Full blog post on how to do this here
- Download this repo to your local system
- Then install the dependencies:
pip install -r requirements.txt
- Then, from your command line or terminal, run:
python3 semanticSearchCricket.py
- The code is in this Kaggle notebook
- Created Doc2Vec embeddings for the cricket commentary dataset.
- Performed Principal Component Analysis (PCA) on the embedding vectors to reduce them from 100 dimensions to 2.
- Used KMeans clustering to group the reduced embeddings into 5 clusters.
- Used matplotlib to create a scatter plot to visualize the clusters.
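The steps above can be sketched roughly as follows. This is not the notebook's code: the function names, the whitespace tokenization, the `epochs` value, and the random stand-in vectors are assumptions, and the Doc2Vec part assumes the gensim 4.x API.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def embed_with_doc2vec(sentences, vector_size=100, epochs=20):
    """Train Doc2Vec on the sentences and return one vector per sentence."""
    # Imported lazily so the PCA/KMeans step can run without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Naive whitespace tokenization (an assumption, not the notebook's method).
    docs = [TaggedDocument(words=s.lower().split(), tags=[i])
            for i, s in enumerate(sentences)]
    model = Doc2Vec(docs, vector_size=vector_size, min_count=1, epochs=epochs)
    return np.vstack([model.dv[i] for i in range(len(docs))])


def reduce_and_cluster(vectors, n_clusters=5, seed=0):
    """Project vectors to 2-D with PCA, then assign each point a KMeans cluster."""
    points = PCA(n_components=2, random_state=seed).fit_transform(vectors)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(points)
    return points, labels


def plot_clusters(points, labels, path="clusters.png"):
    """Save a scatter plot of the 2-D points, colored by cluster label."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    fig.savefig(path)
    plt.close(fig)


# Random stand-in for 100-dimensional Doc2Vec vectors, so the sketch runs
# without the dataset; replace with embed_with_doc2vec(your_sentences).
rng = np.random.default_rng(0)
stand_in_vectors = rng.normal(size=(200, 100))
points, labels = reduce_and_cluster(stand_in_vectors)
print(points.shape, sorted(set(labels.tolist())))
```

Clustering after the PCA step, as described above, keeps the clusters consistent with what the 2-D scatter plot shows; clustering the full 100-dimensional vectors and only plotting in 2-D is an alternative that may separate the data differently.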