This project is a semantic search engine for recent cricket data, built using sentence transformers for the embeddings.

arjunprakash027/CricketSemantics

CricketSemantics

CricketSemantics is an NLP project that includes:

  • A cricket commentary dataset and a scraping engine that collects the data from Cricbuzz
  • A semantic search engine built using SentenceTransformers and FAISS
  • Doc2Vec sentence embeddings and KMeans clustering (n=5)

Data Scraping

  • Scraped data from Cricbuzz
  • Used Scrapy to scrape the data
  • Data here
  • Full blog post on how to do this here

Commentary Search Engine

How to run locally

  • Download this repo to your local system.

  • Open your command line or terminal in the repo folder and install the dependencies:

pip install -r requirements.txt

  • Then start the search engine:

python3 semanticSearchCricket.py
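
For orientation, the core idea behind the search can be sketched in plain Python: every commentary line is represented by an embedding vector, the query is embedded the same way, and the lines whose vectors are most similar to the query are returned. This sketch is not the code in semanticSearchCricket.py. It uses hand-made 3-dimensional vectors in place of SentenceTransformers embeddings and a brute-force cosine-similarity loop in place of a FAISS index. The commentary lines, the vectors, and the `search` helper are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to the query."""
    q = normalize(query_vec)
    scores = []
    for i, vec in enumerate(corpus_vecs):
        v = normalize(vec)
        scores.append((i, sum(a * b for a, b in zip(q, v))))
    # Highest cosine similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical commentary lines with made-up 3-dimensional "embeddings".
commentary = [
    "Driven through the covers for four",
    "Bowled him! Middle stump out of the ground",
    "Lofted over long-on for six",
]
embeddings = [
    [0.9, 0.1, 0.2],
    [0.1, 0.9, 0.1],
    [0.8, 0.0, 0.6],
]
# Stands in for an encoded query such as "boundary shot" (assumed values).
query_embedding = [1.0, 0.0, 0.3]

results = search(query_embedding, embeddings, k=2)
for idx, score in results:
    print(f"{score:.3f}  {commentary[idx]}")
```

With these toy vectors the two boundary-scoring lines rank above the wicket line. In the project itself the vectors come from SentenceTransformers and the lookup from FAISS, which performs the same kind of nearest-neighbour ranking efficiently on a large corpus.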

Examples

image

Doc2Vec model and KMeans clustering

  • The code is in this Kaggle notebook

  • Created Doc2Vec embeddings on the cricket commentary dataset.

  • Performed Principal Component Analysis (PCA) on the embedding vectors to reduce them from 100 dimensions to 2.

  • Used KMeans to group the reduced embeddings into 5 distinct clusters.

  • Used matplotlib to create a scatter plot to visualize the clusters.
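
To illustrate the clustering step only, here is a minimal pure-Python sketch of KMeans (Lloyd's algorithm) on 2-D points. It is not the notebook's code. The points are invented and stand in for Doc2Vec vectors already reduced to 2 dimensions by PCA, and the `kmeans` helper is a hypothetical stand-in for a library implementation such as scikit-learn's `KMeans`.

```python
def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    # Naive initialisation: use the first k points as starting centroids.
    # (Library implementations typically use smarter schemes such as k-means++.)
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


# Invented 2-D points forming five loose groups (assumed data, not real embeddings).
points = [
    (0.0, 0.1), (10.0, 0.2), (0.1, 10.0), (10.1, 9.9), (5.0, 5.1),
    (0.2, -0.1), (9.8, -0.2), (-0.1, 9.8), (9.9, 10.2), (5.1, 4.9),
]

labels, centroids = kmeans(points, k=5)
print(labels)
```

Each point receives a cluster label from 0 to 4. Those labels are what a scatter plot would use to colour the points.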

Examples

image
