Predictive Analytics with Spark

Project Description

The objective of the project is to implement a movie genre prediction model using Apache Spark.
The dataset provided here contains information about movies.
train.csv has plot summaries of around 31K movies along with their genres. You will use this to train your predictive analytics model.
test.csv has only plot summaries. You will predict the genres of these movies.
Predicting the genre is essentially a multi-label classification problem: a movie can have multiple genres associated with it, and your model should be able to predict all the genres associated with a movie.
The mapping of each genre to its index in the prediction string should be generated in .csv format. For example, the presence of the genre ‘Drama’ is indicated by a ‘1’ in the first position of the prediction string, and its absence is indicated by a ‘0’ in the first position.
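
The prediction-string encoding described above can be illustrated with a minimal pure-Python sketch. The mapping shown here is hypothetical apart from ‘Drama’ at the first position; the real mapping would come from mapping.csv. The sketch also assumes that the prediction string has no separator between positions, which the description does not specify.

```python
# Hypothetical genre-to-index mapping. Only 'Drama' at position 0 is stated
# in the project description; the other entries are illustrative assumptions.
GENRE_INDEX = {"Drama": 0, "Comedy": 1, "Thriller": 2, "Romance": 3}


def encode_genres(genres, genre_index=GENRE_INDEX):
    """Turn a list of genre names into a string of '0'/'1' flags."""
    bits = ["0"] * len(genre_index)
    for genre in genres:
        if genre in genre_index:  # genres missing from the mapping are ignored
            bits[genre_index[genre]] = "1"
    return "".join(bits)


def decode_prediction(prediction, genre_index=GENRE_INDEX):
    """Turn a prediction string back into genre names, in index order."""
    index_to_genre = {i: g for g, i in genre_index.items()}
    return [index_to_genre[i] for i, bit in enumerate(prediction) if bit == "1"]


encoded = encode_genres(["Thriller", "Drama"])
decoded = decode_prediction(encoded)
```

Because every genre gets its own position, a movie with several genres simply has several ‘1’ flags, which is what makes the task multi-label rather than multi-class.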

Reach out to me

Basic Model (term-document matrix)

Analyze the data and preprocess it if needed.
Create a machine learning model in Spark (using any algorithm) that uses the information in the training set to predict the genres associated with a movie.
Create a term-document matrix from the plot summaries and use its rows as feature vectors for the machine learning model.
CountVectorizer Demo
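
To show what a term-document matrix of plot summaries looks like, here is a small pure-Python sketch that needs no Spark session. The tokenizer, function names, and sample plots are illustrative assumptions, not the notebook's code. In Spark, `pyspark.ml.feature.CountVectorizer` produces comparable count vectors in sparse form, although its vocabulary ordering differs from the alphabetical order used here.

```python
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer; real preprocessing would be more careful."""
    return text.lower().split()


def build_vocabulary(docs):
    """Collect every distinct term, sorted so column order is deterministic."""
    return sorted({token for doc in docs for token in tokenize(doc)})


def term_document_matrix(docs, vocab):
    """Return one row of raw term counts per document (columns follow vocab)."""
    column = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(tokenize(doc)).items():
            if term in column:
                row[column[term]] = count
        rows.append(row)
    return rows


# Made-up plot summaries used only for illustration
plots = ["a detective hunts a killer", "a killer robot falls in love"]
vocab = build_vocabulary(plots)
matrix = term_document_matrix(plots, vocab)
```

Each row of `matrix` is the feature vector for one movie. With a real vocabulary most entries are zero, which is why Spark stores such vectors sparsely.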

TF-IDF to improve the model

Focusing on the plot summary of each movie, implement a feature engineering technique based on Term Frequency-Inverse Document Frequency (TF-IDF) to improve the performance of the model.
Ideally, this model should outperform the one from the previous step.
TF-IDF Demo
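
The weighting itself can be sketched in a few lines of pure Python. This sketch assumes the smoothed formula documented for Spark MLlib's `IDF`, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


def tfidf(doc, idf):
    """Scale raw term counts by IDF; terms unseen during fitting are dropped."""
    return {term: count * idf[term] for term, count in Counter(doc).items() if term in idf}


# Made-up tokenized plot summaries used only for illustration
docs = [
    ["a", "detective", "hunts", "a", "killer"],
    ["a", "killer", "robot"],
    ["a", "love", "story"],
]
idf = idf_weights(docs)
weights = tfidf(docs[0], idf)
```

A term that appears in every document, such as "a" here, receives weight zero, while rarer terms such as "detective" keep a positive weight. This is the intuition behind expecting TF-IDF features to be more informative than raw counts.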

Feature Engineering (Word2vec)

Implement any one modern text-based feature engineering method to improve the performance of the model.
Custom feature engineering is deemed successful only if the model performs better than the model from part 2.
Word2Vec Demo
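
With Word2Vec, each word is mapped to a dense vector, and a document is commonly represented by averaging the vectors of its words. The sketch below illustrates only that averaging step, using hypothetical two-dimensional embeddings instead of a trained model. Skipping out-of-vocabulary tokens is a simplification made here, and Spark's `Word2Vec` may handle such tokens differently.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embeddings of the known tokens in a document."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings; a real model would learn these from train.csv
toy_embeddings = {
    "killer": [1.0, 0.0],
    "robot": [0.0, 1.0],
    "love": [0.5, 0.5],
}

doc_vec = document_vector(["killer", "robot", "unseen"], toy_embeddings, dim=2)
empty_vec = document_vector(["unseen"], toy_embeddings, dim=2)
```

Unlike count or TF-IDF vectors, the resulting feature vector has a fixed, small dimension regardless of vocabulary size.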

Execution

Upload train.csv, test.csv, and the Jupyter notebooks to Google Colab.
After each notebook has run for some time (about 50 minutes), it generates a .csv file containing the predictions for the given test data (test.csv).

Folders and files

part1
part2
part3
Part-1.gif
Part-1.mov
Part-2.gif
Part-2.mov
Part-3.gif
Part-3.mov
Pyspark Genre Predication CountVectorizer.ipynb
Pyspark Genre Predication TF-IDF.ipynb
Pyspark Genre Predication Word2Vec.ipynb
README.md
Report.docx
mapping.csv
predictions_part1.csv
predictions_part2.csv
predictions_part3.csv
test.csv
train.csv

prabha1729/Multi-Label-Movie-Genre-Prediction
