Sentiment Analysis with Spark Streaming

Using Spark Streaming to stream a corpus of tweets and their corresponding sentiment labels, this repository presents a study on training and evaluating several online (incremental) classification and clustering models that learn from batches of data streamed over time. Various performance metrics are analysed against varying hyperparameter values and streaming batch sizes, and the resulting trends are plotted for each combination of variables under analysis.

Machine Learning Pipeline

Current Repository Structure

.
├── Batch_1000
├── Batch_2000
├── Batch_2500
├── Batch_3000
├── Batch_4000
├── Batch_5000
├── classification_models
│   ├── logistic_regression.py
│   ├── multinomial_nb.py
│   └── passive_aggressive.py
├── clustering_models
│   ├── kmeans_clustering.py
│   └── birch_clustering.py
├── preprocessing
│   └── preprocess.py
├── .gitignore
├── LICENSE
├── README.md
├── Sentiment Analysis Using Streaming Spark.pdf
├── batch_accuracy_MNB
├── batch_accuracy_PAC
├── batch_accuracy_SGD
├── batch_test_accuracies.py
├── hyper_test_accuracies.py
├── requirements.txt
├── test.py
└── train.py

About the Dataset

Two CSV files, one each for training (1520k records) and testing (80k records).
Each record has two columns: one for the sentiment and the other for the tweet.
Sentiment is either 0 (negative) or 4 (positive).
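
As a rough illustration of the record layout described above, the following sketch reads two-column rows and maps the 0/4 sentiment values to 0/1 class labels. The column order (sentiment first, tweet second), the absence of a header row, the 0/1 target encoding, and the names `LABELS` and `read_records` are assumptions made for this example; they are not taken from the repository's code.

```python
import csv
import io

# Assumed target encoding: raw sentiment 0 (negative) -> 0, 4 (positive) -> 1.
LABELS = {"0": 0, "4": 1}

def read_records(handle):
    """Yield (label, tweet) pairs, skipping rows that do not match the layout."""
    for row in csv.reader(handle):
        if len(row) != 2 or row[0] not in LABELS:
            continue  # e.g. a header line or a malformed record
        yield LABELS[row[0]], row[1]

# Two in-memory rows standing in for a real CSV file.
sample = io.StringIO('0,"I hate mondays, really"\n4,"Loving the new album"\n')
records = list(read_records(sample))
print(records)
```

Quoted fields are handled by the `csv` module, so commas inside a tweet do not split the record.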

Task Workflow

Streaming the data with Spark Streaming.
Cleaning and preprocessing each RDD of input data.
Online/incremental training of the classification models.
Online/incremental training of the clustering models.
Testing the classification and clustering models against cleaned and preprocessed RDDs of the test data stream.
Plotting graphs for analysis and evaluation of the models.
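
The core of the workflow above (clean each batch, then update a model incrementally) can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: a generator stands in for the Spark Streaming micro-batches, the cleaning rules are assumed, and the classifier is a from-scratch passive-aggressive (PA-I) update rather than whatever `passive_aggressive.py` uses. The names `clean`, `batches`, and `OnlinePassiveAggressive` are hypothetical.

```python
import re
from itertools import islice

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
NON_LETTERS = re.compile(r"[^a-z\s]")

def clean(tweet):
    """Lowercase, drop URLs, mentions and non-letters, then split into tokens."""
    text = URL_OR_MENTION.sub(" ", tweet.lower())
    return NON_LETTERS.sub(" ", text).split()

def batches(records, size):
    """Yield lists of up to `size` records, mimicking streamed micro-batches."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

class OnlinePassiveAggressive:
    """Binary PA-I classifier over bag-of-words counts; labels are -1 or +1."""

    def __init__(self, C=1.0):
        self.C = C   # cap on the per-example step size
        self.w = {}  # token -> weight

    @staticmethod
    def _counts(tokens):
        counts = {}
        for t in tokens:
            counts[t] = counts.get(t, 0.0) + 1.0
        return counts

    def score(self, tokens):
        return sum(self.w.get(t, 0.0) * v for t, v in self._counts(tokens).items())

    def predict(self, tokens):
        return 1 if self.score(tokens) >= 0 else -1

    def partial_fit(self, batch):
        """Update the weights from one batch of (tokens, label) pairs."""
        for tokens, y in batch:
            x = self._counts(tokens)
            norm_sq = sum(v * v for v in x.values())
            if norm_sq == 0:
                continue  # nothing left after cleaning
            loss = max(0.0, 1.0 - y * self.score(tokens))  # hinge loss
            tau = min(self.C, loss / norm_sq)              # PA-I step size
            for t, v in x.items():
                self.w[t] = self.w.get(t, 0.0) + tau * y * v

# Toy stream using the dataset's 0/4 sentiment labels.
stream = [
    ("I love this phone, it is great! http://t.co/x", 4),
    ("@bob this is awful, I hate it", 0),
    ("great day, love it", 4),
    ("hate this awful weather", 0),
]

model = OnlinePassiveAggressive()
for batch in batches(stream, 2):
    prepared = [(clean(text), 1 if label == 4 else -1) for text, label in batch]
    model.partial_fit(prepared)

print(model.predict(clean("love this great thing")))
print(model.predict(clean("awful, hate")))
```

Because the model only ever sees one batch at a time, the batch size passed to `batches` plays the same role as the streaming batch sizes compared in the study.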

Big-Data-Course-Team/Machine-Learning-with-Spark-Streaming