omarpr/bigdata_p2

Big Data Analytics (CIIC 8995) Project II

Project II of Big Data Analytics (CIIC 8995) given by Dr. Manuel Rodríguez in the University of Puerto Rico, Mayagüez Campus.

A live example of the resulting web app is available at http://kvm_33.uprm.edu/p2/.

twitter_stream.py

Python Kafka producer that receives a sample stream of tweets from Twitter, keeps only the ones about Trump, and sends them to the Kafka server. A credentials file named twitter_credentials.json is required. An example of that file is included as twitter_credentials.sample.json.

python3 twitter_stream.py
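
The filter-and-forward step described above can be sketched as follows. This is a minimal illustration, not the code in twitter_stream.py: the keyword match on the `text` field, the function names, and the default topic name `trump` are assumptions. The `producer` argument is any object with a `send(topic, value)` method, which is the shape of `KafkaProducer.send` in the kafka-python package.

```python
import json

KEYWORD = "trump"  # assumed filter term, matched case-insensitively


def is_about_trump(tweet):
    """Return True when the tweet's text mentions the keyword."""
    return KEYWORD in tweet.get("text", "").lower()


def forward_matching(tweets, producer, topic="trump"):
    """Send each matching tweet to Kafka as UTF-8 JSON; return how many were sent."""
    sent = 0
    for tweet in tweets:
        if is_about_trump(tweet):
            # Kafka message values are bytes, so serialize the tweet first.
            producer.send(topic, json.dumps(tweet).encode("utf-8"))
            sent += 1
    return sent
```

With kafka-python installed, `producer` could be created as `KafkaProducer(bootstrap_servers="localhost:9092")`, where the broker address is a placeholder.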

p2-words.py

Python Kafka consumer built on Spark Streaming. It receives the JSON of tweets about Trump, splits each tweet into words, removes stop words, counts the remaining words (reduce), among other steps, and finally stores the results on HDFS to be analyzed later.

/opt/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1 p2-words.py
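
The split, filter, and count steps can be illustrated in plain Python, without Spark. This is only a sketch: the tokenizer regex and the tiny stop-word set are assumptions, and p2-words.py may tokenize differently. On a Spark DStream the same stages would roughly correspond to `flatMap`, `filter`, and `map` followed by `reduceByKey`.

```python
import re
from collections import Counter

# Illustrative stop-word set; the list used by the project is not shown here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "rt"}


def tweet_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def count_words(texts):
    """Count every remaining word across all tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(tweet_words(text))
    return counts
```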

p2-screennames.py

Python Kafka consumer built on Spark Streaming. It receives the JSON of tweets about Trump, extracts the screen name of each tweet's author, counts the occurrences of each screen name (reduce), among other steps, and finally stores the results on HDFS to be analyzed later.

/opt/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1 p2-screennames.py
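
A plain-Python sketch of the extract-and-count step follows. It assumes each message is a JSON-encoded tweet in the standard Twitter format, where the author's handle is at `user.screen_name`. The function names are hypothetical, and `functools.reduce` stands in for the Spark reduce step.

```python
import json
from functools import reduce


def screen_name(raw):
    """Parse one JSON-encoded tweet; return its author's screen name, or None."""
    try:
        return json.loads(raw)["user"]["screen_name"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or a message without a user object is skipped.
        return None


def add_one(counts, name):
    """Reducer: increment the running count for one screen name."""
    counts[name] = counts.get(name, 0) + 1
    return counts


def count_screen_names(raw_tweets):
    """Count how many tweets each screen name authored."""
    names = (screen_name(raw) for raw in raw_tweets)
    return reduce(add_one, (n for n in names if n is not None), {})
```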

p2lib.py

Library with the common functions for this project.

words-cron.py

Takes the files that p2-words.py stored in HDFS as input and generates the index and data files that the webapp uses to visualize the data.

screennames-cron.py

Takes the files that p2-screennames.py stored in HDFS as input and generates the index and data files that the webapp uses to visualize the data.
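
The aggregation these cron scripts perform can be sketched as below. The real on-disk format is not documented here, so this example assumes each stored line is `key<TAB>count`, and the JSON layout of the index is likewise hypothetical.

```python
import json
from collections import Counter


def merge_batches(batches):
    """Sum counts across batches; each batch is an iterable of 'key<TAB>count' lines."""
    totals = Counter()
    for lines in batches:
        for line in lines:
            key, _, value = line.rstrip("\n").rpartition("\t")
            # Lines without a tab or with a non-numeric count are ignored.
            if key and value.isdigit():
                totals[key] += int(value)
    return totals


def build_index(totals, top_n=10):
    """Serialize the top_n entries as JSON for a web page to load."""
    top = [{"key": k, "count": c} for k, c in totals.most_common(top_n)]
    return json.dumps(top)
```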

Crontab

A crontab was configured to execute words-cron.py every 10 minutes and screennames-cron.py every hour. The files produced by those two jobs are then visualized in a webapp.

*/10 * * * * source /home/omar.soto2/.bash_profile; flock -w 0 /home/omar.soto2/p2/words-cron.lock /opt/spark/bin/spark-submit /home/omar.soto2/p2/words-cron.py >> /home/omar.soto2/p2/cron_log 2>&1
0 * * * * source /home/omar.soto2/.bash_profile; flock -w 0 /home/omar.soto2/p2/screenname-cron.lock /opt/spark/bin/spark-submit --master yarn  --deploy-mode client --py-files /home/omar.soto2/p2/p2lib.py --conf='spark.executorEnv.PYTHONHASHSEED=223' /home/omar.soto2/p2/screennames-cron.py >> /home/omar.soto2/p2/cron_log 2>&1
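
In the entries above, `flock -w 0` makes a run give up immediately if the previous run still holds the lock file, so jobs never overlap. The same non-blocking behavior can be sketched in Python with the standard `fcntl` module (Unix only). The helper name `try_lock` is hypothetical and is not part of this project.

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock on path, or None if it is taken."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes flock fail at once instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

The lock is released when the returned file is closed or the process exits.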
