Using Apache Spark, we performed a general analysis of Reddit comments from April 2012 and the influence of worldwide events on them.
Group 14
- Gustavo Álvarez.
- Tomás Leyton.
- Bastián Matamala.
The main goal of the project is to perform an exploratory data analysis over Reddit comments, trying to answer the following questions:
- Number of comments per day of the month
- Over all Reddit and on specific subreddits
- Number of comments per hour
- Over all Reddit and on specific subreddits
- The influence of worldwide events on the behaviour of Reddit users; by "worldwide events" we mean the release of the second season of Game of Thrones (we also tried The Avengers, but the movie was released in May).
Due to the high volume of comments on Reddit, we worked with only one month of data. We chose April 2012 because the second season of Game of Thrones aired that month (every Sunday on HBO), and this was a huge event especially for the more "geek"-leaning part of the Reddit community.
The dataset we used can be found at https://files.pushshift.io/reddit/comments/, where each comment is stored as a JSON object. With the original BZIP2 compression its size is 1 786 140 247 bytes; once decompressed, 10 994 516 204 bytes.
According to the uploader, the data contains 19 044 534 JSON objects, each with the structure shown in Image 1.
We used Spark to perform a set of map/reduce tasks as follows:
- Map: emit a ((subreddit_id, subreddit), 1) pair for each comment.
- Reduce: sum the pairs, counting the number of comments per subreddit.
With this pipeline we count the comments per subreddit, and summing those gives the global count. We can then filter the results for specific topics, such as the r/gameofthrones subreddit. By using the hour and day as the key instead, we obtain the counts per timeframe.
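The map/reduce pattern above can be sketched in plain Python (the field names mirror the Reddit JSON schema, but the sample records and IDs here are hypothetical; the real job ran on Spark RDDs):

```python
from collections import Counter

# Hypothetical sample comments mirroring the Reddit JSON schema.
comments = [
    {"subreddit_id": "t5_2rjz2", "subreddit": "gameofthrones"},
    {"subreddit_id": "t5_2rjz2", "subreddit": "gameofthrones"},
    {"subreddit_id": "t5_2qh1i", "subreddit": "AskReddit"},
]

# Map step: emit a ((subreddit_id, subreddit), 1) pair per comment.
pairs = [((c["subreddit_id"], c["subreddit"]), 1) for c in comments]

# Reduce step: sum the 1s per key, i.e. count comments per subreddit.
counts = Counter()
for key, one in pairs:
    counts[key] += one

# The global count is the sum over all subreddits.
total = sum(counts.values())

print(counts[("t5_2rjz2", "gameofthrones")])  # 2
print(total)  # 3
```

In Spark the same logic is a `map` followed by a `reduceByKey` with addition; the pure-Python version just makes the two steps explicit.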
Counting tasks performed:
- Counting comments by subreddit per day (filtering only those with more than 1000 comments).
- Counting comments by hour across the month for each subreddit (filtering only those with more than 1250 comments).
- Counting the total karma (the comment score, each vote counting +1/-1) over all comments by hour.
- Counting the total comments per day and hour across all subreddits.
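The time-based counts reuse the same pattern with a different key. A plain-Python sketch, where created_utc is the Unix-timestamp field present in the dataset (the sample values here are hypothetical):

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical comments; created_utc is a Unix timestamp as in the dataset.
comments = [
    {"subreddit": "gameofthrones", "created_utc": 1333238400},  # 2012-04-01 00:00 UTC
    {"subreddit": "gameofthrones", "created_utc": 1333242000},  # 2012-04-01 01:00 UTC
    {"subreddit": "AskReddit",     "created_utc": 1333242000},  # 2012-04-01 01:00 UTC
]

def day_hour(ts):
    """Turn a Unix timestamp into a (day-of-month, hour) key in UTC."""
    dt = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return (dt.day, dt.hour)

# Count comments per (day, hour) across all subreddits.
per_timeframe = Counter(day_hour(c["created_utc"]) for c in comments)

print(per_timeframe[(1, 1)])  # 2
print(per_timeframe[(1, 0)])  # 1
```

Keying by (subreddit, day) or (subreddit, hour) instead gives the per-subreddit variants of the same count.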
As a last task, we tried to find the number of common redditors (users) for each pair of subreddits. Sadly, as this was such a huge operation, we had to limit the search to "valuable" redditors only: those with at least one comment with more than 100 karma (score) in each subreddit.
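The pairwise search can be sketched in plain Python (the comments, authors, and scores below are hypothetical; the 100-karma threshold is the one from our filter):

```python
from itertools import combinations

# Hypothetical comments: author, subreddit, and score (karma).
comments = [
    {"author": "alice", "subreddit": "gameofthrones", "score": 150},
    {"author": "alice", "subreddit": "masseffect",    "score": 200},
    {"author": "bob",   "subreddit": "gameofthrones", "score": 50},
    {"author": "bob",   "subreddit": "masseffect",    "score": 300},
]

# Step 1: per subreddit, keep only "valuable" redditors, i.e. authors of
# at least one comment with more than 100 karma there.
valuable = {}
for c in comments:
    if c["score"] > 100:
        valuable.setdefault(c["subreddit"], set()).add(c["author"])

# Step 2: for each pair of subreddits, count the common valuable redditors.
common = {
    (a, b): len(valuable[a] & valuable[b])
    for a, b in combinations(sorted(valuable), 2)
}

print(common[("gameofthrones", "masseffect")])  # 1 (only alice qualifies in both)
```

Filtering first is what makes the pairwise intersection tractable: without it, every pair of subreddits would have to compare their full author sets.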
We cached the RDDs we used more than once, but we did not persist them: persisting triggered errors such as thread exceptions and NullPointerExceptions that did not happen without it.
We had some trouble working with the JSON objects, as the standard library provided for the labs included neither DataFrames nor a method to read JSON. We fixed that by downloading the additional libraries, after failing to convert the data to TSV with Pig (the comment bodies contain line breaks and similar characters that break a TSV layout).
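The dataset is newline-delimited JSON: line breaks inside a comment body are escaped inside the string, so each physical line is one complete object. A minimal parsing sketch in plain Python (the sample records are hypothetical):

```python
import json

# Two newline-delimited JSON records; the line break inside the first
# body is escaped as \n, so each physical line is one complete object.
raw = (
    '{"subreddit": "gameofthrones", "body": "line one\\nline two"}\n'
    '{"subreddit": "AskReddit", "body": "hello"}\n'
)

# One json.loads call per physical line recovers every record; the
# escaped \n becomes a real newline inside the parsed body string.
records = [json.loads(line) for line in raw.splitlines() if line]

print(len(records))  # 2
```

This escaping is exactly what a naive TSV conversion loses: once the body is written out unquoted, the embedded newlines split one record across several lines.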
We also had trouble reading the file while it was still compressed in BZ2, hitting an IndexOutOfBoundsException. A possible explanation is that CBZip2InputStream, which the Hadoop version on the server uses to decompress this format, is not thread-safe and therefore fails when combined with Spark. We solved the issue by decompressing the file before running our Spark job.
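Decompressing up front can be done with a command-line tool such as bunzip2, or with Python's bz2 module; a small round-trip sketch (the payload is hypothetical):

```python
import bz2

# Hypothetical payload standing in for one line of the dataset.
data = b'{"subreddit": "gameofthrones"}\n'
compressed = bz2.compress(data)

# Decompress once, in a single thread, before the Spark job runs, so the
# non-thread-safe CBZip2InputStream path in Hadoop is never exercised.
restored = bz2.decompress(compressed)

print(restored == data)  # True
```

Spark then reads the plain text file and splits it across workers without touching the BZ2 codec at all.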
Our runtimes ranged from about 6 to 20 minutes, the longest being the search for common valuable redditors across subreddits.
We found that the day with the most comments is Tuesday, in almost every week, and the day with significantly fewer comments is Sunday.
The timeframe with the most comments is between 05:00 and 15:00 UTC±00:00, almost doubling the hour with the fewest. Considering that Reddit was used mostly in the US at the time, the results are consistent with the typical sleeping hours of the most populated areas of North America.
The most popular subreddits by number of comments are shown in Image 2:
Regarding Game of Thrones, we found that the most commented days are those around each release date (before and after), as shown in Image 3, with peaks on the day immediately after each episode's airing:
On the other hand, we found that the most commented hours are not significantly different from the rest of Reddit (see Appendix), as shown in Image 4, but there is a peak in the hours after each episode airs (which usually happened at 01:00 AM UTC±00:00).
Finally, Image 4 shows the number of valuable users (those with comments with more than 100 points of karma) per pair of subreddits (top ten only):
As shown, Reddit activity is concentrated on working days, meaning that, at least in 2012, it was an important procrastination tool. Perhaps it was not blocked on workplace networks, as is usual for Facebook, Instagram, and other more mainstream social networks.
In this case, a relevant event like the premiere of each episode of the second season of GoT is followed by peaks of activity in the related subreddits, as predicted. We also found, although we did not include the charts, that the number of comments in /r/masseffect decreased as the month went on, which makes sense considering that the third game of the series was released in March 2012.
It is also curious that only a few redditors have comments with more than 100 karma in at least two subreddits, suggesting that upvotes were not easy to obtain at the time.
One of our main conclusions is that distributed systems are very helpful for processing huge amounts of data in less time. Sadly, we could not do everything we wanted, mostly for external reasons such as unexpected behaviour of the cluster and delays related to technical issues. But we can say with high confidence that the last task would have been impossible on a single machine.