This repository has been archived by the owner on Apr 7, 2022. It is now read-only.
/ r-sample-enron Public archive
forked from mrdavid/r-sample-enron

Analyzing the Enron Email Corpus with R.

attaalan/r-sample-enron

 
 

Analyzing the Enron Email Corpus - Demo

This R script analyses a subset of the Enron Email Corpus. It produces four PDF files, each containing a graph that shows how people are connected through the emails present in the corpus.

The Enron Email Corpus can be downloaded from https://www.cs.cmu.edu/~enron/ and its contents should be extracted (untarred) in the same directory as the R script, producing a folder named 'enron_mail_20110402'. The first line of the R script must be adjusted to the local environment. A script that does all of this automatically is provided (see below).

The R script itself contains more comments with details on each step in the analysis.

Data selection

Due to constraints on time and resources, only a subset of the emails is analysed.

Only emails from 'Sent' or 'Sent_Items' folders are analysed.

This greatly reduces the number of emails. 'Sent' folders contain no spam, only email sent deliberately by an Enron employee, which makes it unnecessary to filter emails for validity. 'Sent' folders also contain no duplicates (an email in one person's 'Sent' folder may also appear in another person's inbox, but not in any other 'Sent' folder), so duplicates do not have to be identified and removed.

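The 'Sent' type folders are selected with the following command, which keeps directories whose path contains 'sent' (case-insensitive) and excludes those containing 'presentation':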

find . -type d  | grep -i sent | grep -v presentation > sent_folders
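The same selection could be sketched in R. This is only an illustration of the shell pipeline above, not code taken from the repository's script; the function name `find_sent_folders` and its `root` argument are hypothetical.

```r
# Hypothetical helper mirroring: find . -type d | grep -i sent | grep -v presentation
find_sent_folders <- function(root) {
  # Paths relative to `root`, so that the location of the corpus on disk
  # cannot accidentally match the patterns.
  rel <- list.dirs(root, recursive = TRUE, full.names = FALSE)
  is_sent <- grepl("sent", rel, ignore.case = TRUE)          # grep -i sent
  is_presentation <- grepl("presentation", rel, fixed = TRUE) # grep -v presentation
  file.path(root, rel[is_sent & !is_presentation])
}
```

Calling `find_sent_folders("enron_mail_20110402")` would return the matching directories, assuming the corpus was extracted under that name as described above.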

Only emails sent from one person to exactly one person are analysed

Although the data set contains many emails sent to several recipients, parsing the full To: field of those emails was skipped because of time constraints. Only emails sent directly to a single recipient are kept.
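One way to test for a single recipient is sketched below. It is an assumption about how such a check might look, not the check used in the R script: the helper `has_single_recipient` is hypothetical, it takes the lines of one email file (for example from `readLines`), and it simply counts comma-separated entries in the To: header, so display names that themselves contain commas would be miscounted.

```r
# Hypothetical helper: TRUE if the To: header names exactly one recipient.
has_single_recipient <- function(lines) {
  # The header block ends at the first empty line.
  blank <- which(lines == "")
  header <- if (length(blank) > 0) lines[seq_len(blank[1] - 1)] else lines
  to_idx <- grep("^To:", header)
  if (length(to_idx) == 0) return(FALSE)
  to_idx <- to_idx[1]
  # Folded header lines start with whitespace and continue the To: field.
  end <- to_idx
  while (end < length(header) && grepl("^[ \t]", header[end + 1])) {
    end <- end + 1
  }
  to_value <- paste(sub("^To:", "", header[to_idx:end]), collapse = " ")
  addresses <- trimws(strsplit(to_value, ",", fixed = TRUE)[[1]])
  sum(nzchar(addresses)) == 1
}
```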

Analysis: Mails sent between people whose mailboxes are in the data set

Graph: 00.most.mails.pdf

The first analysis looks only at emails sent between people whose mailboxes are in the data set. To do that, an email is kept only if its recipient has also sent at least one email themselves.
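As a minimal sketch of this filter, assume the emails have already been reduced to a data frame with one row per email and two columns, `from` and `to`. That layout and the function name `keep_internal` are assumptions made for illustration, not names from the R script.

```r
# Hypothetical helper: keep only emails whose recipient also appears as a sender.
keep_internal <- function(edges) {
  senders <- unique(edges$from)
  edges[edges$to %in% senders, , drop = FALSE]
}

# Toy example with made-up addresses.
edges <- data.frame(
  from = c("a@enron.com", "a@enron.com", "b@enron.com"),
  to   = c("b@enron.com", "outside@example.com", "a@enron.com"),
  stringsAsFactors = FALSE
)
internal <- keep_internal(edges)
```

In the toy example, the email to `outside@example.com` is dropped because that address never appears in the `from` column.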

The graph shows only connections over which more than 150 emails have been sent. This threshold reduces the data set to 21 people, a number that can still be plotted legibly. The thickness of each arc in the graph shows the volume of email exchanged. The size of each vertex is proportional to the number of different people that person has sent emails to overall.
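The quantities behind such a plot can be sketched in base R. The sketch assumes a data frame with `from` and `to` columns (one row per email) and treats connections as directed; both are assumptions, as are the names `summarise_edges` and `min_mails`. The resulting weights and sizes could then be handed to a graph plotting package.

```r
# Hypothetical helper: edge weights above a threshold, plus vertex sizes.
summarise_edges <- function(edges, min_mails = 150) {
  from <- as.character(edges$from)
  to <- as.character(edges$to)
  # Vertex size: number of distinct recipients per sender, over all emails.
  pairs <- unique(data.frame(from, to, stringsAsFactors = FALSE))
  out_degree <- vapply(split(pairs$to, pairs$from), length, integer(1))
  # Edge weight: number of emails per ordered (from, to) pair.
  counts <- aggregate(n ~ from + to,
                      data = data.frame(from, to, n = 1L, stringsAsFactors = FALSE),
                      FUN = sum)
  strong <- counts[counts$n > min_mails, , drop = FALSE]
  people <- union(strong$from, strong$to)
  sizes <- out_degree[people]
  sizes[is.na(sizes)] <- 0L  # people who only receive mail
  names(sizes) <- people
  list(edges = strong, vertex_size = sizes)
}
```

The `n` column of the returned `edges` would drive arc thickness and `vertex_size` would drive vertex size.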

Within the selected data, most directly sent mails unfortunately form pairs or small groups that are largely disconnected from each other. The connection between Kay Mann and Suzanne Adams is exceptionally strong.

Analysis: Important people

We quickly analyse some important people within the network. A paper by Shetty and Adibi suggests these names:

  • Louise Kitchen
  • Mike Grigsby
  • Greg Whalley
  • Scott Neal
  • Kenneth Lay

Mails to or from important people: Whole network

Graph: 01.important.people.pdf

We first simply plot the whole network of people who have sent mails to or received mails from one of the people in the list above. The graph shows some interesting properties. Many people exchange emails with only one of the important people (probably in part because their mailboxes are not in the data set), but a group of people in the center of the graph is connected to nearly all of the important people (e.g. John Lavorato and Jeffrey Shankman).
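Selecting this network amounts to keeping every email that has an important person at either end. The sketch below assumes a data frame with `from` and `to` columns holding names in the same form as the list; how people are actually identified in the R script (for example by email address) is not stated here, and `edges_touching` is a hypothetical name.

```r
important <- c("Louise Kitchen", "Mike Grigsby", "Greg Whalley",
               "Scott Neal", "Kenneth Lay")

# Hypothetical helper: emails sent by or to anyone in `people`.
edges_touching <- function(edges, people) {
  edges[edges$from %in% people | edges$to %in% people, , drop = FALSE]
}

# Toy example; "Person A" and "Person B" are placeholders, not corpus data.
toy <- data.frame(
  from = c("Louise Kitchen", "Person A", "Person A"),
  to   = c("Person A", "Kenneth Lay", "Person B"),
  stringsAsFactors = FALSE
)
network <- edges_touching(toy, important)
```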

Mails to or from important people: Network of 'well connected' people

Graph: 02.important.people.well.connected.pdf

The graph shows only those people in the network that have more than 4 connections. Within the selected data set, this will be people that are connected to most of the 'important people' in the network. One can see that nearly everybody is connected to Louise Kitchen.
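A possible way to find the people with more than 4 connections is sketched below. It assumes that a connection means a distinct correspondent regardless of direction, which the text does not spell out, and again uses a hypothetical `from`/`to` data frame and a hypothetical function name `well_connected`.

```r
# Hypothetical helper: people with more than `min_connections` distinct neighbours.
well_connected <- function(edges, min_connections = 4) {
  from <- as.character(edges$from)
  to <- as.character(edges$to)
  # Order each pair so that a->b and b->a count as the same connection.
  a <- pmin(from, to)
  b <- pmax(from, to)
  pairs <- unique(data.frame(a, b, stringsAsFactors = FALSE))
  pairs <- pairs[pairs$a != pairs$b, , drop = FALSE]  # ignore self-addressed mail
  degree <- table(c(pairs$a, pairs$b))
  as.character(names(degree)[degree > min_connections])
}
```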

Mails to or from important people: Strength of connections between 'well connected' people

Graph: 03.important.people.well.connected.weights.pdf

The last graph shows the network of 'well connected' people, with the thickness of each line indicating the number of emails sent from one person to the other. The connections between Louise Kitchen and both John Lavorato and Sally Beck are exceptionally strong. Furthermore, John Lavorato has quite strong connections to all of the 'important people' mentioned above.

Running the analysis from scratch

The analysis can be run from scratch with the commands below, using only the R script and assuming that the relevant libraries are installed:

chmod u+x run.sh
./run.sh

Sources

Source of Enron Corpus: https://www.cs.cmu.edu/~enron/

Please note: This is a demo of R rather than a proper analysis. Due to time constraints, only a small part of the emails was analyzed. No conclusions should be drawn from the results without analyzing the whole dataset.


Languages

  • R 93.4%
  • Shell 6.6%