---
title: "Enron Network Analysis Tutorial"
date: "`r date()`"
output: pdf_document
---
**Enron Tutorial**
We provide this Enron Tutorial as an appendix to the paper in the *Journal of Statistics Education*, *Network Analysis with the Enron Email Corpus*. The paper describes the centrality measures in detail; here we walk through the steps of the R analysis.
As in the .Rmd file with the R code, be sure to install the packages *WGCNA* and *igraph*. The code also loads *RColorBrewer* and *gplots*.
```{r, include=FALSE}
# load the libraries
#install.packages(c("igraph"))
#install.packages("BiocManager") # biocLite() has been replaced by BiocManager
#BiocManager::install("WGCNA")
library(WGCNA)
library(igraph)
library(RColorBrewer)
library(gplots)
```
```{r, include=FALSE}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1)
}
```
The first step of the analysis is to import the data and create the adjacency matrix of our choice. The matrix AM represents emails sent from node $i$ to node $j$, with messages sent via CC weighted as described in the paper. Its transpose, AMt, represents emails received by node $i$ from node $j$ (again, with CC values weighted differently than messages sent directly to an individual). The sum of the two matrices with the diagonal set to zero, AM2, represents the total email correspondence (sent and received) between nodes $i$ and $j$.
```{r}
setwd("H:/teaching/ResearchCircle/Spring2014 - DataScience/JSE.DSS") # change to the folder that holds the data file
AM = as.matrix(read.csv("Final Adjacency Matrix.csv",
sep=",", header=TRUE, row.names=1)) # sent emails
AMlist = read.csv("Final Adjacency Matrix.csv",
sep=",", header=TRUE, row.names=1) # sent emails as list
# employee information might be interesting to analyze for considering relationships
# within the company
# enronemployees = read.table("Enron Employee Information.csv", sep=",", header=T)
AMt = t(AM) # received emails
AM2 <- AM + t(AM) - 2*diag(diag(AM)) # sent and received emails
```
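As a small illustration of how the three matrices relate, the sketch below builds them from a made-up 3 x 3 count matrix. The names `toyAM`, `toyAMt`, and `toyAM2` and all of the counts are hypothetical, not Enron data. Note that subtracting `2*diag(diag(...))` zeroes the diagonal of the sum, so any self-addressed email is dropped.

```r
# hypothetical sent-email counts: entry [i, j] = emails sent from i to j
toyAM <- matrix(c(1, 2, 0,
                  3, 0, 1,
                  0, 4, 2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
toyAMt <- t(toyAM)                                 # received emails
toyAM2 <- toyAM + toyAMt - 2 * diag(diag(toyAM))   # sent + received, diagonal zeroed
isSymmetric(toyAM2)    # the combined matrix no longer has a direction
toyAM2["a", "b"]       # 2 sent + 3 received = 5
diag(toyAM2)           # all zero
```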
We can represent the adjacency matrix graphically using a heatmap.
```{r}
AM.names=c(rep(NA,20), row.names(AM)[21],rep(NA,44), row.names(AM)[66],
rep(NA,2), row.names(AM)[69], rep(NA,87))
heatmap.2(log2(AM+1), Rowv=FALSE, Colv=FALSE, dendrogram="none",
col = brewer.pal(9,"Blues"), scale="none", trace="none",
labRow=AM.names, labCol=AM.names, colsep=FALSE,
density.info="none", key.title="", key.xlab="# of emails (log2 scale)",
margins=c(8,8))
```
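The label vector above is built by counting the `NA`s between the three named rows (positions 21, 66, and 69). A less error-prone way to get the same kind of vector is to start from all `NA` and fill in the chosen positions. The sketch below uses a hypothetical vector of ten names and hypothetical positions. It also shows why the heatmap uses `log2(AM+1)`: adding 1 keeps zero counts at zero on the log scale.

```r
toyNames <- paste0("person", 1:10)    # hypothetical row names
toyKeep <- c(2, 7, 9)                 # hypothetical positions to label
toyLabels <- rep(NA_character_, length(toyNames))
toyLabels[toyKeep] <- toyNames[toyKeep]
toyLabels

toyCounts <- c(0, 1, 3, 1023)         # hypothetical email counts
toyLog <- log2(toyCounts + 1)         # 0 1 2 10
toyLog
```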
**Eigenvector Centrality**
The first measure of centrality that we use is eigenvector centrality; the *evcent* function is available in the *igraph* package. Degree, betweenness, and closeness centrality are also available in the *igraph* package.
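
To make the idea concrete: for an undirected graph, eigenvector centrality is the leading eigenvector of the adjacency matrix, so a node scores highly when its neighbors score highly. The base-R sketch below computes it on a hypothetical four-node graph (the names `toyEdges`, `toyA`, and `toyEV` are made up) and rescales so that the largest score is 1, which to our understanding mirrors the default scaling in *igraph*. It is an illustration of the definition under those assumptions, not a replacement for `evcent`.

```r
# hypothetical undirected graph with edges a-b, a-c, b-c, c-d
toyA <- matrix(0, 4, 4, dimnames = list(letters[1:4], letters[1:4]))
toyEdges <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"))
toyA[toyEdges] <- 1
toyA <- toyA + t(toyA)                    # make the matrix symmetric
toyEV <- abs(eigen(toyA)$vectors[, 1])    # leading eigenvector (its sign is arbitrary)
toyEV <- toyEV / max(toyEV)               # rescale so the largest score is 1
names(toyEV) <- rownames(toyA)
round(toyEV, 3)
rowSums(toyA)                             # degree, for comparison
```

Node `c` has the most neighbors and also the highest eigenvector score; `a` and `b` tie because they occupy symmetric positions in the graph.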
```{r}
# eigenvector centrality (on both directed graphs),
# degree, betweenness, and closeness
# note: AMlist is a data frame, so do.call(rbind, AMlist) stacks its columns as rows
# (i.e., t(AM)), while do.call(cbind, AMlist) rebuilds the original orientation
eng <- graph.adjacency(do.call(rbind,AMlist)) # network graph from the columns of AMlist bound as rows
engt <- graph.adjacency(do.call(cbind,AMlist)) # network graph with the opposite (transposed) orientation
eigcent <- igraph::evcent(eng, directed=TRUE) # eigenvector centrality
eigcentt <- igraph::evcent(engt, directed=TRUE) # eigenvector centrality on transpose of graph
dcent <- igraph::degree(eng) # degree centrality
bmeas <- igraph::betweenness(eng) # betweenness
cmeas <- igraph::closeness(eng) # closeness
# TOM
AM2 <- AM2 / max(AM2) # set values between 0 and 1
TOM <- TOMsimilarity(AM2) # create TOM
TOMrank <- as.matrix(apply(TOM,1,sum)) # grab its row-sums
rownames(TOMrank) <- rownames(AM)
colnames(TOMrank) <- "value"
```
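The topological overlap between two nodes is large when they are directly connected and also share many neighbors. As a rough sketch of the usual unsigned definition, $TOM_{ij} = (\sum_{u} a_{iu}a_{uj} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$ with $k_i = \sum_u a_{iu}$, the base-R function below applies the formula to a small made-up adjacency matrix with entries in [0, 1] and a zero diagonal. Here `toyTOM` is a hypothetical helper, and we have not checked that it reproduces the output of `TOMsimilarity` exactly.

```r
# hypothetical helper: unsigned topological overlap from an adjacency matrix
toyTOM <- function(a) {
  diag(a) <- 0
  shared <- a %*% a                  # sum over u of a[i, u] * a[u, j]
  k <- rowSums(a)                    # connectivity of each node
  kmin <- outer(k, k, pmin)          # min(k_i, k_j) for every pair
  tom <- (shared + a) / (kmin + 1 - a)
  diag(tom) <- 1                     # a node overlaps fully with itself
  tom
}

# hypothetical undirected graph with edges a-b, a-c, b-c, c-d, d-e
toyAdj <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
toyPairs <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"), c("d", "e"))
toyAdj[toyPairs] <- 1
toyAdj <- toyAdj + t(toyAdj)
toyOverlap <- toyTOM(toyAdj)
round(toyOverlap, 2)
rowSums(toyOverlap)                  # row sums, the same summary used for TOMrank
```

In this toy graph `a` and `b` are connected and share the neighbor `c`, so their overlap is 1; `a` and `d` are not connected but share `c`, so their overlap is 1/3; `a` and `e` have nothing in common and get 0.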
Initially, we plot the ranks of the individuals based on the different measures of centrality. The ranks are clearly correlated, but we can also see that they seem to be measuring different qualities of the email correspondence matrix.
```{r}
comptable <- matrix(ncol=6, nrow=dim(AM)[1])
comptable[,1] <- rank(dcent)
comptable[,2] <- rank(eigcent$vector)
comptable[,3] <- rank(eigcentt$vector)
comptable[,4] <- rank(cmeas)
comptable[,5] <- rank(bmeas)
comptable[,6] <- rank(TOMrank)
pairs(comptable[,1:6],pch=20,main="Ranking Metrics Comparison",
labels=c("Degree","EV Cent.", "EV Cent. (T)","Closeness","Betweenness", "TOM"),
cex=.5,xlim=c(0,160),ylim=c(0,160),lower.panel=panel.cor)
```
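Because the scatterplot matrix is drawn on ranks, the number printed by `panel.cor` in each lower panel is the absolute value of the Spearman correlation between two measures. The sketch below checks that equivalence on made-up scores (`toyX` and `toyY` are hypothetical).

```r
toyX <- c(10, 3, 7, 1, 8)              # hypothetical scores, measure 1
toyY <- c(0.9, 0.5, 0.2, 0.1, 0.7)     # hypothetical scores, measure 2
toyRankCor <- cor(rank(toyX), rank(toyY))              # Pearson correlation of the ranks
toySpearman <- cor(toyX, toyY, method = "spearman")    # Spearman correlation of the scores
all.equal(toyRankCor, toySpearman)
```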
Next, we list the 10 most central individuals for each metric. Note that we use the negative of the centrality measure so that the `order` function returns the most central individual first.
```{r}
rankedEnron <- data.frame(Degree = rownames(AM)[order(-dcent)],
EVcent = rownames(AM)[order(-eigcent$vector)],
EVcentT = rownames(AM)[order(-eigcentt$vector)],
Close = rownames(AM)[order(-cmeas)],
Between = rownames(AM)[order(-bmeas)],
TOM = rownames(AM)[order(-TOMrank)])
rankedEnron[1:10,]
```
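The `order(-x)` idiom can be checked on a small made-up vector (`toyScore` is hypothetical). When there are no ties, `order(x, decreasing = TRUE)` gives the same ordering.

```r
toyScore <- c(ann = 0.2, bob = 0.9, cat = 0.5)    # hypothetical centrality scores
toyByNeg <- names(toyScore)[order(-toyScore)]     # most central first
toyByDec <- names(toyScore)[order(toyScore, decreasing = TRUE)]
toyByNeg
identical(toyByNeg, toyByDec)
```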
**Hierarchical Clustering**
Below, we create hierarchical clusterings from both the symmetric (sent and received) email adjacency matrix and the TOM adjacency built from the symmetric measures. After building each dendrogram, we find groups of employees who are strongly linked and report the names of the individuals.
```{r}
# dissimilarity is 1 - number of sent and received / max of sent and received over all individuals
dissAM2=1-AM2
# Create the hierarchical clustering
hierAM2=hclust(as.dist(dissAM2), method="average")
groups.9=as.character(cutreeStaticColor(hierAM2, cutHeight=.9, minSize=4))
# Plot the dendrogram with the group colors underneath:
plotDendroAndColors(dendro = hierAM2,colors=data.frame(groups.9),
dendroLabels = FALSE, abHeight=.9,
marAll =c(0.2, 5, 2.7, 0.2), hang=.05,
main ="min 4 per group, cutoff=0.9",ylab="1 - S&R/max(S&R)")
table(groups.9)
row.names(AM)[groups.9=="turquoise"]
row.names(AM)[groups.9=="blue"]
### Now cluster with TOM
# for the next plot, dissimilarity uses the TOM metric to incorporate neighbors
dissTOM=TOMdist(AM2)
rownames(dissTOM) <- rownames(AM)
colnames(dissTOM) <- rownames(AM)
# Create the hierarchical clustering
hierTOM=hclust(as.dist(dissTOM), method="average")
groupsTOM.95=as.character(cutreeStaticColor(hierTOM, cutHeight=.95, minSize=4))
# Plot the dendrogram with the group colors underneath:
plotDendroAndColors(dendro = hierTOM,colors=data.frame(groupsTOM.95), abHeight=.95,
dendroLabels = FALSE, marAll =c(0.2, 5, 2.7, 0.2),
main ="min 4 per group, cutoff=0.95", ylab="TOM dissimilarity")
table(groupsTOM.95)
row.names(AM)[groupsTOM.95=="turquoise"]
row.names(AM)[groupsTOM.95=="blue"]
row.names(AM)[groupsTOM.95=="brown"]
row.names(AM)[groupsTOM.95=="yellow"]
```
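For readers who want to see the clustering steps without *WGCNA*, here is a minimal base-R sketch on a made-up six-person correspondence matrix (all names and counts are hypothetical). It uses `cutree` as a simple stand-in for `cutreeStaticColor`; unlike that function, `cutree` has no minimum group size and returns integer labels rather than colors.

```r
toyPeople <- c("p1", "p2", "p3", "p4", "p5", "p6")
toyS <- matrix(1, 6, 6, dimnames = list(toyPeople, toyPeople))  # weak ties everywhere
toyS[1:3, 1:3] <- 40                      # first tightly linked group
toyS[4:6, 4:6] <- 50                      # second tightly linked group
diag(toyS) <- 0
toyDiss <- 1 - toyS / max(toyS)           # same scaling as dissAM2
toyHier <- hclust(as.dist(toyDiss), method = "average")
toyGroups <- cutree(toyHier, h = 0.9)     # cut the tree at height 0.9
split(toyPeople, toyGroups)
```

Within-group dissimilarities here are at most 0.2 and between-group dissimilarities are 0.98, so cutting at 0.9 separates the two groups.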