diff --git a/Dexamethasone Treatment Differential Gene Expression Analysis with Machine Learning.pdf b/Dexamethasone Treatment Differential Gene Expression Analysis with Machine Learning.pdf new file mode 100644 index 0000000..f78ed73 Binary files /dev/null and b/Dexamethasone Treatment Differential Gene Expression Analysis with Machine Learning.pdf differ diff --git a/Source Code.Rmd b/Source Code.Rmd new file mode 100644 index 0000000..197ff3e --- /dev/null +++ b/Source Code.Rmd @@ -0,0 +1,428 @@ +--- +title: "Differential Gene Expression Analysis" +author: "Yogindra Raghav, Nityam Rathi, Matt Eckelmeyer, Kyle Coleman" +date: "November 19, 2018" +output: word_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +**Goals** + +1. Identify which genes are differentially expressed after dexamethasone treatment. + +NOTE: The threshold for classifying a gene as "differentially expressed" will be determined based on the values we obtain after tidying the data. The subset tabulation that includes differentially expressed genes will be the first result of our statistical analysis. The term "signal intensity" refers to how strongly a gene is being expressed at a given time. + + +2. Construct biplot and volcano plot that will allow us to more easily visualize the genes that are classified as differentially expressed at each of the two time points. + +3. Construct statistical prediction model (linear regression) using a random subset of the data. We aim to give the model quantitative information to see how accurate predictions from the model are. Residuals will be plotted and the summary statistics of our regression modeling will be the second result of our statistical analysis. + + +**Analysis** + +Loading packages: + +```{r,message=FALSE, warning = FALSE } + +library(mdsr) # for data tidying and analysis +library(readr) # for reading in files + +``` + + +Loading datasets into variables: + +```{r, results = 'hide', message=FALSE} + +hour_0_replicate_1 = read_delim("/Users/MECKELMEYER/Desktop/GSM154262.txt",delim = "\t", skip = 9) # load in tab delimited file and skip first 9 lines since they contain statistical information about Agilent Microarray machine that is irrelevant for our analysis. + +hour_0_replicate_2 = read_delim("/Users/MECKELMEYER/Desktop/GSM154263.txt",delim = "\t", skip = 9) + +hour_0_replicate_3 = read_delim("/Users/MECKELMEYER/Desktop/GSM154264.txt",delim = "\t", skip = 9) + +hour_6_replicate_1 = read_delim("/Users/MECKELMEYER/Desktop/GSM154277.txt",delim = "\t", skip = 9) + +hour_6_replicate_2 = read_delim("/Users/MECKELMEYER/Desktop/GSM154278.txt",delim = "\t", skip = 9) + +hour_6_replicate_3 = read_delim("/Users/MECKELMEYER/Desktop/GSM154279.txt",delim = "\t", skip = 9) + + +``` + + +Selecting for relevant columns: + +```{r} + +# GeneName will be used for joining tables of biological replicates. +# gNetSignal gives signal difference between signal intensity and background intensity, therefore giving true normalized measure of intensity. +# gIsWellAboveBG is for genes whose intensity signal is 2.6 standard deviations from average signal intensity. + +hour_0_replicate_1 = hour_0_replicate_1 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG) +hour_0_replicate_2 = hour_0_replicate_2 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG) +hour_0_replicate_3 = hour_0_replicate_3 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG) +hour_6_replicate_1 = hour_6_replicate_1 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG) +hour_6_replicate_2 = hour_6_replicate_2 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG) +hour_6_replicate_3 = hour_6_replicate_3 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG) +``` + +Getting rid of "placeholder" gene names: +```{r} + +# Certain signal intensities observed by the instrument could not be associated with any known gene. For these genes, the gene name column was given a placeholder name of the molecular probe ID (ProbeName) that was used to target the gene. + +# First we must figure out what "placeholder" values could be possible for genes by looking at the ProbeName column. ProbeName entries are standardized and will have the same possibilities between all data files. For this reason, we will go ahead and use the hour_0_replicate_1 table to find the prefixes of all possible ProbeNames that could be put into the GeneName column. + +ProbeName = hour_0_replicate_1$ProbeName + +substr(ProbeName, start = 1, stop =2) %>% unique() + +# These are possible substrings. Upon further analysis of the "Br" prefix, we realized the "Br" prefix in ProbeName refers to a control sample called "BrightCorner" which we need to get rid of from our data set (assuming it is contained at some point in the GeneName column). Another control GeneName by convention is "NegativeControl" which we will also filter out. + +# Let's filter out control entries: + +hour_0_replicate_1 = hour_0_replicate_1 %>% filter(GeneName!="BrightCorner"& GeneName!="NegativeControl") +hour_0_replicate_2 = hour_0_replicate_2 %>% filter(GeneName!="BrightCorner"& GeneName!="NegativeControl") +hour_0_replicate_3 = hour_0_replicate_3 %>% filter(GeneName!="BrightCorner"& GeneName!="NegativeControl") +hour_6_replicate_1 = hour_6_replicate_1 %>% filter(GeneName!="BrightCorner" & GeneName!="NegativeControl") +hour_6_replicate_2 = hour_6_replicate_2 %>% filter(GeneName!="BrightCorner" & GeneName!="NegativeControl") +hour_6_replicate_3 = hour_6_replicate_3 %>% filter(GeneName!="BrightCorner" & GeneName!="NegativeControl") + +# Let's filter out possible probe entries that are being used as placeholder gene names: + +hour_0_replicate_1 = hour_0_replicate_1 %>% filter(substr(GeneName, 1,2) != "(-") +hour_0_replicate_1 = hour_0_replicate_1 %>% filter(substr(GeneName, 1,2) != "(+") +hour_0_replicate_1 = hour_0_replicate_1 %>% filter(substr(GeneName, 1,2) != "A_") + + +hour_0_replicate_2 = hour_0_replicate_2 %>% filter(substr(GeneName, 1,2) != "(-") +hour_0_replicate_2 = hour_0_replicate_2 %>% filter(substr(GeneName, 1,2) != "(+") +hour_0_replicate_2 = hour_0_replicate_2 %>% filter(substr(GeneName, 1,2) != "A_") + + +hour_0_replicate_3 = hour_0_replicate_3 %>% filter(substr(GeneName, 1,2) != "(-") +hour_0_replicate_3 = hour_0_replicate_3 %>% filter(substr(GeneName, 1,2) != "(+") +hour_0_replicate_3 = hour_0_replicate_3 %>% filter(substr(GeneName, 1,2) != "A_") + +hour_6_replicate_1 = hour_6_replicate_1 %>% filter(substr(GeneName, 1,2) != "(-") +hour_6_replicate_1 = hour_6_replicate_1 %>% filter(substr(GeneName, 1,2) != "(+") +hour_6_replicate_1 = hour_6_replicate_1 %>% filter(substr(GeneName, 1,2) != "A_") + +hour_6_replicate_2 = hour_6_replicate_2 %>% filter(substr(GeneName, 1,2) != "(-") +hour_6_replicate_2 = hour_6_replicate_2 %>% filter(substr(GeneName, 1,2) != "(+") +hour_6_replicate_2 = hour_6_replicate_2 %>% filter(substr(GeneName, 1,2) != "A_") + +hour_6_replicate_3 = hour_6_replicate_3 %>% filter(substr(GeneName, 1,2) != "(-") +hour_6_replicate_3 = hour_6_replicate_3 %>% filter(substr(GeneName, 1,2) != "(+") +hour_6_replicate_3 = hour_6_replicate_3 %>% filter(substr(GeneName, 1,2) != "A_") + + + + + +``` + + +Averaging Identical Gene Names: + +```{r} + +# There are rows that contain identical GeneName entries. It is important for us to average the gNetSignal found for these genes so that we do not consider the same gene as a different data point. + +hour_0_replicate_1 = hour_0_replicate_1 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG)) # for every gene name this averages gNetSignal and gIsWellAboveBG + +nrow(hour_0_replicate_1) + +hour_0_replicate_2 = hour_0_replicate_2 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG)) + +nrow(hour_0_replicate_2) + +hour_0_replicate_3 = hour_0_replicate_3 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG)) + +nrow(hour_0_replicate_3) + +hour_6_replicate_1 = hour_6_replicate_1 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG)) + +nrow(hour_6_replicate_1) + +hour_6_replicate_2 = hour_6_replicate_2 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG)) + +nrow(hour_6_replicate_2) + +hour_6_replicate_3 = hour_6_replicate_3 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG)) + +nrow(hour_6_replicate_3) + +# First 5 GeneName entries after summarise are alpha-numeric strings that are placeholders. As you can see, this data really is messy. + +hour_0_replicate_1 = tail(hour_0_replicate_1, -5) # get rid of first five entries. + +hour_0_replicate_2 = tail(hour_0_replicate_2, -5) + +hour_0_replicate_3 = tail(hour_0_replicate_3, -5) + +hour_6_replicate_1 = tail(hour_6_replicate_1, -5) + +hour_6_replicate_2 = tail(hour_6_replicate_2, -5) + +hour_6_replicate_3 = tail(hour_6_replicate_3, -5) + +``` + + +Joining biological replicates: + +```{r} + +# gNetSignal is normalized value that takes into account changes in machine background intensity for microarrays. Based on this, we can compare gNetSignal values between different runs on the same machine. These were done with biological replicates. For this reason, we are going to average gNetSignal values across the three replicates for each gene. + +hour_0_isoform_a = left_join(hour_0_replicate_1, hour_0_replicate_2, by = "GeneName") # we can left_join since we have confirmed with anti_join (not included) and nrow() that the tables have the same genes. + +hour_0_isoform_a = hour_0_isoform_a %>% left_join(., hour_0_replicate_3, by = "GeneName") # join it once more + +head(hour_0_isoform_a) + +hour_0_isoform_a = hour_0_isoform_a %>% group_by(GeneName) %>% summarise(gNetSignal = (gNetSignal.x+gNetSignal.y+gNetSignal)/3, gIsWellAboveBG = (gIsWellAboveBG.x+ gIsWellAboveBG.y + gIsWellAboveBG)/3) # Average the gNetSignal and gIsWellAboveBg values between the 3 biological replicates + +head(hour_0_isoform_a) + +hour_6_isoform_a = left_join(hour_6_replicate_1, hour_6_replicate_2, by = "GeneName") + +hour_6_isoform_a = hour_6_isoform_a %>% left_join(., hour_6_replicate_3, by = "GeneName") + +hour_6_isoform_a = hour_6_isoform_a %>% group_by(GeneName) %>% summarise(gNetSignal = (gNetSignal.x+gNetSignal.y+gNetSignal)/3, gIsWellAboveBG = (gIsWellAboveBG.x+ gIsWellAboveBG.y + gIsWellAboveBG)/3) + +head(hour_6_isoform_a) + +``` + + +Merging both time points into one table and renaming columns: + +```{r} + +hour_0_hour_6_isoform_a_combined = hour_0_isoform_a %>% left_join(hour_6_isoform_a, by = "GeneName") # join the two tables using "GeneName" + +head(hour_0_hour_6_isoform_a_combined) + +colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gNetSignal.x"] <- "Hour_0_gNetSignal" # change column names + +colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gNetSignal.y"] <- "Hour_6_gNetSignal" + +colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gIsWellAboveBG.x"] <- "Hour_0_gIsWellAboveBG" + +colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gIsWellAboveBG.y"] <- "Hour_6_gIsWellAboveBG" + +head(hour_0_hour_6_isoform_a_combined) # show new column names + + +``` + + + +Log Transform gNetSignal values: + +```{r} + +max(hour_0_hour_6_isoform_a_combined$Hour_0_gNetSignal) +min(hour_0_hour_6_isoform_a_combined$Hour_0_gNetSignal) +max(hour_0_hour_6_isoform_a_combined$Hour_6_gNetSignal) +min(hour_0_hour_6_isoform_a_combined$Hour_6_gNetSignal) + +# since spread of values is really great for gNetSignal variable, we are going to log transform these columns. + +hour_0_hour_6_isoform_a_combined= hour_0_hour_6_isoform_a_combined %>% mutate(log2_Hour_0_gNetSignal = log2(Hour_0_gNetSignal), log2_Hour_6_gNetSignal = log2(Hour_6_gNetSignal)) # log transform these values + +head(hour_0_hour_6_isoform_a_combined) + + +``` + + +Calculate Signal Difference: + +```{r} + +# Making a column that contains the difference between the log2 values. We are going to subtract the values from the log normalized values at each time point since subtracting logs is equivalent to the fold change. For example, if log2(later time point) - log2(earlier time point) = 1, then this corresponds to a 2-fold increase in gene expression between the early and late stages. + +hour_0_hour_6_isoform_a_combined= hour_0_hour_6_isoform_a_combined %>% mutate(difference = log2_Hour_6_gNetSignal - log2_Hour_0_gNetSignal) + +head(hour_0_hour_6_isoform_a_combined) + + +``` + +Plotting Density: + +```{r} + +# Plotting density to show spread of log transformed gNetSignals as well as differences. + +plot(density(hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal)) +plot(density(hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal)) +plot(density(hour_0_hour_6_isoform_a_combined$difference)) # As seen in this density plot containing data for both hour 0 and hour 6, the data appears more normalized, indicating that our normalization was successful to at least a reasonable extent. + +``` + +BiPlot for hour 0 and hour 6 data expression levels: + +```{r} + +#Plot of hour 0 vs. hour 6 gNetSignals. All points that show a 1.5 fold increase between samples are highlighted in red. We deem these points differentially expressed. + +plot(hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal, hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal, xlab="Hour 0 expression", ylab="Hour 6 expression", col=ifelse(hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal >= 1.5, "red", ifelse(hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal >= 1.5, "red", "black"))) +abline(a=0, b=1, col= "purple") +``` + +This plot shows hour 0 vs hour 6 expression levels for all genes in our data set. We have highlighted "differentially expressed" genes in red. The criteria for being differentially expressed is a fold-difference of 1.5 or more between hour 0 and hour 6 expression levels. Based on this criteria, we ended up with two distinct clusters of differentially expressed genes, those that have significantly higher expression in hour 6 and those that have significantly higher expression in hour 0. + +```{r} + +#Here we subsetted the original data set to obtain a subset of only differentially expressed genes (genes that show a 1.5 fold increase between samples). + +hour_0_hour_6_isoform_a_combined_diff_expressed = hour_0_hour_6_isoform_a_combined[ which (hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal >= 1.5 | hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal >= 1.5), ] +hour_0_hour_6_isoform_a_combined_diff_expressed +``` + +```{r} + +#This creates a subset of differentially expressed genes which show at least a 1.5 fold increase in expression at hour 6 compared to hour 0 (positively differentially expressed). + +hour_0_hour_6_isoform_a_combined_pos_diff_expressed = hour_0_hour_6_isoform_a_combined_diff_expressed[ which (hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_6_gNetSignal/hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_0_gNetSignal >= 1.5), ] +``` + +```{r} + +#This creates a subset of differentially expressed genes which show at least a 1.5 fold increase in expression at hour 0 compared to hour 6 (negatively differentially expressed). + +hour_0_hour_6_isoform_a_combined_neg_diff_expressed = hour_0_hour_6_isoform_a_combined_diff_expressed[ which (hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_0_gNetSignal/hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_6_gNetSignal >= 1.5), ] +``` + +Statistical Prediction Model: + +```{r} + +#Creating the training and testing data sets for all differentially expressed genes. + +set.seed(99) +train3 <- hour_0_hour_6_isoform_a_combined_diff_expressed %>% sample_frac(size = 0.5) +test3 <- hour_0_hour_6_isoform_a_combined_diff_expressed %>% setdiff(train3) +``` + +```{r} + +#Creating the training and testing data sets for positively differentially expressed genes. + +set.seed(99) +train1 <- hour_0_hour_6_isoform_a_combined_pos_diff_expressed %>% sample_frac(size = 0.5) +test1 <- hour_0_hour_6_isoform_a_combined_pos_diff_expressed %>% setdiff(train1) +``` + +```{r} + +#Creating the training and testing data sets for negatively differentially expressed genes. + +set.seed(99) +train2 <- hour_0_hour_6_isoform_a_combined_neg_diff_expressed %>% sample_frac(size = 0.5) +test2 <- hour_0_hour_6_isoform_a_combined_neg_diff_expressed %>% setdiff(train2) +``` + +```{r} + +#Combinding both training sets and testing sets for positive and negative differentially expressed genes. + +train_total = rbind(train1, train2) +test_total = rbind(test1, test2) +``` + + +```{r} + +#Plotting the combined training data for all differentially expressed genes using hour 0 and hour 6 expression levels. We also plotted two regression lines based on the training data, one for the positive differentially expressed genes and one for the negative differentially expressed genes. + +plot(train_total$log2_Hour_0_gNetSignal, train_total$log2_Hour_6_gNetSignal, xlab="Hour 0 expression", ylab="Hour 6 expression", col= "red") +abline(lm(train1$log2_Hour_6_gNetSignal~train1$log2_Hour_0_gNetSignal)) +abline(lm(train2$log2_Hour_6_gNetSignal~train2$log2_Hour_0_gNetSignal)) +``` + + +This plot shows the combined training data for all differentially expressed genes using hour 0 and hour 6 expression levels. The two plotted regression lines are based on the correlation between hour 6 and hour 0 expression levels from the training data. One of the regression lines is for the positive differentially expressed genes and the other is for the negative differentially expressed genes. + +```{r} + +#Plotting the combined testing data for all differentially expressed genes using hour 0 and hour 6 expression levels. We used the same regression lines from our training data analysis to show how closely these training set regression lines fit our testing data sets. + +plot(test_total$log2_Hour_0_gNetSignal, test_total$log2_Hour_6_gNetSignal, xlab="Hour 0 expression", ylab="Hour 6 expression", col= "red") +abline(lm(train1$log2_Hour_6_gNetSignal~train1$log2_Hour_0_gNetSignal)) +abline(lm(train2$log2_Hour_6_gNetSignal~train2$log2_Hour_0_gNetSignal)) +``` + +This plot shows the combined testing data for all differentially expressed genes using hour 0 and hour 6 expression levels. The two plotted regression lines are the same lines established from out training data set. Even though these regression lines are not from our testing data set, they fit the testing data very well and appear to have low residuals. This is an indication that our regression lines from our training data are a good fit for positively and negatively differentially expressed genes. + +```{r} + +#Getting a summary of the linear model for the positively differentially expressed training data set to obtain the regression line coefficents. + +lin_mod_1 = lm(train1$log2_Hour_6_gNetSignal~train1$log2_Hour_0_gNetSignal) +lin_mod_1_summary = summary(lin_mod_1) +lin_mod_1_summary$coefficients +``` + +```{r} + +#Calculating the residuals for the positively differentially expressed test data set using the coefficients from the training data linear model. + +test1$residual = test1$log2_Hour_6_gNetSignal - (1.5909547*test1$log2_Hour_0_gNetSignal + 0.7052012) +test1$residual +``` + +```{r} + +#Getting a summary of the linear model for the negatively differentially expressed training data set to obtain the regression line coefficents. + +lin_mod_2 = lm(train2$log2_Hour_6_gNetSignal~train2$log2_Hour_0_gNetSignal) +lin_mod_2_summary = summary(lin_mod_2) +lin_mod_2_summary$coefficients +``` + +```{r} + +#Calculating the residuals for the negatively differentially expressed test data set using the coefficients from the training data linear model. + +test2$residual = test2$log2_Hour_6_gNetSignal - (0.3069249*test2$log2_Hour_0_gNetSignal + 3.7224380) +test2$residual +``` + +```{r} + +#Calculating SSE for the residuals from the combined differentially expressed genes to determine how well our test data fits our training linear model. + +test_total = rbind(test1, test2) +SSE = sum((test_total$residual)^2) +SSE +``` + + +**Appendix** + +*Group Member's Contribution* +All group members were greatly involved in devising and planning this project, which involved: reading the scientific literature related to the data sets utilized in this study, identifying key points of analysis, and formulating a plan. Nityam Rathi and Yogi Raghav initially mined through the raw datasets (GSE6711) and determined which data sets to use to fulfill our goal. Nityam and Yogi then read pertinent literature on microarray data to identify which variables in the massive data sets could be used to calculate differential gene expression and for subsequent statistical sets. After cutting the data sets to include only our variables of interest (gNetSignal and gIsWellAboveBG), Nityam and Yogi performed data wrangling and appropriate normalization techniques to create a final data set upon which the density plots are shown. This abridged dataset was then used for the differential gene expression analysis and statistal learning/regression models by Kyle and Matt. Nityam and Yogi provided necessary instruction and guidance for subsequent data visualization and statistical learning tests. Matt and Kyle subsetted the data into positively, negatively, and overall differentialluy expressed genes. Matt and Kyle created the expression biplot for hour 0 and hour 6 expression levels to visualize differentially expressed genes. Matt and Kyle also split the data into training and testing data sets for development of a statistical learning model. From this, Matt and Kyle developed a plot and two regression lines for the differentially expressed genes from the training data. They then plotted the testing data with the training data regression lines to see how well these lines fit the testing data. Matt and Kyle then calculated the residuals and SSE for the testing observations relative to the respective training regression lines and found a relatively low SSE, indicating that the model was a good fit. + +*Files Used for Group Project* +*GSM154262.txt +*GSM154263.txt +*GSM154264.txt +*GSM154277.txt +*GSM154278.txt +*GSM154279.txt + + + +*List of individuals' names and datasets used for part 1* +*Nityam Rathi- 'bad_drivers' data set +*Yogi Raghav- 'drug_use' data set +*Kyle Coleman- 'airline-saftey' data set +*Matt Eckelmeyer- 'drug_use' data set \ No newline at end of file