Every gene expression study has an underlying question that the experimenter tries to address. Before proceeding to any downstream analysis, the researcher has to explore the gene expression data to understand the genetic, environmental, population, technical and confounding factors that could affect the gene expression values. This is needed to ensure that the contribution of non-biological variables to any observed biological signal is accounted for. There is no ‘one size fits all’ data analysis approach that works for every experiment. This workflow has been designed with this in mind, mainly using analysis methods such as supervised normalization of microarrays (SNM), surrogate variable analysis (SVA) and principal variance component analysis (PVCA), amongst others.
The data are based on the paper by Kim et al. (Gene expression profiles associated with acute myocardial infarction and risk of cardiovascular death, Genome Medicine 2014, 6:40, http://genomemedicine.com/content/6/5/40).
Link to the GEO dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49925
The workflow package contains five functions that are used throughout this workflow: expSetobj, pvcAnaly, surVarAnaly, conTocat and snmAnaly.
The workflow runs on these two data files:
1. CAD_Expression.csv - file containing the gene expression values (10,000 probes across 100 samples)
2. CAD_Exptdsgn.csv - file containing the phenotypic information (100 samples across 17 covariates)
1. File containing the gene expression values (log transformed)
Format - Features x Samples (where the features could be probes in the case of microarrays), preferably a comma-separated file (.csv)
2. File containing the covariates
Covariates are the phenotypes used to describe each sample, ranging from age, ethnicity and height to something like smoking status
Format - Samples x Covariates, preferably a comma-separated file (.csv)
Take a look at the sample data to get an idea of how the files should be structured.
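The two layouts described above can be sketched with a toy example. This is only an illustration: the object names (`exprs_toy`, `covrts_toy`), the probe and sample identifiers and all values are made up, and the check at the end is a sanity test we assume is useful before building an ExpressionSet, not a step of the workflow package.

```r
## Toy stand-ins for the two input files (all names and values are hypothetical)
exprs_toy <- matrix(
  c(14.1, 13.9, 13.7,
    13.8, 13.9, 14.3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("probe_1", "probe_2"),                 # features in rows
                  c("sample_A", "sample_B", "sample_C"))   # samples in columns
)
covrts_toy <- data.frame(
  Age    = c(56, 54, 51),
  Gender = c("FEM", "FEM", "MAL"),
  row.names = c("sample_A", "sample_B", "sample_C")        # samples in rows
)

## The sample identifiers should agree, in the same order, between the two tables
same_samples <- identical(colnames(exprs_toy), rownames(covrts_toy))
same_samples
```

If the identifiers do not line up, reordering the covariate rows to match the expression columns before going further is usually the simplest remedy.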
## Enter the full path to the gene expression file at the prompt
expData_file_name <- scan(what=" ", sep="\n")
## Reading the file containing the gene expression values
exprs <- read.table(expData_file_name, header=TRUE, sep=",", row.names=1, as.is=TRUE)
## Enter the full path to the experimental design file at the prompt
expDesign_file_name <- scan(what=" ", sep="\n")
## Reading the file containing the covariates
covrts <- read.table(expDesign_file_name, row.names=1, header=TRUE, sep=",")
library(ExpressionNormalizationWorkflow) ## loading the workflow package
library(Biobase) ## provides the ExpressionSet class and pData()
exprs[1:5, 1:5] ## first five probes and samples of the expression data
##              EUH02661 EUH01927 EUH02357 CLH00229 EUH02482
## ILMN_2389211 14.09141 13.89190 13.74701 14.05587 14.24310
## ILMN_1667796 13.87223 13.90267 14.27251 14.27583 14.12457
## ILMN_3254322 13.66450 13.83893 13.38150 13.63170 14.23808
## ILMN_1683271 13.41858 13.66475 13.18040 13.48734 14.25162
## ILMN_2100437 14.23668 13.63830 14.15963 14.00426 13.94484
covrts[1:5, 1:10] ## first five samples and ten covariates of the design file
##          Study Array Rin Gender Ethn  BMI CVD_TYPE Age Height Weight
## EUH02661     A     1 9.2    FEM  CAU 31.0    ACUTE  56   1.73   88.9
## EUH01927     A     2 9.1    FEM  CAU 26.9    ACUTE  54   1.57   66.7
## EUH02357     A     3 9.1    MAL  CAU 20.7    ACUTE  51   1.75   63.5
## CLH00229     A     4 9.3    MAL  CAU 24.4    ACUTE  52   1.83   81.5
## EUH02482     A     5 9.3    MAL  CAU 26.1    ACUTE  62   1.78   82.6
An ExpressionSet[1] is a Biobase data structure that conveniently stores experimental data and the associated metadata in one place. Here an object of the ExpressionSet class is created to store the gene expression values and the phenotype data.
inpData <- expSetobj(exprs, covrts)
PVCA estimates the variance in the expression dataset due to each of the given covariates and attributes the remaining fraction to the residual. It combines principal component analysis (PCA), which reduces the feature space, with variance components analysis (VCA), which fits a mixed linear model using the factors of interest as random effects to estimate and partition the total variability. Here PVCA is used to estimate the variance due to CAD covariates such as BMI, Ethn, CAD and Rin, amongst others.
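The PCA half of this idea can be sketched in base R. The snippet below is a minimal illustration on simulated data, assuming that the threshold works as a cumulative proportion of explained variance; the names `toy`, `pct_thrsh_toy` and `n_pcs` are hypothetical, and the selection rule inside the pvca package may differ in detail.

```r
set.seed(1)
## Simulated expression matrix: 50 features x 20 samples (made-up data)
toy <- matrix(rnorm(50 * 20), nrow = 50)

## PCA on the samples; prcomp expects observations in rows, so transpose
pca <- prcomp(t(toy), center = TRUE, scale. = FALSE)

## Cumulative proportion of variance explained by the leading components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

## Smallest number of components whose cumulative proportion reaches the threshold
pct_thrsh_toy <- 0.6
n_pcs <- which(cum_var >= pct_thrsh_toy)[1]
n_pcs
```

A higher threshold keeps more components, so more of the total variability enters the variance components step at the cost of more model fits.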
## Setting the covariates whose effect size on the data needs to be calculated
cvrts_eff_var <- c("BMI", "Rin", "Ethn", "CAD", "Study")
## PVCA threshold - a value between 0 and 1: the minimum proportion of the total
## variability that the selected principal components need to explain
pct_thrsh <- 0.6
pvcAnaly(inpData, pct_thrsh, cvrts_eff_var) ## PVCA
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.002798892 0.02987222 0.1169654 0.01854667 0.3590796 0.4727372
##
## $label
## [1] "BMI" "Rin" "Ethn" "CAD" "Study" "resid"
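The output above is easier to read once each proportion is paired with its label and expressed as a percentage. The following sketch simply retypes the numbers printed above into a list with the same `$dat` and `$label` layout; `pvca_out` and `pct` are hypothetical names used for illustration.

```r
## Values copied from the PVCA output shown above
pvca_out <- list(
  dat   = matrix(c(0.002798892, 0.02987222, 0.1169654,
                   0.01854667, 0.3590796, 0.4727372), nrow = 1),
  label = c("BMI", "Rin", "Ethn", "CAD", "Study", "resid")
)

## Pair each proportion with its covariate and convert to percentages
pct <- setNames(round(100 * as.vector(pvca_out$dat), 1), pvca_out$label)
sort(pct, decreasing = TRUE)
```

Read this way, Study accounts for about 35.9% of the variance and the residual for about 47.3%.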
Surrogate variables are covariates constructed directly from high-dimensional data (gene expression or RNA-Seq data) that can be used in subsequent analyses to adjust for unknown or unmodeled covariates and latent sources of noise. The user provides a biological variable, and the surrogate variables are generated with respect to it. These are then appended as new covariates to the existing list of covariates.
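The intuition behind surrogate variables can be sketched in base R: remove what the biological variable explains, then look for the dominant pattern left in the residuals. This is a deliberately simplified illustration on simulated data, not the algorithm of the sva package (which is more elaborate than this); `group`, `batch`, `Y` and `sv1_toy` are hypothetical names.

```r
set.seed(42)
n_genes <- 200; n_samples <- 40
group <- rep(c(0, 1), each = 20)    # known biological variable (hypothetical)
batch <- rep(c(0, 1), times = 20)   # hidden factor, not given to the analysis

## Simulated log-expression: noise + group effect on 50 genes + batch effect on 100 genes
Y <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
Y[1:50, ]    <- Y[1:50, ]    + outer(rep(1, 50), group)
Y[101:200, ] <- Y[101:200, ] + outer(rep(2, 100), batch)

## Step 1: remove the part of each gene explained by the biological variable
res <- residuals(lm(t(Y) ~ group))   # samples x genes matrix of residuals

## Step 2: the leading left singular vector of the residuals is a candidate surrogate variable
sv1_toy <- svd(res)$u[, 1]

## The recovered variable tracks the hidden batch (its sign is arbitrary)
abs(cor(sv1_toy, batch))
```

Because the hidden factor dominates the residual variation in this simulation, the recovered vector is strongly correlated with it even though `batch` was never supplied to the model.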
biol_var_sva <- "CAD" ## Choosing a biological variable that is to be used to calculate the surrogate variables
sur_var_obj <- surVarAnaly(inpData, biol_var_sva) ## SVA
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
inpData_sv <- sur_var_obj$expSetobject
var_names <- c("sv1", "sv2") ## sv1 and sv2 are the newly generated surrogate variables
pData(inpData_sv)<-conTocat(pData(inpData_sv), var_names) ## discretizing the continuous surrogate variables
View(pData(inpData_sv))
## The SVs and the categorized SVs are appended to the covariate matrix as additional columns
The surrogate variables are categorized because it is more convenient to run PVCA with categorical variables than with continuous ones.
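Discretizing a continuous variable can be sketched with base R's `cut`. The binning rule used by conTocat is not stated here, so the tertile split below is purely an assumption for illustration, and `sv_toy` and `sv_toy_cat` are hypothetical names.

```r
set.seed(7)
sv_toy <- rnorm(100)   # stand-in for a continuous surrogate variable

## Cut at the tertiles so that each category holds roughly a third of the samples
brks <- quantile(sv_toy, probs = c(0, 1/3, 2/3, 1))
sv_toy_cat <- cut(sv_toy, breaks = brks, include.lowest = TRUE,
                  labels = c("low", "mid", "high"))
table(sv_toy_cat)
```

The resulting factor can then be treated like any other categorical covariate.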
This helps the researcher understand whether a newly identified surrogate variable is entirely independent of all the existing covariates, and hence a new covariate that classifies the samples in its own way, or whether it is a manifestation or a function of one or more existing covariates. Surrogate variables may not be a significant threat if they are associated with existing covariates, but they could become a very important factor if they are totally independent of the others. For this purpose a generalized linear model is run in which the identified surrogate variables are modelled as a function of the existing covariates:
sv1 ~ CAD+Ethn+Study+BMI+Rin
sv2 ~ CAD+Ethn+Study+BMI+Rin
glm.sv1 <- glm(pData(inpData_sv)[, "sv1"]~pData(inpData_sv)[, "Ethn"]+pData(inpData_sv)[, "BMI"]+pData(inpData_sv)[, "Rin"]
+pData(inpData_sv)[, "CAD"]+pData(inpData_sv)[, "Study"]) ## Fitting a generalized linear model
summary(glm.sv1)
##
## Call:
## glm(formula = pData(inpData_sv)[, "sv1"] ~ pData(inpData_sv)[,
## "Ethn"] + pData(inpData_sv)[, "BMI"] + pData(inpData_sv)[,
## "Rin"] + pData(inpData_sv)[, "CAD"] + pData(inpData_sv)[,
## "Study"])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.252708 -0.047333 0.001098 0.046136 0.185257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.949e-01 1.285e-01 1.517 0.133
## pData(inpData_sv)[, "Ethn"]CAU -1.730e-02 6.759e-02 -0.256 0.799
## pData(inpData_sv)[, "Ethn"]SAS 2.166e-02 1.156e-01 0.187 0.852
## pData(inpData_sv)[, "BMI"] 6.293e-05 1.438e-03 0.044 0.965
## pData(inpData_sv)[, "Rin"] -1.434e-02 1.153e-02 -1.243 0.217
## pData(inpData_sv)[, "CAD"] -2.963e-02 2.725e-02 -1.087 0.280
## pData(inpData_sv)[, "Study"]B -1.094e-01 2.354e-02 -4.647 1.11e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.008410679)
##
## Null deviance: 1.00000 on 99 degrees of freedom
## Residual deviance: 0.78219 on 93 degrees of freedom
## AIC: -185.29
##
## Number of Fisher Scoring iterations: 2
glm.sv2 <- glm(pData(inpData_sv)[, "sv2"]~pData(inpData_sv)[, "Ethn"]+pData(inpData_sv)[, "BMI"]+pData(inpData_sv)[, "Rin"]
+pData(inpData_sv)[, "CAD"]+pData(inpData_sv)[, "Study"]) ## Fitting a generalized linear model
summary(glm.sv2)
##
## Call:
## glm(formula = pData(inpData_sv)[, "sv2"] ~ pData(inpData_sv)[,
## "Ethn"] + pData(inpData_sv)[, "BMI"] + pData(inpData_sv)[,
## "Rin"] + pData(inpData_sv)[, "CAD"] + pData(inpData_sv)[,
## "Study"])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.20412 -0.06335 -0.02316 0.07705 0.27643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2333118 0.1402463 -1.664 0.0996 .
## pData(inpData_sv)[, "Ethn"]CAU 0.0920587 0.0737685 1.248 0.2152
## pData(inpData_sv)[, "Ethn"]SAS 0.2449827 0.1261321 1.942 0.0551 .
## pData(inpData_sv)[, "BMI"] -0.0007588 0.0015691 -0.484 0.6298
## pData(inpData_sv)[, "Rin"] 0.0192423 0.0125841 1.529 0.1296
## pData(inpData_sv)[, "CAD"] -0.0199192 0.0297423 -0.670 0.5047
## pData(inpData_sv)[, "Study"]B 0.0179597 0.0256947 0.699 0.4863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.01001941)
##
## Null deviance: 1.0000 on 99 degrees of freedom
## Residual deviance: 0.9318 on 93 degrees of freedom
## AIC: -167.79
##
## Number of Fisher Scoring iterations: 2
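In the two summaries above, only the Study term for sv1 has a clearly significant coefficient (p = 1.11e-05); no term reaches the 0.05 level for sv2. The same kind of model can be written more compactly with column names and a `data` argument, and the significant terms can be pulled out of the coefficient table. The sketch below runs on a made-up data frame (`toy_cov`, with a hypothetical `sv_toy` column that depends on Study by construction), so it illustrates the mechanics only.

```r
set.seed(3)
## Made-up covariate table: sv_toy depends on Study only
toy_cov <- data.frame(
  Study = factor(rep(c("A", "B"), each = 50)),
  BMI   = rnorm(100, mean = 27, sd = 4)
)
toy_cov$sv_toy <- 0.5 * (toy_cov$Study == "B") + rnorm(100, sd = 0.1)

## Same kind of gaussian GLM, written with column names and a data argument
fit <- glm(sv_toy ~ Study + BMI, data = toy_cov)

## Coefficient table: estimate, standard error, t value and p-value per term
coefs <- summary(fit)$coefficients
signif_terms <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
signif_terms
```

With the real data, the equivalent call would pass `pData(inpData_sv)` as the `data` argument.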
The following PVCA step compares the effect sizes of the selected covariates, including the newly identified surrogate variables.
## Setting the covariates whose effect size on the data needs to be calculated
cvrts_eff_var <- c("BMI", "Ethn", "Rin", "CAD", "sv1_cat", "sv2_cat", "Study")
## PVCA threshold - a value between 0 and 1: the minimum proportion of the total
## variability that the selected principal components need to explain
pct_thrsh <- 0.6
pvcAnaly(inpData_sv, pct_thrsh, cvrts_eff_var) ## PVCA
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.005339061 0.02649079 0.01590303 0.07020946 0.1287428 0.3662912
## [,7] [,8]
## [1,] 0.01782403 0.3691996
##
## $label
## [1] "BMI" "Rin" "sv2_cat" "sv1_cat" "Ethn" "Study" "CAD"
## [8] "resid"
SNM is a study-specific, customizable normalization approach that accounts for all known biological, adjustment and technical variables. It is very effective in preserving the biological signals while trying to minimize the effects of various technical confounders.
Choose the biological variables, the adjustment variables and the intensity-dependent variables carefully, based on the effect sizes seen in PVCA. Here we remove the effects of the Study, sv1 and sv2 covariates.
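The distinction between a biological variable (kept) and an adjustment variable (removed) can be illustrated with an ordinary linear model on a single simulated probe. This is not the SNM algorithm, which iterates and also handles intensity-dependent effects; it is only a minimal sketch of the adjustment idea, and `cad_toy`, `study_toy` and `y` are hypothetical names.

```r
set.seed(11)
n <- 40
cad_toy   <- rep(c(0, 1), each = 20)    # biological variable (to preserve)
study_toy <- rep(c(0, 1), times = 20)   # adjustment variable (to remove)

## One simulated probe: biological effect 1, study effect 2, plus noise
y <- 10 + 1 * cad_toy + 2 * study_toy + rnorm(n, sd = 0.5)

## Fit both variables together, then subtract only the fitted study effect
fit_full <- lm(y ~ cad_toy + study_toy)
y_adj <- y - coef(fit_full)[["study_toy"]] * study_toy

## Refit on the adjusted values: the study effect is gone, the biological effect is unchanged
fit_adj <- lm(y_adj ~ cad_toy + study_toy)
round(coef(fit_adj), 3)
```

Fitting both variables jointly matters: estimating the study effect on its own could absorb part of the biological signal when the two are correlated.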
bv <- c("CAD") ## Choose your biological variable covariates
av <- c("Study", "sv1_cat", "sv2_cat") ## Choose your adjustment variable covariates
iv <- c("Array") ## Choose your intensity-dependent adjustment variables
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv) ## SNM
##
Iteration: 1
##
Iteration: 2
##
Iteration: 3
##
Iteration: 4
##
Iteration: 5
sv_snmNorm_data <- sv_snmObj$norm.dat
colnames(sv_snmNorm_data) <- colnames(exprs)
sv_snm_data <- expSetobj(sv_snmNorm_data, pData(inpData_sv)) ## Creating an ExpressionSet object of the normalized
## data along with the covariates
This post-SNM PVCA shows how the effect sizes of the covariates are distributed after removing the effects of the adjustment variables. SNM reduces their effects as far as possible while trying to preserve the variation due to the biological signals.
## Setting the covariates whose effect size on the data needs to be calculated
cvrts_eff_var <- c("BMI", "Ethn", "Rin", "CAD", "sv1_cat", "sv2_cat", "Study")
## PVCA threshold - a value between 0 and 1: the minimum proportion of the total
## variability that the selected principal components need to explain
pct_thrsh <- 0.6
pvcAnaly(sv_snm_data, pct_thrsh, cvrts_eff_var) ## PVCA
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.03526227 0.06639305 0.005242521 0.001186719 0.1391422 0.001658285
## [,7] [,8]
## [1,] 0.03721947 0.7138955
##
## $label
## [1] "BMI" "Rin" "sv2_cat" "sv1_cat" "Ethn" "Study" "CAD"
## [8] "resid"
Through this approach we have reduced the effect of the Study covariate and the surrogate variables and increased the residual to 71.4%.
These data are now suitable for further downstream analysis.
This workflow presents an expression normalization framework and its integrated functions, which can be customized based on the experiment's requirements to give the best possible normalized data. The sample workflow aims to analyze the gene expression data used in the Coronary Artery Disease study (http://genomemedicine.com/content/7/1/26), determine its validity for downstream analysis and normalize it if necessary.
CAD_Expression.csv contains Illumina HT-12 gene expression values for 10,000 probes across 100 samples, while CAD_Exptdsgn.csv contains phenotype information for 17 covariates across 100 samples. We observed that the expression data have a strong batch effect, which we identify, study and remove using this workflow. Study type (the batch effect covariate) contributes around 36% of the total variance in expression, as can be seen from the PVCA (figure 1). This effect has to be removed to obtain clean, processable data. There are different possible paths to normalize the raw data before proceeding downstream.
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.002798892 0.02987222 0.1169654 0.01854667 0.3590796 0.4727372
##
## $label
## [1] "BMI" "Rin" "Ethn" "CAD" "Study" "resid"
##
Iteration: 1
##
Iteration: 2
##
Iteration: 3
##
Iteration: 4
##
Iteration: 5
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0106514 0.0674592 0.1643429 0.03554343 0.009828584 0.7121744
##
## $label
## [1] "BMI" "Rin" "Ethn" "CAD" "Study" "resid"
cvrts_eff_var <- c("BMI", "Ethn", "Rin", "CAD", "sv1_cat", "sv2_cat", "Study")
pct_thrsh <- 0.6
pvcAnaly(sv_snm_data, pct_thrsh, cvrts_eff_var)
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.03526227 0.06639305 0.005242521 0.001186719 0.1391422 0.001658285
## [,7] [,8]
## [1,] 0.03721947 0.7138955
##
## $label
## [1] "BMI" "Rin" "sv2_cat" "sv1_cat" "Ethn" "Study" "CAD"
## [8] "resid"
We are able to reduce the effect size of the Study covariate from 36.6% to 0.2%, sv1 from 7% to 0.1% and sv2 from 1.6% to 0.5%, and simultaneously increase the residual from 36.9% to 71.4%, which is a big jump, hence effectively retrieving the lost biological signals.
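The before and after figures can be checked directly against the two PVCA outputs that include the surrogate variables. The sketch below retypes those printed values into plain vectors; `before`, `after` and `cmp` are hypothetical names used only for this comparison.

```r
## Values copied from the PVCA outputs (with surrogate variables) before and after SNM
labels <- c("BMI", "Rin", "sv2_cat", "sv1_cat", "Ethn", "Study", "CAD", "resid")
before <- c(0.005339061, 0.02649079, 0.01590303, 0.07020946,
            0.1287428, 0.3662912, 0.01782403, 0.3691996)
after  <- c(0.03526227, 0.06639305, 0.005242521, 0.001186719,
            0.1391422, 0.001658285, 0.03721947, 0.7138955)

## Side-by-side table of percentages, one row per covariate
cmp <- data.frame(before_pct = round(100 * before, 1),
                  after_pct  = round(100 * after, 1),
                  row.names  = labels)
cmp
```

Rounded to one decimal place, Study falls from 36.6% to 0.2% and the residual rises from 36.9% to 71.4%.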
With these clean data we can now proceed to any downstream analysis. These different analysis paths are possible exploratory steps for the researcher to remove the noise in the data and enhance the ability to detect the underlying biological signal.
[1] Falcon et al. An Introduction to Bioconductor’s ExpressionSet Class, 2006.
[2] Bushel P (2013). pvca: Principal Variance Component Analysis (PVCA). R package version 1.6.0.
[3] Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE and Storey JD. sva: Surrogate Variable Analysis. R package version 3.12.0.
[4] Mecham BH, Nelson PS and Storey JD (2010). “Supervised normalization of microarrays.” Bioinformatics, 26, pp. 1308-1315.