gsMMD.default {GeneSelectMMD}R Documentation

Gene selection based on a mixture of marginal distributions

Description

Gene selection based on the marginal distributions of gene profiles that characterized by a mixture of three-component multivariate distributions. Input is a data matrix. The function will obtain initial gene cluster membership by its own.

Usage

gsMMD.default(X, 
              memSubjects, 
              maxFlag = TRUE, 
              thrshPostProb = 0.5, 
              geneNames = NULL, 
              alpha = 0.05, 
              iniGeneMethod = "Ttest", 
              transformFlag = FALSE, 
              transformMethod = "boxcox", 
              scaleFlag = FALSE, 
              if.center = TRUE, 
              if.scale = TRUE, 
              criterion = c("cor", "skewness", "kurtosis"), 
              minL = -10, 
              maxL = 10, 
              stepL = 0.1, 
              eps = 0.001, 
              ITMAX = 100, 
              plotFlag = FALSE,
              quiet=TRUE)

Arguments

X a data matrix. The rows of the matrix are genes. The columns of the matrix are subjects.
memSubjects a vector of membership of subjects. memSubjects[i]=1 means the i-th subject belongs to diseased group, 0 otherwise.
maxFlag logical. Indicate how to assign gene class membership. maxFlag=TRUE means that a gene will be assigned to a class in which the posterior probability of the gene belongs to this class is maximum. maxFlag=FALSE means that a gene will be assigned to class 1 if the posterior probability of the gene belongs to class 1 is greater than thrshPostProb. Similarly, a gene will be assigned to class 1 if the posterior probability of the gene belongs to class 1 is greater than thrshPostProb. If the posterior probability is less than thrshPostProb, the gene will be assigned to class 2 (non-differentially expressed gene group).
thrshPostProb threshold for posterior probabilities. For example, if the posterior probability that a gene belongs to cluster 1 given its gene expression levels is larger than thrshPostProb, then this gene will be assigned to cluster 1.
geneNames an optional character vector of gene names
alpha significant level which is equal to 1-conf.level, conf.level is the argument for the function t.test.
iniGeneMethod method to get initial 3-cluster partition of genes. Available methods are: “Ttest”, “Wilcox”.
transformFlag logical. Indicate if data transformation is needed
transformMethod method for transforming data. Available methods include "boxcox", "log2", "log10", "log", "none".
scaleFlag logical. Indicate if gene profiles are to be scaled. If transformFlag=TRUE and scaleFlag=TRUE, then scaling is performed after transformation.
if.center logical. If scaleFlag=TRUE and if.center=TRUE, then each gene profile will be centered to have mean zero.
if.scale logical. If scaleFlag=TRUE and if.scale=TRUE, then each gene profile will be scaled to have variance one.
criterion if transformFlag=TRUE, criterion indicates what criterion to determine if data looks like normal. “cor” means using Pearson's correlation. The idea is that the observed quantiles after transformation should be close to theoretical normal quantiles. So we can use Pearson's correlation to check if the scatter plot of theoretical normal quantiles versus observed quantiles is a straightline. “skewness” means using skewness measure to check if the distribution of the transformed data are close to normal distribution; “kurtosis” means using kurtosis measure to check normality.
minL lower limit for the lambda parameter used in Box-Cox transformation
maxL upper limit for the lambda parameter used in Box-Cox transformation
stepL step increase when searching the optimal lambda parameter used in Box-Cox transformation
eps a small positive value. If the absolute value of a value is smaller than eps, this value is regarded as zero.
ITMAX maximum iteration allowed for iterations in the EM algorithm
plotFlag logical. Indicate if the Box-Cox normality plot should be output.
quiet logical. Indicate if intermediate results should be printed out.

Details

We assume that the distribution of gene expression profiles is a mixture of 3-component multivariate normal distributions sum_{k=1}^{3} π_k f_k(x|theta). Each component distribution f_k corresponds to a gene cluster. The 3 components correspond to 3 gene clusters: (1) up-regulated gene cluster, (2) non-differentially expressed gene cluster, and (3) down-regulated gene cluster. The model parameter vector is theta=(π_1, π_2, π_3, μ_{c1}, σ^2_{c1}, rho_{c1}, μ_{n1}, σ^2_{n1}, rho_{n1}, μ_2, σ^2_2, rho_2, μ_{c3}, σ^2_{c3}, rho_{c3}, μ_{n3}, σ^2_{n3}, rho_{n3}. where π_1, π_2, and π_3 are the mixing proportions; μ_{c1}, σ^2_{c1}, and rho_{c1} are the marginal mean, variance, and correlation of gene expression levels of cluster 1 (up-regulated genes) for diseased subjects; μ_{n1}, σ^2_{n1}, and rho_{n1} are the marginal mean, variance, and correlation of gene expression levels of cluster 1 (up-regulated genes) for non-diseased subjects; μ_2, σ^2_2, and rho_2 are the marginal mean, variance, and correlation of gene expression levels of cluster 2 (non-differentially expressed genes); μ_{c3}, σ^2_{c3}, and rho_{c3} are the marginal mean, variance, and correlation of gene expression levels of cluster 3 (up-regulated genes) for diseased subjects; μ_{n3}, σ^2_{n3}, and rho_{n3} are the marginal mean, variance, and correlation of gene expression levels of cluster 3 (up-regulated genes) for non-diseased subjects.

Note that genes in cluster 2 are non-differentially expressed across abnormal and normal tissue samples. Hence there are only 3 parameters for cluster 2.

We apply the EM algorithm to estimate the model parameters. We regard the cluster membership of genes as missing values.

Value

A list contains 14 elements.

dat the (transformed) microarray data matrix. If tranformation performed, then dat will be different from the input microarray data matrix.
memSubjects the same as the input memSubjects.
memGenes a vector of cluster membership of genes. 1 means up-regulated gene; 2 means non-differentially expressed gene; 3 means down-regulated gene.
memGenes2 an variant of the vector of cluster membership of genes. 1 means differentially expressed gene; 0 means non-differentially expressed gene.
para parameter estimates (c.f. details).
llkh value of the loglikelihood function.
wiMat posterior probability that a gene belongs to a cluster given the expression levels of this gene. Column i is for cluster i.
wiArray posterior probability matrix for different initial gene selection methods.
memIniMat a matrix of initial cluster membership of genes.
paraIniMat a matrix of parameter estimates based on initial gene cluster membership.
llkhIniVec a vector of values of loglikelihood function.
memMat a matrix of cluster membership of genes based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership.
paraMat a matrix of parameter estimates based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership.
llkhVec a vector of values of loglikelihood function based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership.
lambda the parameter used to do Box-Cox transformation

Note

The speed of the program is slow for large data sets.

Author(s)

Weiliang Qiu stwxq@channing.harvard.edu, Wenqing He whe@stats.uwo.ca, Xiaogang Wang stevenw@mathstat.yorku.ca, Ross Lazarus ross.lazarus@channing.harvard.edu

References

Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008). A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20

See Also

gsMMD, gsMMD2, gsMMD2.default

Examples

  library(ALL)
  data(ALL)
  eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"]
  mat <- exprs(eSet1)
  
  mem.str <- as.character(eSet1$BT)
  nSubjects <- length(mem.str)
  memSubjects <- rep(0,nSubjects)
  # B3 coded as 0, T2 coded as 1
  memSubjects[mem.str == "T2"] <- 1
 
  obj.gsMMD <- gsMMD.default(mat, memSubjects, iniGeneMethod = "Ttest",
          transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE)
  round(obj.gsMMD$para, 3)

[Package GeneSelectMMD version 1.0.0 Index]