nsFilter {genefilter}    R Documentation

Non-Specific Filtering of Features in an ExpressionSet

Description

This function identifies and removes features that appear to be less informative. Use cases for this function are: variable selection for subsequent sample clustering or classification tasks; independent filtering of features used in subsequent hypothesis testing, with the aim of increasing the detection rate (please see Details).

Usage

nsFilter(eset, require.entrez = TRUE, require.GOBP = FALSE, 
    require.GOCC = FALSE, require.GOMF = FALSE,
    remove.dupEntrez = TRUE, var.func = IQR, var.cutoff = 0.5, 
    var.filter = TRUE, filterByQuantile=TRUE,
    feature.exclude="^AFFX", ...)
varFilter(eset, var.func = IQR, var.cutoff = 0.5, filterByQuantile=TRUE)
featureFilter(eset, require.entrez=TRUE,
    require.GOBP=FALSE, require.GOCC=FALSE,
    require.GOMF=FALSE, remove.dupEntrez=TRUE,
    feature.exclude="^AFFX")

Arguments

eset an ExpressionSet object
require.entrez If TRUE, require that all probe sets have an Entrez Gene ID annotation. Probe sets without such an annotation will be filtered out.
require.GOBP If TRUE, require that all probe sets have an annotation to at least one GO ID in the BP ontology. Probe sets without such an annotation will be filtered out.
require.GOCC If TRUE, require that all probe sets have an annotation to at least one GO ID in the CC ontology. Probe sets without such an annotation will be filtered out.
require.GOMF If TRUE, require that all probe sets have an annotation to at least one GO ID in the MF ontology. Probe sets without such an annotation will be filtered out.
remove.dupEntrez If TRUE and there are multiple probe sets mapping to the same Entrez Gene ID, then the probe set with the largest value of var.func will be retained and the others removed.
var.func A function that will be used to assess the variance of a probe set across all samples. This function should return a numeric vector of length one when given a numeric vector as input. Probe sets with a var.func value less than var.cutoff will be removed. The default is IQR.
var.cutoff A numeric value to use in filtering out probe sets with small variance across samples. See the var.func argument and the details section below.
var.filter A logical indicating whether or not to perform variance based filtering. The default is TRUE.
filterByQuantile Logical: whether the variance-filter cutoff threshold should be interpreted as a quantile. Defaults to TRUE; if set to FALSE the cutoff value is used directly, "as is".
feature.exclude A character vector of regular expressions. Any probe set identifiers (return value of featureNames(eset)) that match one of the specified patterns will be filtered out. The default value is intended to filter out Affymetrix quality control probe sets.
... Unused, but available for specializing methods.
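
As a rough illustration of how feature.exclude patterns act on probe set identifiers, the base-R sketch below drops every identifier that matches any of the supplied patterns. The identifiers are made-up examples, and the loop is an assumption about the behaviour described above, not the package's internal code.

```r
# Hypothetical probe set identifiers, standing in for featureNames(eset)
ids <- c("AFFX-BioB-5_at", "1000_at", "1001_at", "AFFX-hum_alu_at", "31307_at")

# Same form as the feature.exclude argument: a character vector of regexes
patterns <- c("^AFFX")

# Mark an identifier for removal if it matches at least one pattern
drop <- rep(FALSE, length(ids))
for (p in patterns) {
  drop <- drop | grepl(p, ids)
}

kept <- ids[!drop]
kept
```

Further patterns can be appended to the vector; an identifier is excluded as soon as one of them matches.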

Details

Marginal type I errors: Independent filtering of features used in subsequent hypothesis testing can increase the detection rate at the same marginal type I error, as detailed in the following. Call U^1 the stage 1 filter statistic, U^2 the stage 2 test statistic for differential expression. Sufficient conditions for marginal type-I error control are:

In each of these cases, the value of U^1 for the k-th feature must depend on the data for the k-th feature only, not on any other features.

Experiment-wide type I error: Marginal type-I error control provided by the conditions above is sufficient for control of the family-wise error rate (FWER). Note, however, that common false discovery rate (FDR) methods depend not only on the marginal behaviour of the test statistics under the null hypothesis, but also on their joint distribution. The joint distribution can be affected by filtering. In practice the effect is negligible in many cases, but this depends on the dataset and the filter used, and assessing it is the responsibility of the data analyst. For a more comprehensive discussion, please see the reference (Bourgon et al.).

Annotation Based Filtering: Arguments require.entrez, require.GOBP, require.GOCC, and require.GOMF turn on a filter based on available annotation data. The annotation package is determined by calling annotation(eset).

Duplicate Probe Removal: If remove.dupEntrez=TRUE, probe sets that the annotation maps to the same Entrez Gene ID are compared, and only the probe set with the highest var.func value is retained.
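
The selection rule for duplicates can be sketched in base R as follows. The probe names, Entrez IDs and variability values are invented for illustration, and the code only mimics the rule stated above; it is not taken from the package.

```r
# Hypothetical mapping from probe sets to Entrez Gene IDs
entrez <- c(p1 = "10", p2 = "10", p3 = "20", p4 = "30", p5 = "30")

# Hypothetical per-probe variability, e.g. the value of var.func for each row
stat <- c(p1 = 0.4, p2 = 1.2, p3 = 0.7, p4 = 0.9, p5 = 0.3)

# For each gene, keep the probe set with the largest variability value
best <- vapply(
  split(names(stat), entrez),
  function(probes) probes[which.max(stat[probes])],
  character(1)
)

best
```

Here p2 wins over p1 for gene 10, and p4 wins over p5 for gene 30; p3 is the only probe set for gene 20 and is kept unchanged.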

Variance Based Filtering: The var.filter, var.func, var.cutoff and filterByQuantile arguments control numerical cutoff-based filtering. The intention is to remove uninformative probe sets, representing genes that were not expressed at all. Probes for which var.func returns NA are removed. The default var.func is IQR, which is defined as rowQ(eset, ceiling(0.75 * ncol(eset))) - rowQ(eset, floor(0.25 * ncol(eset))); this choice is motivated by the observation that unexpressed genes are detected most reliably through their low variability across samples. Additionally, IQR is robust to outliers (see note below). The default var.cutoff is 0.5 and is motivated by the rule of thumb that in many tissues only 40% of genes are expressed. Of course, if you believe in a different approach to numerical filtering you can choose another function as var.func, or turn off numerical filtering by setting var.filter=FALSE.

Note that by default the numerical-filter cutoff is interpreted as a quantile, so leaving the default values intact would filter out 50% of the genes remaining at this stage. If you prefer to set the cutoff at some absolute threshold, change the value of filterByQuantile to FALSE, and modify var.cutoff accordingly.

Note also that variance filtering is performed last, so that (if filterByQuantile=TRUE and remove.dupEntrez=TRUE) the final number of genes does indeed exclude precisely the var.cutoff fraction of unique genes remaining after all other filters were passed.
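
The difference between a quantile cutoff and an absolute cutoff can be illustrated with a small base-R sketch on simulated data. The matrix, the seed and the absolute threshold of 1.0 are arbitrary choices made for this example, and stats::IQR is used in place of the rowQ-based definition given above, so the numbers will not match the package's output exactly.

```r
set.seed(1)

# Simulated expression matrix: 200 features (rows) by 10 samples (columns)
x <- matrix(rnorm(200 * 10), nrow = 200,
            dimnames = list(paste0("probe", 1:200), NULL))

# Per-feature variability, analogous to applying var.func to each row
vars <- apply(x, 1, IQR)

# Quantile interpretation (filterByQuantile = TRUE, var.cutoff = 0.5):
# keep features whose variability exceeds the 50% quantile
cutoff <- quantile(vars, probs = 0.5)
keep_quantile <- !is.na(vars) & vars > cutoff
x_filtered <- x[keep_quantile, , drop = FALSE]

# Absolute interpretation (filterByQuantile = FALSE, var.cutoff = 1.0):
# the cutoff is compared with the variability values directly
keep_abs <- !is.na(vars) & vars > 1.0

nrow(x_filtered)   # half of the 200 features remain
sum(keep_abs)      # depends on the scale of the data
```

With the quantile interpretation the retained fraction is fixed in advance, whereas with an absolute threshold it depends on the scale and spread of the data.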

The stand-alone function varFilter does only numerical filtering, and returns an ExpressionSet. featureFilter does only feature-based filtering and duplicate removal, and returns an ExpressionSet as well. In featureFilter, duplicate removal is hard-coded to retain the highest-IQR probe for each gene.

Value

For nsFilter a list consisting of:

eset the filtered ExpressionSet
filter.log a list giving details of how many probe sets were removed for each filtering step performed.


For both varFilter and featureFilter the filtered ExpressionSet.

Note

IQR is a reasonable variance-filter choice when the dataset is split into two roughly equal and relatively homogeneous phenotype groups. If your dataset has important groups smaller than 25% of the overall sample size, or if you are interested in unusual individual-level patterns, then IQR may not be sensitive enough for your needs. In such cases, you should consider using less robust and more sensitive measures of variance (the simplest of which would be sd).
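
The following base-R sketch illustrates the point with a constructed feature in which only 2 of 20 samples (10%) carry a strong signal. The values are invented for the example; it is meant only to show why IQR can miss a small subgroup that sd picks up.

```r
# A feature measured in 20 samples with no subgroup signal
base <- rep(c(-1, 1), 10)

# The same feature, but 2 of the 20 samples (10%) are shifted upwards by 10
shifted <- base
shifted[1:2] <- shifted[1:2] + 10

# IQR ignores the small subgroup: both quartiles are unchanged
IQR(base)
IQR(shifted)

# sd reacts to the shifted samples
sd(base)
sd(shifted)
```

If such small subgroups matter, a call along the lines of nsFilter(eset, var.func = sd) swaps in the more sensitive measure.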

Author(s)

Seth Falcon (somewhat revised by Assaf Oron)

References

R. Bourgon, R. Gentleman, W. Huber, Independent filtering increases power for detecting differentially expressed genes, Technical Report.

Examples

  library("genefilter")
  library("hgu95av2.db")
  library("Biobase")
  data(sample.ExpressionSet)
  ans <- nsFilter(sample.ExpressionSet)
  ans$eset
  ans$filter.log

  ## skip variance-based filtering
  ans <- nsFilter(sample.ExpressionSet, var.filter=FALSE)

  a1 <- varFilter(sample.ExpressionSet)
  a2 <- featureFilter(sample.ExpressionSet)

[Package genefilter version 1.24.3 Index]