Quality Control for spotted arrays
April 19, 2007
Agnes Paquet1, Andrea Barczak1, (Jean) Yee Hwa Yang2
1. Department of Medicine, Functional Genomics Core Facility,
University of California, San Francisco
paquetagnes@yahoo.com
2. School of Mathematics and Statistics, University of Sydney, Australia
http://arrays.ucsf.edu/analysis/arrayquality.html
1. Introduction to arrayQuality
ArrayQuality is an R package, available as part of Bioconductor,
designed to help assess the quality of spotted array experiments at
several stages of the microarray lifecycle. It provides reports
containing several plots and statistical measures that can help you
determine if your hybridizations and slides are of good quality. More
information about Bioconductor is available at http://www.bioconductor.org.
This guide provides an introduction to microarray quality and a
description of the main functionalities of the package. A full
description of the package is given by the individual function help
documents available from the R online help system. To access the online
help, type help(package="arrayQuality") at the R
prompt or else start the html help system using help.start()
or the Windows drop-down help menu.
2. Installing arrayQuality
Requirements
ArrayQuality is a library for the R project, part of Bioconductor. You
will need to have R installed on your computer before installing
arrayQuality. For more information about R, see the R project at http://www.r-project.org.
ArrayQuality can work on different files at the same time, ONLY
if they are from the SAME print-run (same GAL file). If you
want to generate quality reports of slides from different print-runs,
you need to place them in different folders, one for each print-run.
Installing arrayQuality
ArrayQuality can be installed either from Bioconductor or from
the functional Genomics Core Facility web site at
http://arrays.ucsf.edu/software. The version from Bioconductor is
updated every 6 months. If you would like to use a more recent version,
you can obtain the latest one from http://arrays.ucsf.edu/software or
from the developmental version of Bioconductor.
- Automatic installation from Bioconductor
This is the easiest way to install arrayQuality, as other Bioconductor packages required to use arrayQuality should be automatically installed.
Start R on your computer and make sure you are connected to the
internet. At the R prompt, type:
source("http://www.bioconductor.org/biocLite.R")
biocLite("arrayQuality")
If you are using RGUI (Windows):
In the R menu, click on "Packages". In the drop-down menu, select
"Install package(s) from Bioconductor". Then, select arrayQuality in
the list and click "OK".
- Automatic installation from the Core Facility web site
Start R on your computer and make sure you are connected to the
internet.
You can install the latest version directly from the Functional
Genomics Core Facility web site. At the R prompt, type:
install.packages("arrayQuality",
contriburl="http://arrays.ucsf.edu/software")
If you do not have privileges on your computer to write to the R
library directory, for example if you are using a shared unix machine
and you are not superuser, you may need to type instead
install.packages("arrayQuality",
contriburl="http://arrays.ucsf.edu/software", lib="myRlibdir")
- To load arrayQuality:
Type at the R prompt:
library(arrayQuality)
- Manual installation
Download arrayQuality as a .zip file (Windows users) from Bioconductor
or our web site. Start R on your computer and make sure you are
connected to the internet.
In the R console, click on "Packages".
In the drop-down menu, select "Install package(s) from local zip
file..."
Browse to the zip file you would like to install, and click "Open".
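Whichever installation route you use, it can be useful to confirm that R can find the package before going further. The following is a minimal sketch using only base R; the list of companion packages is taken from the names mentioned in this guide, and the message text is our own.

```r
# Check whether arrayQuality and the companion packages named in this guide
# can be found in the current library paths, without attaching them.
pkgs <- c("arrayQuality", "marray", "limma", "convert", "hexbin")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Attach arrayQuality only if it is actually available.
if (installed[["arrayQuality"]]) {
  library(arrayQuality)
} else {
  message("arrayQuality is not installed; see the installation steps above.")
}
```

Any package reported as FALSE still needs to be installed before the quality functions can be used.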
3. Quick starting guide to arrayQuality
General hybridization quality
This component of the package is aimed at verifying the performance of
your hybridization, given the good quality of the slide, before any
preprocessing steps or further quality assessment on individual spots.
Our package provides two kinds of quality control plots:
- A qualitative quality control measurement as a diagnostic plot.
It is a quick visual way to determine hybridization quality gathering
information from several statistical tools.
- A more quantitative comparison of slide quality as a comparative
boxplot. We extract some statistical measures from the test slide and
we compare them against results of “good quality” slides to assess the
quality of the hybridization.
Diagnostic plots can be generated directly from the output of the
GenePix (.gpr files), Spot (.spot files) and Agilent image processing
software packages.
Most arguments can be customized to match your own data: which probes
are used as controls, which columns of the image processing output file
are used to define your spot types... You can also specify your own
collection of good quality slides. For more details, please refer to
other sections of this manual.
To generate quality plots from image processing output files
We provide 3 main functions to generate quality plots:
gpQuality(), spotQuality() and agQuality().
We will use gpQuality as an example, but the following can be directly
applied to spotQuality or agQuality.
- Create a directory and move the image processing output files
(e.g. .gpr files) of the slides of interest to this directory. Make
sure that all files in the directory come from the SAME print-run (same
GAL file).
- Start R, and change R working directory to the one you have just
created. In the R menu, select File, then click on “Change dir…”.
Browse to your directory from the pop-up window, or enter it manually,
and click OK. To double check that you are in the correct directory: in
the File menu, click on “Display file(s)…”.
- To load the package in your R session: type
library(arrayQuality)
If needed, you may have to install other required packages like marray,
limma, convert and hexbin.
- To generate both diagnostic plots and comparative boxplots
on all files in the directory, type:
result <- gpQuality(organism="Mm")
- To generate diagnostic plots only, run:
result <- gpQuality(organism="Mm",
compBoxplot=FALSE)
In this case, quantitative quality measures will not be calculated and
the HTML report will not be generated.
- To write down your quantitative quality measures and your
normalized data to a file: set output = TRUE when calling gpQuality:
result <- gpQuality(organism="Mm",
output=TRUE)
- By default, arrayQuality uses print-tip loess normalization. If
you prefer to use another method, you can specify it in the norm
argument:
result <- gpQuality(norm="none")
For more details about normalization methods, please refer to marray
package help.
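The steps above can also be carried out from the R prompt instead of the menus. This is a sketch under stated assumptions: `slide_dir` is a hypothetical folder name (a temporary directory stands in for it here so the snippet runs anywhere), and gpQuality() is only called when the package is installed and .gpr files are present. Only arguments described in this guide are used.

```r
# 'slide_dir' is a hypothetical folder holding the .gpr files of ONE print-run.
# A temporary directory is used here as a stand-in.
slide_dir <- tempdir()
old_dir <- setwd(slide_dir)

# Confirm the working directory and list the GenePix files found in it.
print(getwd())
gpr_files <- list.files(pattern = "\\.gpr$", ignore.case = TRUE)
print(gpr_files)

# Only call gpQuality() when the package is installed and there is data.
if (length(gpr_files) > 0 && requireNamespace("arrayQuality", quietly = TRUE)) {
  library(arrayQuality)
  result <- gpQuality(organism = "Mm", output = TRUE)
} else {
  message("No .gpr files found (or arrayQuality not installed); nothing to do.")
}

setwd(old_dir)  # restore the previous working directory
```

Listing the files first is a cheap way to catch slides from a different print-run that were copied into the folder by mistake.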
To generate quality plots from marrayRaw or RGList objects
You can use the function maQualityPlots()
to generate diagnostic plots directly from your R object.
- To generate diagnostic plots: if rawdata is a marrayRaw or RGList
object, type in R:
maQualityPlots(rawdata)
Results
gpQuality, spotQuality and agQuality output:
- In the working directory:
  - two plots for each test slide (a diagnostic plot and a comparative
  boxplot)
  - an HTML quality report
- In the R global environment, the function returns a list of two
objects:
  - A marrayRaw object describing all tested slides
  - A quality measures matrix: this matrix contains all comparison
  measures values extracted for each test slide. Each column of the
  matrix represents a different slide.
Print-run quality control
This component of arrayQuality provides diagnostic plots for 9-mers
hybridization and Quality Control hybridization.
9-mers analysis: PRv9mers()
This type of hybridization, which we term “9mers hyb”, uses small
oligonucleotides (random 9-mers) which will hybridize to each probe on
the arrays. This will help to determine the quality of spot morphology
as well as the presence or absence of spotted oligonucleotides. The
resulting data will be used to create a list of all missing spots.
In the package, the graphical function to assess 9mers hybridization
quality is PRv9mers(). It runs using one single command line script.
To generate diagnostic plots
- Copy all 9-mers hybridizations gpr files from the SAME print-run
(same GAL file) to a directory.
- Start R and change R working directory to the one containing your
gpr files (see above for more details).
- Load the package in your R session:
library(arrayQuality)
If needed, you may have to install other required packages like marray,
limma, convert and hexbin.
- To generate diagnostic plots, type:
PRv9mers(prname="12Mm")
The prname argument represents the name of your print-run. For more
details about other arguments, please refer to the online manual.
Results
PRv9mers() provides the following results:
- Diagnostic plots as image in .png format for each tested slide
- An Excel file (typically named 9Mm9mer.xls, where 9Mm is the name
of your print-run, as passed to prname) containing for each spot on the
slide:
- Name and ID of the spot
- The probability of being present or absent (p from EM
algorithm). If several files are tested together, you will have a
probability of being present/absent for each file. A spot is considered
absent if p < 0.5.
- The average probability of being present or absent.
- The raw signal intensity (Signal column) or average raw
signal intensity if several files are tested together for each spot.
- An Excel file (typically named 9MmMissing.xls, where 9Mm is the
name of your print-run, as passed to prname) containing information on
missing probes only:
- Name and ID of the spot
- The probability of being present or absent (p from EM
algorithm). If several files are tested together, you will have a
probability of being present/absent for each file. A spot is considered
absent if p < 0.5.
- The average probability of being present or absent.
- The raw signal intensity (Signal column) or average raw
signal intensity if several files are tested together for each spot.
- A text file (typically named 9MmQuickList.txt, where 9Mm is the
name of your print-run, as passed to prname) containing the missing
probe IDs, each on a separate line. This file can be opened in any
word processing program and can also be used as a "quick list" in Acuity
software (http://www.axon.com/gn_Acuity.html).
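To make the p < 0.5 rule concrete, here is a small base R sketch of how per-file probabilities could be averaged and turned into a list of missing probe IDs. The probe names, probabilities and output file name are made up for illustration, and applying the threshold to the average probability is our assumption, not necessarily what PRv9mers() does internally.

```r
# Hypothetical per-file probabilities of a spot being present (one column
# per 9-mers slide); the IDs and values are made up for illustration.
p_present <- matrix(
  c(0.98, 0.95,
    0.10, 0.30,
    0.70, 0.20,
    0.99, 0.97),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("probe_A", "probe_B", "probe_C", "probe_D"),
                  c("slide1", "slide2"))
)

# Average the probability across files, then apply the p < 0.5 rule
# (assumption: the rule is applied to the average).
avg_p <- rowMeans(p_present)
missing_ids <- names(avg_p)[avg_p < 0.5]
print(missing_ids)

# Write one ID per line, mirroring the layout of the quick list file.
quick_list <- file.path(tempdir(), "exampleQuickList.txt")
writeLines(missing_ids, quick_list)
```

A plain one-ID-per-line text file of this kind is easy to reuse later, for example to down-weight the corresponding spots in an analysis.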
Quality Control hybridization: PRvQCHyb()
9-mers hybridizations help verify that oligonucleotides have been
spotted properly on the slides. The next print-run quality control step
will be:
- Detect any difference in overall signal intensity compared to
other print-runs
- 70-mers oligonucleotides hybridizations
- Selection of several test slides to ensure that the same
quantity of material was spotted across the platter, as a print-run
will generate 255 slides using the same well for one probe. QCHybs are
performed using one slide from the beginning of the print, one from the
middle, one from the end (e.g. numbers 20, 100 and 255 in the Functional
Genomics Core Facility).
- Check if the GAL file was generated properly, i.e. check that no
error
was made with ordering or orientation of the plates during the print.
- Reproducibility:
A good way to verify the quality of a new print is to hybridize known
samples to new slides. Then, we can compare signal intensity from the
new slides to existing data, and check that there is no loss in signal.
Log ratios (M) for known samples should be similar across print-runs.
Example of samples used for QCHybs includes Human Reference pool, Mouse
liver, Mouse lung, with dye swaps.
To generate diagnostic plots
- Create a directory and move the image processing output files
(.gpr files only) of the slides of interest to this directory. Make
sure that all files in the directory come from the SAME print-run (same
GAL file).
- Start R, and change R working directory to the one you have
created (see general hybridization paragraph above for more details)
- Load the package in your R session:
library(arrayQuality)
If needed, you may have to install other required packages like marray,
limma, convert and hexbin.
- For QCHyb analysis, run the following command:
PRvQCHyb()
Results
PRvQCHyb() returns a diagnostic plot as an image in .png format for
each tested slide.
Introduction to microarray quality
A microarray experiment is composed of several steps, including
experimental design, sample preparation, and various statistical
analyses (figure 1). They are represented in the microarray lifecycle
below. As microarray technology is complex and sensitive, it is
important to assess the performance of each step before going to the
next one. In addition, this is also a good way to trace back the cycle
to understand potential causes for upstream problems.
Figure 1: Microarray experiment lifecycle
For spotted array experiments, quality controls can be summarized into
4 steps:
- Print quality
  - 9mers hybridization
  - Quality Control hybridization
- mRNA quality
- Array hybridization quality
- Spot quality
Each step must be performed in a sequential order, as represented in
Figure 2.
Figure 2: Quality Control for spotted arrays
experiment
Our package provides graphical tools to look at two of these
components: print-run quality and array hybridization quality.
- Print quality:
This component is highly tailored to the Shared Genomics Core Facility
at UCSF, but the framework can be adapted to other Core facilities or
laboratories printing their arrays. It is an essential component of a
printed array experiment, as any print pin, probe or slide surface
defect will affect the quality of hybridization to the slide, and this
cannot be corrected by statistics. Only prints that pass the quality
control check will be used for actual hybridization.
- Hybridization quality:
This is a global assessment of the hybridization performance. It helps
determine for example any problem with the dyes, or uneven
hybridization. Then, once you have determined that your hybridization
is good, you can look at each individual spot quality, remove bad
spots, and perform statistical analysis.
3. Print-run quality control
When a print-run is completed, it is necessary to verify the quality of
the resulting arrays. This can be done by using two kinds of
hybridization to the new slides. The first type of hybridization, which
we term “9mers hyb”, uses small oligonucleotides (random 9-mers), which
will hybridize to each probe. This hybridization will help to determine
the quality of spot morphology as well as the presence or absence of
spotted oligonucleotides. The resulting data will be used to create a
list of all missing spots.
The second type of hybridization, which we will term Quality Control
Hybridization (QCHyb), uses mRNA from predefined cell lines (e.g. liver
vs. pool, K562 vs. Human Universal Reference pool from Stratagene).
These hybridizations can be used as a more quantitative description of
the slides. The same comparison hybridizations are done for different
print-runs, assessing their reproducibility. QCHybs are also used to
verify accuracy of GAL files, number of missing spots, binding
capacity, background signal intensity…
The arrayQuality package provides specific tools to help assess quality
of slides for both 9-mers and QC hybridization.
3.1 9-mers hybridizations
In the package, the graphical function to assess 9mers hybridization
quality is PRv9mers(). It runs using one single command line script. To
use it:
- Copy all 9-mers hybridizations gpr files from the SAME print-run
(same GAL file) to a directory.
- Change R working directory to the one containing your gpr files
as described in section 1.
- Type:
PRv9mers(prname="12Mm")
The prname argument represents the name of your print-run. For more
details about other arguments, please refer to the online manual.
Results
PRv9mers() provides the following results:
- Diagnostic plots as image in .png format for each tested slide
- An Excel file (typically named 9Mm9mer.xls, where 9Mm is the name
of your print-run, as passed to prname) containing for each spot on the
slide:
- Name and ID of the spot
- The probability of being present or absent (p from EM
algorithm). If several files are tested together, you will have a
probability of being present/absent for each file.
A spot is considered absent if p < 0.5.
- The average probability of being present or absent.
- The raw signal intensity (Signal column) or average raw
signal intensity if several files are tested together for each spot.
- An Excel file (typically named 9MmMissing.xls, where 9Mm is the
name of your print-run, as passed to prname) containing information on
missing probes only:
- Name and ID of the spot
- The probability of being present or absent (p from EM
algorithm). If several files are tested together, you will have a
probability of being present/absent for each file.
A spot is considered absent if p < 0.5.
- The average probability of being present or absent.
- The raw signal intensity (Signal column) or average raw
signal intensity if several files are tested together for each spot.
- A text file (typically named 9MmQuickList.txt, where 9Mm is the
name of your print-run, as passed to prname) containing the missing
probes ids, each on a separate line. This file can be opened in any
word processing program.
Description of the diagnostic plots
Figure 3 shows an example from a typical 9-mers hybridization. This
image is divided in 5 plots.
- The first column (left) represents boxplots of log intensity, by
plates (top) and by print-tip group (bottom). In this example, you will
notice on the boxplot by plates (top left corner) that plates 44 and 48
have lower intensity and wider range than the others. Both plates
contain mostly empty controls, as designed by Operon.
- Central plot: spatial plot of intensity. This helps to locate
missing spots. The color scale reflects the signal intensity, the
darker the color of the plot, the stronger the signal. Missing spots
are represented in white. In Figure 3 spatial plot, top right corner
white spots come from the empty spots.
- Right column: Density plot of the foreground and background log
intensity.
- Foreground density plot: it should be composed of 2 peaks. A
smaller peak in the low intensity region containing missing spots and
negative control spots, and a higher one representing the rest of the
spots (probes). The number of present and absent spots, excluding empty
controls, estimated by EM algorithm is indicated on the graph.
- Background density plot: one peak in the low intensity
region. If a slide is of good quality, the background peak should not
overlap too much with the foreground peak corresponding to the bulk of
the data.
Density plots are used to compare foreground and background peaks,
using the X-axis scale. They should be clearly separated. The number of
missing spots should be low. Missing spots ids may be incorporated in
the analysis later, e.g. by down weighting them in linear models.
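The idea of separating the two foreground peaks with an EM algorithm can be illustrated with a two-component Gaussian mixture. The sketch below uses simulated log intensities and a plain base R EM loop; it is an illustration of the general technique under our own assumptions (Gaussian components, simulated data, fixed iteration count) and is not the implementation used inside PRv9mers().

```r
# Two-component Gaussian mixture fitted by EM on simulated log2 intensities.
set.seed(1)
x <- c(rnorm(50, mean = 6, sd = 0.5),    # simulated missing/negative spots
       rnorm(450, mean = 11, sd = 1.0))  # simulated present spots

# Initial guesses: one component at each end of the intensity range.
mu <- range(x)
sigma <- rep(sd(x), 2)
prop <- c(0.5, 0.5)

for (iter in 1:200) {
  # E-step: posterior probability that each spot belongs to the high peak.
  d_low <- prop[1] * dnorm(x, mu[1], sigma[1])
  d_high <- prop[2] * dnorm(x, mu[2], sigma[2])
  p_high <- d_high / (d_low + d_high)

  # M-step: update mixing proportions, means and standard deviations.
  prop <- c(mean(1 - p_high), mean(p_high))
  mu <- c(weighted.mean(x, 1 - p_high), weighted.mean(x, p_high))
  sigma <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - p_high)),
             sqrt(weighted.mean((x - mu[2])^2, p_high)))
}

# Classify each spot with the same p < 0.5 rule used for absent spots.
n_present <- sum(p_high >= 0.5)
n_absent <- sum(p_high < 0.5)
cat("present:", n_present, " absent:", n_absent, "\n")
```

When the two peaks are well separated, as they should be on a good slide, the classification is insensitive to the starting values.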
Examples
This example uses 9-mer hybridization data performed in the Functional
Genomics Core Facility in UCSF. This print-run was created using Operon
Version 2 Mouse oligonucleotides.
> library(arrayQuality)
> datadir <- system.file("gprQCData", package="arrayQuality")
> PRv9mers(fnames="12Mm250.gpr",path=datadir, prname="12Mm")
Figure 3: Example of diagnostic plot for 9-mers hybridization
9-mers hybridizations help verify that oligonucleotides have been
spotted properly on the slides. The next print-run quality control step
will be:
1. Detect any difference in overall signal intensity compared to other
print-runs
  a. 70-mers oligonucleotides hybridizations
  b. Selection of several test slides to ensure that the same quantity
  of material was spotted across the platter, as a print-run will
  generate 255 slides using the same well for one probe. QCHybs are
  performed using one slide from the beginning of the print, one from
  the middle, one from the end (e.g. numbers 20, 100 and 255 in the
  Functional Genomics Core Facility).
2. Check if the GAL file was generated properly, i.e. check that no
error was made with ordering or orientation of the plates during the
print.
3. Reproducibility:
A good way to verify the quality of a new print is to hybridize known
samples to new slides. Then, we can compare signal intensity from the
new slides to existing data, and check that there is no loss in signal.
Log ratios (M) for known samples should be similar across print-runs.
Examples of samples used for QCHybs include Human Reference pool, Mouse
liver, Mouse lung, with dye swaps.
The function in the package which performs the quality assessment for
QCHybs is PRvQCHyb().
- Copy the QCHybs gpr files from the SAME print-run (same GAL file) to
a directory.
- Change R working directory to the one containing your gpr files as
described in section 1.
- Type:
> PRvQCHyb(prname="9Mm")
where prname is the name of the print-run. For more details about its
arguments, please refer to the online manual.
PRvQCHyb() returns a diagnostic plot as an image in .png format for
each tested slide.
Throughout our document, we will be using the color code described in
Table 1 to highlight control spots.
Spot type | Color
Positive controls | Red
Empty controls | Blue
Negative controls | Navy Blue
Probes | Green
Missing spots | White
Table 1: Color code used in arrayQuality
Restrictions:
Currently, PRvQCHyb() supports Mouse genome (Mm) only. We will add Human
data as soon as it becomes available.
Figure 4 shows an example of a good print-run QCHyb.
- MA-plot of raw M values. No background
subtraction is performed. The colored lines represent the loess curves
for each print-tip group. The red dots highlight any spot with
corresponding weighted value less than 0. Users can create their
own weighting scheme or function. Things to look for in a MA-plot are
saturation of spots and the trend of loess curves, which is an
indicator of the amount of normalization to be performed.
- Boxplot of raw M values by print-tip
group, without background subtraction.
- Spatial plot of rank of raw M values
(no background subtraction): Each spot is ranked according to its M
value. We use a blue to yellow color scale, where blue represents the
higher rank (1), and yellow represents the lower one. Missing spots are
represented as white squares.
- Spatial plot of A values. The color
indicates the strength of the signal intensity, i.e. the darker the
color, the stronger the signal. Missing spots are represented in white.
- Histogram of the signal-to-noise
log-ratio (SNR) for Cy5 and Cy3 channels. The mean and the variance of
the signal are printed on top of the histogram. In addition, overlay
densities of SNR stratified by different control types (status) are
highlighted. Their color schemes are provided in Table 1. The SNR is a
good indicator for dye problems. The negative controls and empty
controls density lines should be close, almost superimposed.
- Comparison of M values of probes known
to be differentially expressed from the tested array to average M values
obtained during previous hybridizations. This plot is aimed at
verifying the reproducibility of print-runs. The dotted lines are the
diagonal (no change) and the +2/-2 fold change lines. Each probe is
represented by a number, and described in the file MmDEGenes.xls. Most
of the spots should lie between the +2/-2 fold-change lines. If the
technique was perfect, you would see a straight line on the diagonal.
If any probe falls outside this region (number 29 here), you can look up
its number in our probe list in MmDEgenes.xls and get more information
about it.
- Dot plot of controls A values, without
background subtraction. Controls with more than 3 replicates are
represented on the Y-axis, the color scheme is represented in Table 1.
Intensity of positive controls should be in the high-intensity region,
negative and empty controls should be in the lower intensity region.
Positive controls range and negative/empty controls range should be
separated. Replicate spots signal should be tight.
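For readers less familiar with MA-plots, the M and A values used in several of the plots above follow the standard definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2, where R and G are the Cy5 and Cy3 intensities. The sketch below computes them in base R from made-up foreground intensities; the variable names are ours.

```r
# Made-up raw foreground intensities for five spots (no background
# subtraction), for illustration only.
R <- c(1200, 800, 15000, 300, 5000)   # Cy5 (red) channel
G <- c(1000, 1600, 15000, 320, 2500)  # Cy3 (green) channel

# Log ratio (M) and average log intensity (A), as used in MA-plots.
M <- log2(R) - log2(G)
A <- (log2(R) + log2(G)) / 2
print(data.frame(R, G, M = round(M, 2), A = round(A, 2)))

# A smooth trend of M against A; a flat trend near 0 suggests that little
# normalization is needed. It can be drawn with plot(A, M); lines(trend).
trend <- lowess(A, M)
```

A spot with twice as much red as green signal has M = 1, and one with half as much has M = -1, so the +2/-2 fold change lines correspond to a distance of 1 on the M scale.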
Data for this example was provided by the Functional Genomics Core
Facility in UCSF. We have tested slide number 137 from print-run 9Mm.
This print-run uses Operon Version 2 Mouse oligos. Results are shown in
Figure 4.
> library(arrayQuality)
> datadir <- system.file("gprQCData", package="arrayQuality")
> PRvQCHyb(fnames="9Mm137.gpr", path=datadir, prname="9Mm")
Figure 4: Diagnostic plot for print-run Quality Control hybridization
The general hybridization quality component is aimed at verifying the
performance of your hybridization, given the good quality of the slide,
before any preprocessing steps or further quality assessment on
individual spots. This is where you determine if your experiment quality
is good enough to enter your dataset. For example, you will need to
remove any hybridization with very low SNR, or large spatial artifacts.
Our package provides two kinds of quality control plots. The first one
is a qualitative quality control measurement as a diagnostic plot. It is
a quick visual way to determine hybridization quality gathering
information from several statistical tools. More details on individual
diagnostic plots can be found in the vignette “marrayPlots” in the
package marray. The second one is a more quantitative comparison of
slide quality. We extract some statistical measures from the test slide
and we compare them against results obtained for a collection of slides
of “good quality” to assess the quality of the hybridization. This
comparison is visualized through a comparative boxplot. Results are
displayed in an HTML report. Figure 5 shows a screen shot of a typical
HTML report. Users can click on each image to obtain a higher resolution
plot.
Diagnostic plots can be generated for different image processing
software formats: GenePix format files (.gpr files), Spot format files
(.spot) and Agilent format files, or from marrayRaw or RGList objects.
Most arguments can also be customized to match your own data: which
probes are used as controls, which column of the image processing output
file is used to define your spot types... You can also specify your own
collection of good quality slides using the functions globalQuality and
qualRefTable. For more details about these functions, please refer to
the online help and the example at the end of this Section.
To generate quality plots: gpr files: gpQuality()
We provide 3 main functions to generate quality plots:
gpQuality(), spotQuality() and agQuality().
We will use gpQuality as an example, but the following can be directly
applied to spotQuality or agQuality. gpQuality() will
generate both diagnostic plots and comparative boxplots. It uses by
default spot types from the Functional Genomics Core Facility in UCSF.
To use your own spot types, please refer to the end of this Section.
- Copy the gpr files from the SAME print-run (same GAL file) to a
directory.
- Change R working directory to the one containing your gpr files as
described in Section 1.
- To generate both diagnostic plots and comparative boxplots on all
files in the directory, run:
> result <- gpQuality(organism="Mm")
- To generate diagnostic plots only, run:
> result <- gpQuality(organism="Mm", compBoxplot=FALSE)
In this case, quantitative quality measures will not be calculated and
the HTML report will not be generated.
- To write your quantitative quality measures and your normalized data
to a file, set output = TRUE when calling gpQuality:
> result <- gpQuality(organism="Mm", output=TRUE)
This command will create two files: quality.txt, which contains your
quality measures, and NormalizedData.xls, which contains your normalized
M values. If you have set compBoxplot = FALSE, quantitative quality
measures are not calculated, so the quality.txt file will not be
generated.
To generate quality plots from marrayRaw/RGList objects: maQualityPlots
This function can be used to obtain quality plots for data generated
with other image processing software, like Spot for example.
maQualityPlots() will generate diagnostic plots only. It uses the spot
types defined when creating the R object. To learn more about how to
read data into a marrayRaw or an RGList object, please refer to the
marray or limma package vignettes.
- To generate diagnostic plots: if rawdata is your marrayRaw/RGList
object, type:
> maQualityPlots(rawdata)
gpQuality() outputs
- In the working directory:
  - two plots for each test slide (a diagnostic plot and a comparative
  boxplot)
  - an HTML quality report
- In the R global environment, the function returns a list of two
objects:
  - A marrayRaw object describing all tested slides
  - A quality measures matrix: this matrix contains all comparison
  measures values extracted for each test slide. Each column of the
  matrix represents a different slide.
For each slide, you will find on the report how many of your slide’s
results are below the recommended range. If you want to specify a
directory to store the results, you can do it by modifying the argument
resdir accordingly. For more details about gpQuality arguments, please
refer to the online manual.

Figure 5: Example of HTML report generated by gpQuality
maQualityPlots() output:
- In the working directory:
  - one diagnostic plot for each test slide
gpQuality calls two key functions, maQualityPlots and qualBoxplot.
qualBoxplot supports Mouse (Mm) and Human (Hs) genomes only. To generate
quality plots for other genomes, you need to set the gpQuality argument
compBoxplot = FALSE. In this case, only the diagnostic plots will be
generated.
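The restriction above can be turned into a small guard so that the comparative boxplot is only requested for supported genomes. This is a sketch: the helper `boxplot_supported()` and the organism code "Rn" are hypothetical names introduced for illustration, and the call to gpQuality() only runs when the package is installed and .gpr files are present in the working directory.

```r
# Hypothetical helper: TRUE when the comparative boxplot is available.
# Only "Mm" and "Hs" are supported by qualBoxplot according to this guide.
boxplot_supported <- function(organism) {
  organism %in% c("Mm", "Hs")
}

organism <- "Rn"  # hypothetical example: a genome other than mouse or human
use_boxplot <- boxplot_supported(organism)

if (requireNamespace("arrayQuality", quietly = TRUE) &&
    length(list.files(pattern = "\\.gpr$")) > 0) {
  library(arrayQuality)
  if (use_boxplot) {
    result <- gpQuality(organism = organism)
  } else {
    # Diagnostic plots only: no quality measures and no HTML report.
    result <- gpQuality(compBoxplot = FALSE)
  }
}
```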
Description of the diagnostic plots:
Figure 6 represents an example of a good
hybridization
diagnostic plot.
- MA-plot of raw M. No background
subtraction is performed. The colored lines represent the loess curves
for each print-tip group. The red dots highlight any spot with
corresponding weighted value less than 0. Users can create their
own weighting scheme or function. Things to look for in a MA-plot are
saturation of spots and the trend of loess curves, which is an
indicator of the amount of normalization to be performed.
- MA-plot of normalized data density. By
default, print-tip loess normalization is used. Instead of the typical
MA-plot, we have used the package "hexbin" to highlight density
of dots on the MA-plot. A light yellow color indicates a high density
of dots, whereas blue color represents a lower density. This plot gives
you information on the bulk of your data intensity (low/high signal)
- Spatial plot of rank of raw M values
(no background subtraction): Each spot is ranked according to its M
value. We use a blue to yellow color scale, where blue represents the
higher rank (1), and yellow represents the lower one. Missing spots are
represented as white squares. This is a quick way to visually detect
uneven hybridization and missing spots.
- Spatial plot of normalized M values
ranks. By default, print-tip loess normalization is used. Each spot is
ranked according to its M value. We use a blue to yellow color
scale, where blue represents the higher rank (1), and yellow represents
the lower one. Missing spots are represented as white squares. In
addition, flagged spots are highlighted by a black square. This type of
graphical representation helps verify that normalization removed any
spatial effects.
- Spatial plot of raw A values. The
color indicates the strength of the signal intensity, i.e. the darker
the color, the stronger the signal. Missing spots are represented in
white.
- Histogram of the signal-to-noise
log-ratio (SNR) for Cy5 and Cy3 channels. The mean and the variance of
the signal are printed on top of the histogram. In addition, overlay
densities of SNR stratified by different control types (status) are
highlighted. Their color schemes are provided in Table 1. The SNR is a
good indicator for dye problems. The negative and empty controls
density lines should be close, almost superimposed.
- Dot plot of controls normalized M
values. Controls with more than 3 replicates are represented on the
Y-axis, the color scheme is represented in Table 1. Controls M values
should be tight and close to 0.
- Dot plot of controls A values, without
background subtraction. Controls with more than 3 replicates are
represented on the Y-axis, the color scheme is represented in Table 1.
Intensity of positive controls should be in the high-intensity region,
negative and empty controls should be in the lower intensity region.
Positive controls range and negative/empty controls range should be
separated.
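As a rough illustration of the rank-based spatial plots described above, the base R sketch below ranks simulated M values and draws them on a hypothetical 20 x 30 grid. The layout, the intensities and the convention that rank 1 goes to the largest M value are assumptions made for this example; it is not arrayQuality's internal code.

```r
# Simulated intensities for a hypothetical 20 x 30 spot layout (made-up values).
set.seed(1)
nr <- 20; nc <- 30
R <- 2^rnorm(nr * nc, mean = 10)   # stand-in for Cy5 foreground
G <- 2^rnorm(nr * nc, mean = 10)   # stand-in for Cy3 foreground
G[c(5, 77)] <- NA                  # two missing spots

M <- log2(R) - log2(G)             # raw log-ratio, no background subtraction

# Assumed convention: rank 1 is the largest M; missing spots keep NA.
rankM <- rank(-M, na.last = "keep")
rankMat <- matrix(rankM, nrow = nr, ncol = nc)

# Blue for rank 1 through yellow for the last rank; NA cells are not drawn.
image(t(rankMat)[, nr:1],
      col = colorRampPalette(c("blue", "yellow"))(50),
      axes = FALSE, main = "Rank of raw M values (simulated)")
```

On real data, a block of similar colors concentrated in one area of the image would suggest a spatial effect.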
Figure 7 shows an example of a comparative boxplot.
We have chosen a wide range of measures to quantify the quality of a
typical hybridization: single channel measures (range of foreground
signal, MAD of background, signal to noise ratio…), two channel
measures (median A values for each type of controls, amount of
normalization needed…), and the percentage of flagged spots. Some
measures have been negated so that, for every measure, the scale
increases from problematic to good quality.
For each measure, the graph shows the following:
- Boxplot of the reference slides' values.
- 1st and 3rd quartiles before scaling for each boxplot.
- Y-axis on the right: for each measure, we have printed 2 values. The
first one is the percentage of reference slide measures under your
slide's result. The second one is your slide's value for this measure
before scaling.
- We have scaled all the results so that they can be compared on the
same graph.
- The red dots are the test slide's scaled values.
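The comparison against reference slides can be sketched in base R as follows. The helper name compareToReference, the simulated reference values and the scaling rule (centre on the reference median, divide by the reference interquartile range) are all assumptions for illustration; the guide does not state which scaling arrayQuality applies.

```r
# Hypothetical helper: compare one slide's measure with reference slides.
compareToReference <- function(refValues, testValue, negate = FALSE) {
  # Negate measures where smaller is better, so that larger always means better.
  if (negate) {
    refValues <- -refValues
    testValue <- -testValue
  }
  q <- quantile(refValues, c(0.25, 0.5, 0.75), names = FALSE)
  # Assumed scaling: centre on the median, divide by the interquartile range.
  scale1 <- function(x) (x - q[2]) / (q[3] - q[1])
  list(pctBelow   = 100 * mean(refValues < testValue),  # % of references under the slide
       scaledRef  = scale1(refValues),
       scaledTest = scale1(testValue))
}

set.seed(2)
refMad <- abs(rnorm(50, mean = 0.3, sd = 0.05))  # made-up background MADs
res <- compareToReference(refMad, testValue = 0.5, negate = TRUE)
res$pctBelow    # share of reference slides that look worse than the test slide
res$scaledTest  # position of the red dot on the common scale
```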
The 16 measures we have selected are listed below.
1. rangeRf: Range of Cy5 foreground, where the range is defined by:
    rangeRf = max(log2(median Cy5 foreground)) - min(log2(median Cy5 foreground))
where median Cy5 foreground corresponds to the "F635 Median" column of
the gpr file.
2. rangeGf: Range of Cy3 foreground, where the range is defined by:
    rangeGf = max(log2(median Cy3 foreground)) - min(log2(median Cy3 foreground))
where median Cy3 foreground corresponds to the "F532 Median" column of
the gpr file.
3. -RbMad: Cy5 background MAD
    RbMad = mad[log2(Cy5 background)]
where:
- Cy5 background corresponds to the "B635 Median" column of the gpr file
- MAD = median{ |Y - mu| }, when Y is normal
4. -GbMad: Cy3 background MAD
    GbMad = mad[log2(Cy3 background)]
where:
- Cy3 background corresponds to the "B532 Median" column of the gpr file
- MAD = median{ |Y - mu| }, when Y is normal
5. Median RS2N: Median Signal To Noise log-ratio for Cy5
    RS2N = log2(mean Cy5 foreground / median Cy5 background)
    RS2Nmedian = median(RS2N)
where:
- mean Cy5 foreground is the "F635 Mean" column of the gpr file
- median Cy5 background is the "B635 Median" column of the gpr file
6. Median GS2N: Median Signal To Noise log-ratio for Cy3
    GS2N = log2(mean Cy3 foreground / median Cy3 background)
    GS2Nmedian = median(GS2N)
where:
- mean Cy3 foreground is the "F532 Mean" column of the gpr file
- median Cy3 background is the "B532 Median" column of the gpr file
7. -Median A for empty control:
    A = [log2(median Cy5 foreground) + log2(median Cy3 foreground)] / 2
    Median A for empty control = median(A(Empty controls))
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the
gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the
gpr file
- Empty controls are the probes labelled "Empty"
8. -Median A for negative control:
    A = [log2(median Cy5 foreground) + log2(median Cy3 foreground)] / 2
    Median A for negative control = median(A(Negative controls))
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the
gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the
gpr file
- Negative controls are the probes labelled "Negative"
9. Median A values for Positive controls:
    A = [log2(median Cy5 foreground) + log2(median Cy3 foreground)] / 2
    Median A for positive control = median(A(Positive controls))
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the
gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the
gpr file
- Positive controls are the probes labelled "Positive"
10. Difference between A values for Positive controls and A values for
Negative controls
    difference = median(A(Positive controls)) - median(A(Negative controls))
11. -varRepA: variance of replicate spots' A values
    varRepA = var[A(replicates)]
where:
- A = [log2(median Cy5 foreground) + log2(median Cy3 foreground)] / 2
- median Cy5 foreground corresponds to the "F635 Median" column of the
gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the
gpr file
12. -msePtip: MSE of M values by print-tip group, no background
subtraction
    M = log2(median Cy5 foreground) - log2(median Cy3 foreground)
    msePtip = MSE(mean M by print-tip)
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the
gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the
gpr file
- MSE(X) = E((X - t)^2), with t a parameter and X an estimator of t.
13. -mseFit: MSE of lowess curve
    fit = lowess(A, M)
    mseFit = MSE(fit$y)
where:
- A = [log2(median Cy5 foreground) + log2(median Cy3 foreground)] / 2
- M = log2(median Cy5 foreground) - log2(median Cy3 foreground)
- median Cy5 foreground corresponds to the "F635 Median" column of the
gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the
gpr file
- MSE(X) = E((X - t)^2), with t a parameter and X an estimator of t.
14. -Percentage of flagged spots
    [number of spots with flag < 0 / number of spots] * 100
where flag is the information from the "Flags" column of the gpr file.
Only spots with a flag less than 0 are taken into account.
15. -M values MMRmad
    MMR = Mmean - Mmedian
    MMRmad = MAD(MMR)
where:
- Mmean = log2(mean Cy5 foreground) - log2(mean Cy3 foreground),
i.e. M values calculated using the mean signal
- Mmedian = log2(median Cy5 foreground) - log2(median Cy3 foreground),
i.e. M values calculated using the median signal
- mean Cy5 (Cy3) foreground is the "F635 Mean" ("F532 Mean") column of
the gpr file
- median Cy5 (Cy3) foreground is the "F635 Median" ("F532 Median")
column of the gpr file
- MAD = median{ |Y - mu| }, when Y is normal
16. -Percentage of spots with abs[MMR] > 0.5
where:
- MMR = Mmean - Mmedian
- Mmean = log2(mean Cy5 foreground) - log2(mean Cy3 foreground),
i.e. M values calculated using the mean signal
- Mmedian = log2(median Cy5 foreground) - log2(median Cy3 foreground),
i.e. M values calculated using the median signal
- mean Cy5 (Cy3) foreground is the "F635 Mean" ("F532 Mean") column of
the gpr file
- median Cy5 (Cy3) foreground is the "F635 Median" ("F532 Median")
column of the gpr file
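Several of the measures listed above can be reproduced by hand with a few lines of base R. The sketch below runs on a simulated data frame that borrows the GenePix column names quoted in the definitions; the values are made up, and the use of R's mad() with its default settings (centred on the median, scaled by 1.4826) is an assumption about how the MAD is computed.

```r
set.seed(3)
n <- 1000
fgR <- round(2^rnorm(n, 10, 1.5)) + 60   # simulated Cy5 median foreground
fgG <- round(2^rnorm(n, 10, 1.5)) + 60   # simulated Cy3 median foreground

# Simulated stand-in for the relevant columns of a gpr file.
gpr <- data.frame(
  "F635 Median" = fgR,
  "F635 Mean"   = fgR + sample(0:20, n, replace = TRUE),
  "B635 Median" = sample(40:60, n, replace = TRUE),
  "F532 Median" = fgG,
  "F532 Mean"   = fgG + sample(0:20, n, replace = TRUE),
  "B532 Median" = sample(40:60, n, replace = TRUE),
  "Flags"       = sample(c(0, -50, -100), n, replace = TRUE,
                         prob = c(0.9, 0.05, 0.05)),
  check.names = FALSE
)
log2col <- function(name) log2(gpr[[name]])

rangeRf    <- diff(range(log2col("F635 Median")))                  # measure 1
RbMad      <- mad(log2col("B635 Median"))                          # measure 3
RS2Nmedian <- median(log2(gpr[["F635 Mean"]] / gpr[["B635 Median"]]))  # measure 5
A          <- (log2col("F635 Median") + log2col("F532 Median")) / 2
Mmedian    <- log2col("F635 Median") - log2col("F532 Median")
Mmean      <- log2col("F635 Mean") - log2col("F532 Mean")
pctFlagged <- 100 * mean(gpr[["Flags"]] < 0)                       # measure 14
MMR        <- Mmean - Mmedian
MMRmad     <- mad(MMR)                                             # measure 15
pctBigMMR  <- 100 * mean(abs(MMR) > 0.5)                           # measure 16
```

The control-based measures (7 to 10) would additionally need a vector giving the status of each spot, which is what the controlCode matrix provides.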
Data for this example was provided by the Functional Genomics Core
Facility at UCSF. We have tested slide number "137" from
print-run "9Mm". This array was fabricated using Operon Version 2 Mouse
oligos, and the hybridization measures differential gene expression in
two RNA samples, Mouse Liver and Mouse Reference Pool. Results are
shown in Figure 6 and Figure 7.
To generate diagnostic plots, comparative boxplots and an HTML report,
and to write your quality measures and normalized data to a file in a
directory named "Results":
> library(arrayQuality)
> datadir <- system.file("gprQCData", package="arrayQuality")
> result <- gpQuality(fnames = "9Mm137.gpr", path = datadir, organism = "Mm", output = TRUE, resdir = "Results")

Figure 6: General hybridization quality diagnostic plot
Figure 7: Comparative boxplot
Customizing gpQuality
arrayQuality currently uses look-up tables adapted to hybridizations
performed in the Functional Genomics Core Facility at UCSF. Depending on
your data, you may find that the probes defined as controls in
arrayQuality are not present on your array, leading to NAs in the
comparative boxplot, or you may be working with a genome for which we do
not provide references. gpQuality has several arguments that you can
modify in order to use your own spot types or your own collection of
good slides.
gpQuality arguments are listed below:
gpQuality(fnames = NULL, path = ".", organism = c("Mm", "Hs"),
          compBoxplot = TRUE, reference = NULL,
          controlMatrix = controlCode, controlId = c("ID", "Name"),
          output = FALSE, resdir = ".", dev = "png", DEBUG = FALSE, ...)
To use your own set of spot types (i.e. controls...): you will need to
change controlMatrix and/or controlId.
To use your own collection of good slides: you will need to modify reference.
To use your own set of spot types:
The spot types used in arrayQuality are defined in a 2-column matrix
called controlCode.
Pattern | Name
Buffer | Buffer
Empty | Empty
EMPTY | Empty
AT | Negative
M200009348 | Positive
M200003425 | Positive
NLG | con
Table 2: Examples of controls used in arrayQuality
To define your own spot types, you will need to replace the default
values in controlCode with your values. The easiest way to do it is to
create a tab-delimited text file named SpotTypes.txt, and read it
into arrayQuality using the function readcontrolCode. It is also
possible to create a new controlCode matrix directly.
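A control matrix of this shape can also be typed in directly. The sketch below builds one from a subset of the patterns in Table 2 and applies it to a few made-up probe identifiers. The matching loop (literal substring match, first matching pattern wins, everything else labelled "probes") is an assumption used to show how the Pattern and Name columns relate; it is not arrayQuality's own matching code.

```r
# Two-column character matrix with the same column names as controlCode.
myControlCode <- cbind(
  Pattern = c("Buffer", "Empty", "EMPTY", "AT", "M200009348", "M200003425"),
  Name    = c("Buffer", "Empty", "Empty", "Negative", "Positive", "Positive")
)

# Hypothetical identifiers, as they might appear in the "ID" column of a gpr file.
ids <- c("M200009348", "Empty", "M300001234", "AT_neg1", "EMPTY", "Buffer")

# Assumed rule: literal substring match; unmatched spots are ordinary probes.
status <- rep("probes", length(ids))
for (i in seq_len(nrow(myControlCode))) {
  hit <- grepl(myControlCode[i, "Pattern"], ids, fixed = TRUE) & status == "probes"
  status[hit] <- myControlCode[i, "Name"]
}
data.frame(ID = ids, Status = status)
```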
1. If you want to use a Spot Types file:
A spot types file is a tab-delimited text file which allows you to
identify different types of spots from the gene list. It should contain
at least a column named SpotType, where all the different spot types are
listed, and one or more other columns, which should have the same names
as columns in the GAL file, containing patterns or regular expressions
sufficient to identify the spot type. For more information, you can
refer to the limma package user guide.
Warning: You will need to include a spot type named probes!
Below is an example of a spot types file for the swirl dataset. In this
case there are only two types of spots, probes and controls.
Example of spot types file
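A minimal sketch of writing such a file from R is given below. The two rows, the "*" wildcard and the "control" pattern are hypothetical values in the style of limma spot types files; they are not copied from the actual swirl file, and the file is written to a temporary directory only for the sake of the example.

```r
# Hypothetical spot types table: a SpotType column plus columns named
# after GAL file columns that hold the matching patterns.
spotTypes <- data.frame(
  SpotType = c("probes", "controls"),
  ID       = c("*", "control"),
  Name     = c("*", "*"),
  stringsAsFactors = FALSE
)

# Write it as a tab-delimited text file.
spotFile <- file.path(tempdir(), "mySpotTypes.txt")
write.table(spotTypes, spotFile, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the required pieces are present.
check <- read.delim(spotFile, stringsAsFactors = FALSE)
stopifnot("SpotType" %in% names(check), "probes" %in% check$SpotType)
```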
To read the new spot types in arrayQuality:
- Create your spot types file.
- Find which column of the file contains the probe identifiers for each
type. In the example in Figure 8, it is the "ID" column. You will need to
pass this column name as an argument in the next step.
- Read the spot types file using the readcontrolCode function.
> controlCode <- readcontrolCode(file="mySpotTypes.txt", controlId="ID")
- Find which column of the gpr file can be used to identify your new
spot types. It is typically the "ID" or the "Name" column.
- To generate both types of plots: call gpQuality, specifying your new
controlCode matrix in controlMatrix and the column used to define your
spot types in controlId.
> result <- gpQuality(controlMatrix = controlCode, controlId="ID")
2. If you want to create a new controlCode matrix directly
You will need to create another controlCode table, also containing two
columns, and then overwrite the default controlCode loaded with
arrayQuality:
- A column named "Pattern" containing your control IDs
- A column named "Name", describing what kind of control each probe is
(in particular, which are Positive, Negative and Empty controls)
You can do this by creating a tab-delimited text file and reading it
into R after loading arrayQuality:
> library(arrayQuality)
> mycontrolCode <- as.matrix(read.table("mycontrolCode.txt", sep="\t", header=TRUE, quote="\"", fill=TRUE))
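Before passing a hand-made matrix to gpQuality, it may be worth checking that it has the expected shape. The helper below, checkControlCode, is hypothetical and not part of arrayQuality; it only verifies the two column names and warns when no pattern is defined for Positive, Negative or Empty controls, since the comparative boxplot relies on these. The file is written to a temporary directory so that the example is self-contained.

```r
# Hypothetical sanity check for a control matrix (not an arrayQuality function).
checkControlCode <- function(cc) {
  stopifnot(is.matrix(cc), all(c("Pattern", "Name") %in% colnames(cc)))
  missingTypes <- setdiff(c("Positive", "Negative", "Empty"), cc[, "Name"])
  if (length(missingTypes) > 0)
    warning("No pattern defined for: ", paste(missingTypes, collapse = ", "))
  invisible(missingTypes)
}

# Small tab-delimited example file with made-up content.
ccFile <- file.path(tempdir(), "mycontrolCode.txt")
writeLines(c("Pattern\tName",
             "Empty\tEmpty",
             "AT\tNegative",
             "M200009348\tPositive"), ccFile)

mycontrolCode <- as.matrix(read.table(ccFile, sep = "\t", header = TRUE,
                                      quote = "\"", fill = TRUE))
missingTypes <- checkControlCode(mycontrolCode)  # silent when all three types exist
```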
Then, pass your new matrix as an argument when calling gpQuality. You
can specify which column of the gpr file contains the probe identifiers
in the controlId argument (typically, it would be "ID" or "Name").
> results <- gpQuality(controlMatrix = mycontrolCode, controlId = "ID")
To use your own reference slides:
If you would like to use your own set of reference slides, you will
need to follow a few steps to create the necessary look-up tables. This
feature can be used, for example, if you want to study hybridization
quality for other genomes, or if you would like to compare slide quality
within a large dataset. To generate your own references:
1. Gather the slides of "good" quality you would like to use as
reference in a directory, for example "MyReferences". Slides can be
from different print-runs.
2. Change the R working directory to "MyReferences", as described in
Section 1.
3. Load the arrayQuality package by typing library(arrayQuality) in R.
4. Create your reference quality measures by typing:
> myReference <- globalQuality()
5. Change the R working directory to the directory containing the
slides you would like to test, as described in Section 1. You can only
compare slides from the same print-run here. If you have an experiment
using two print-runs, you will need to run gpQuality twice.
6. Run gpQuality using the reference measures and the scaling table you
have generated:
> results <- gpQuality(reference = myReference)
Other gpQuality arguments described above can also be applied here.
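When an experiment spans several print-runs, one way to organise the repeated calls is to keep one sub-folder per print-run and loop over the folders. The sketch below only builds and lists a hypothetical folder layout with empty placeholder files; the folder and file names are made up, and the gpQuality call is left as a comment because it needs arrayQuality, real gpr files and a myReference object created as in step 4.

```r
# Hypothetical layout: one sub-folder per print-run under a common root.
root <- file.path(tempdir(), "myExperiment")
for (run in c("printrunA", "printrunB")) {
  dir.create(file.path(root, run), recursive = TRUE, showWarnings = FALSE)
}
# Empty placeholder files standing in for real gpr files.
file.create(file.path(root, "printrunA", c("slide1.gpr", "slide2.gpr")))
file.create(file.path(root, "printrunB", "slide3.gpr"))

# List the gpr files found in each print-run folder.
runDirs <- list.dirs(root, recursive = FALSE)
gprByRun <- lapply(runDirs, list.files, pattern = "\\.gpr$")
names(gprByRun) <- basename(runDirs)
gprByRun

# With arrayQuality loaded and real data, one call per folder might look like:
# for (d in runDirs) {
#   results <- gpQuality(path = d, reference = myReference,
#                        output = TRUE, resdir = file.path(d, "Results"))
# }
```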