readAligned {ShortRead} | R Documentation |
readAligned reads all aligned read files in a directory dirPath whose file name matches pattern, returning a compact internal representation of the alignments, sequences, and quality scores in the files. Methods read all files into a single R object; a typical use is to restrict input to a single aligned read file.
readAligned(dirPath, pattern=character(0), ...)
dirPath |
A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) of aligned read files to be input. |
pattern |
The (grep-style) pattern describing file names to be read. The default (character(0)) results in (attempted) input of all files in the directory. |
... |
Additional arguments, used by methods. When dirPath is a character vector, the argument type must be provided. Possible values for type and their meaning are described below. Most methods implement filter=srFilter(), allowing objects of SRFilter to selectively return aligned reads. |
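As a rough illustration of how a grep-style pattern narrows the set of files, the sketch below uses base R's list.files() on a throwaway directory. It assumes that readAligned selects file names in a comparable way; the directory and file names are hypothetical and the files are empty.

```r
# Hypothetical directory holding three empty files; only the names matter.
d <- file.path(tempdir(), "aligned_demo")
dir.create(d, showWarnings = FALSE)
invisible(file.create(file.path(
  d, c("s_1_export.txt", "s_2_export.txt", "s_2_realign.txt"))))

# A specific pattern restricts input to a single file ...
one <- list.files(d, pattern = "s_2_export.txt")

# ... while a looser regular expression selects a family of files.
exports <- list.files(d, pattern = "_export\\.txt$")

# Omitting the pattern selects every file in the directory.
all_files <- list.files(d)
```

Because all matching files are combined into one object, checking the match with list.files() first is a cheap way to confirm that a pattern selects only the intended files.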
There is no standard aligned read file format; methods parse particular file types.
The readAligned,character-method interprets file types based on an additional type argument. Supported types are:
type="SolexaExport"
This type parses .*_export.txt files following the documentation in the Solexa Genome Alignment software manual, version 0.3.0. These files consist of the following columns; consult Solexa documentation for precise descriptions. If parsed, values can be retrieved from AlignedRead as follows:
alignData
alignData
alignData
alignData
alignData
sread
quality
chromosome
position
strand
alignQuality
alignData
Paired read columns are not interpreted. The resulting AlignedRead object does not contain a meaningful id; instead, use information from alignData to identify reads.
Different interfaces to reading alignment files are described in SolexaPath and SolexaSet.
type="SolexaPrealign"
type="SolexaAlign"
type="SolexaRealign"
These types parse s_L_TTTT_prealign.txt, s_L_TTTT_align.txt or s_L_TTTT_realign.txt files produced by default and eland analyses. From the Solexa documentation, align corresponds to unfiltered first-pass alignments, prealign adjusts alignments for error rates (when available), and realign filters alignments to exclude clusters failing to pass quality criteria.
Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.
If parsed, values can be retrieved from AlignedRead as follows:
sread
alignQuality
alignData
position
strand
readXStringColumns
alignData
type="SolexaResult"
This parses s_L_eland_results.txt files, an intermediate format that does not contain read or alignment quality scores.
Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.
Columns of this file type can be retrieved from AlignedRead as follows (description of columns is from Table 19, Genome Analyzer Pipeline Software User Guide, Revision A, January 2008):
sread
alignData as matchCode. Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches).
alignData as nExactMatch
alignData as nOneMismatch
alignData as nTwoMismatch
chromosome
position
strand
alignData, as NCharacterTreatment. ‘.’ indicates treatment of ‘N’ was not applicable; ‘D’ indicates treatment as deletion; ‘|’ indicates treatment as insertion.
alignData as mismatchDetailOne and mismatchDetailTwo. Present only for unique inexact matches at one or two positions. Position and type of first substitution error, e.g., 11A represents 11 matches with 12th base an A in reference but not read. The reference manual cited below lists only one field (mismatchDetailOne), but two are present in files seen in the wild.
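The match codes and mismatch details above can be decoded with a few lines of base R. This is a minimal sketch: match_codes simply restates the list above, and parse_mismatch is a hypothetical helper that assumes every mismatch detail has the form shown in the single example (a count of matching bases followed by one reference base). Real files may contain other forms.

```r
# Match codes as listed above (from the Eland manual).
match_codes <- c(
  NM = "no match",
  QC = "no match due to quality control failure",
  RM = "no match due to repeat masking",
  U0 = "best match was unique and exact",
  U1 = "best match was unique, with 1 mismatch",
  U2 = "best match was unique, with 2 mismatches",
  R0 = "multiple exact matches found",
  R1 = "multiple 1 mismatch matches found, no exact matches",
  R2 = "multiple 2 mismatch matches found, no exact or 1-mismatch matches"
)

# Hypothetical helper: split a detail such as "11A" into the number of
# leading matches, the 1-based mismatch position, and the reference base.
parse_mismatch <- function(x) {
  m <- regmatches(x, regexec("^([0-9]+)([ACGTN])$", x))[[1]]
  if (length(m) != 3L) stop("unrecognised mismatch detail: ", x)
  n_match <- as.integer(m[2])
  list(matches = n_match, position = n_match + 1L, reference = m[3])
}

detail <- parse_mismatch("11A")
unique_hit <- startsWith(names(match_codes), "U")
```

Here detail describes 11 matching bases followed by a mismatch at the 12th base, where the reference has an A, and unique_hit flags the codes that correspond to a unique best match.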
type="MAQMap", records=-1L
Parse map files produced by MAQ. See details in the next section. The records option determines how many lines are read; -1L (the default) means that all records are input.
type="MAQMapShort", records=-1L
The same as type="MAQMap" but for map files made with Maq prior to version 0.7.0. (These files use a different maximum read length [64 instead of 128], and are hence incompatible with newer Maq map files.)
type="MAQMapview"
Parse alignment files created by MAQ's ‘mapview’ command. Interpretation of columns is based on the description in the MAQ manual, specifically
...each line consists of read name, chromosome, position, strand, insert size from the outer coordinates of a pair, paired flag, mapping quality, single-end mapping quality, alternative mapping quality, number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp on the reference, length of the read, read sequence and its quality.
The read name, read sequence, and quality are read as XStringSet objects. Chromosome and strand are read as factors. Position and mapping quality are both numeric. These fields are mapped to their corresponding representation in AlignedRead objects.
Number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp are represented in the AlignedRead object as components of alignData.
Remaining fields are currently ignored.
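To make the field order in the quoted description concrete, the following sketch splits one record into its 16 named fields using base R. It assumes the fields are tab-delimited; the field names and the record itself are hypothetical, chosen only for illustration, and are not the names used inside AlignedRead.

```r
# Field names follow the order of the MAQ manual's description; the names
# themselves are hypothetical labels.
fields <- c("readName", "chromosome", "position", "strand", "insertSize",
            "pairedFlag", "mapQ", "singleEndMapQ", "altMapQ",
            "nMismatchBestHit", "mismatchQualSum", "nExactMatch24",
            "nOneMismatch24", "readLength", "sequence", "quality")

# A made-up record, assumed to be tab-delimited.
line <- paste(c("read_1", "chr1.fa", "1042", "+", "0", "0", "30", "30", "0",
                "1", "25", "1", "0", "8", "ACGTACGT", "IIIIIIII"),
              collapse = "\t")

rec <- setNames(strsplit(line, "\t", fixed = TRUE)[[1]], fields)
stopifnot(length(rec) == length(fields))

# Types mirror the text above: numeric position and mapping quality,
# factor chromosome and strand.
position   <- as.numeric(rec[["position"]])
mapq       <- as.numeric(rec[["mapQ"]])
chromosome <- factor(rec[["chromosome"]])
strand     <- factor(rec[["strand"]], levels = c("+", "-"))
```

The four mismatch and hit-count fields in the middle of the record are the ones the text describes as ending up in alignData.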
type="Bowtie"
Parse alignment files created with the Bowtie alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:
id
strand
chromosome
position; see comment below
sread; see comment below
quality; see comments below
alignData
This method includes the argument qualityType to specify how quality scores are encoded. Bowtie quality scores are ‘Solexa’-like by default, with qualityType='SFastqQuality', but can be specified as ‘Phred’-like, with qualityType='FastqQuality'.
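The choice of qualityType matters because the same quality string decodes to different scores under the two encodings. The sketch below assumes the conventional ASCII offsets (64 for ‘Solexa’-like, 33 for ‘Phred’-like); decode_quality is a hypothetical helper and the quality string is made up.

```r
# Hypothetical helper: convert a quality string to integer scores by
# subtracting an ASCII offset (an assumption: 64 Solexa-like, 33 Phred-like).
decode_quality <- function(x, offset) utf8ToInt(x) - offset

q <- "hhhh;"                             # made-up quality string
solexa_like <- decode_quality(q, 64L)    # interpretation as 'SFastqQuality'
phred_like  <- decode_quality(q, 33L)    # interpretation as 'FastqQuality'

# The two interpretations differ by a constant 31 at every position.
shift <- unique(phred_like - solexa_like)
```

Implausibly high scores under the ‘Phred’-like reading, or negative scores under the ‘Solexa’-like reading, are a quick hint about which encoding a file actually uses.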
Bowtie outputs positions that are 0-offset from the left-most end of the + strand. ShortRead parses position information to be 1-offset from the left-most end of the + strand.
Bowtie outputs reads aligned to the - strand as their reverse complement, and reverses the quality score string of these reads. ShortRead parses these to their original sequence and orientation.
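The two adjustments described above can be sketched in base R. This is an illustration of the transformations only, not ShortRead's implementation; revcomp and reverse_string are hypothetical helpers, and the positions, sequence, and quality string are made up.

```r
# 0-offset positions, as reported by the aligner, become 1-offset.
reported_pos <- c(0L, 41L)
pos_1_offset <- reported_pos + 1L

# Hypothetical helpers for restoring the original read orientation.
reverse_string <- function(x) {
  paste(rev(strsplit(x, "", fixed = TRUE)[[1]]), collapse = "")
}
revcomp <- function(x) {
  chartr("ACGTacgt", "TGCAtgca", reverse_string(x))
}

# A '-' strand read is reported as its reverse complement with the quality
# string reversed; applying both operations again recovers the original.
reported_seq  <- "AACG"
reported_qual <- "IIH#"
original_seq  <- revcomp(reported_seq)
original_qual <- reverse_string(reported_qual)
```

Reverse complementing is its own inverse, so applying revcomp to original_seq returns the sequence as reported by the aligner.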
type="SOAP"
Parse alignment files created with the SOAP alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:
id
sread; see comment below
quality; see comment below
alignData
alignData (pairedEnd)
alignData (alignedLength)
strand
chromosome
position; see comment below
alignData (typeOfHit: integer portion; hitDetail: text portion)
This method includes the argument qualityType to specify how quality scores are encoded. It is unclear from SOAP documentation what the quality score is; the default is ‘Solexa’-like, with qualityType='SFastqQuality', but can be specified as ‘Phred’-like, with qualityType='FastqQuality'.
SOAP outputs positions that are 1-offset from the left-most end of the + strand. ShortRead preserves this representation.
SOAP reads aligned to the - strand are reported by SOAP as their reverse complement, with the quality string of these reads reversed. ShortRead parses these to their original sequence and orientation.
A single R object (e.g., AlignedRead) containing alignments, sequences and qualities of all files in dirPath matching pattern. The order in which files are read is not guaranteed.
Martin Morgan <mtmorgan@fhcrc.org>, Simon Anders <anders@ebi.ac.uk> (MAQ map)
An AlignedRead object.
Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.
The MAQ reference manual, http://maq.sourceforge.net/maq-manpage.shtml#5, 3 May, 2008.
The Bowtie reference manual, http://bowtie-bio.sourceforge.net, 28 October, 2008.
The SOAP reference manual, http://soap.genomics.org.cn/soap1, 16 December, 2008.
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
## ELAND_EXTENDED
readAligned(ap, "s_2_export.txt", "SolexaExport")
## PhageAlign
readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign")
## MAQ
dirPath <- system.file('extdata', 'maq', package='ShortRead')
list.files(dirPath)
## First line
readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
countLines(dirPath)
## two files collapse into one
readAligned(dirPath, type="MAQMapview")
## select only chr1-5.fa, '+' strand
filt <- compose(chromosomeFilter("chr[1-5].fa"), strandFilter("+"))
readAligned(sp, "s_2_export.txt", filter=filt)