| matchPDict {Biostrings} | R Documentation |
A set of functions for finding all the occurences (aka "matches" or "hits") of a set of patterns (aka the dictionary) in a reference sequence or set of reference sequences (aka the subject)
The following functions differ in what they return: matchPDict
returns the "where" information i.e. the positions in the subject of all the
occurrences of every pattern; countPDict returns the "how many
times" information i.e. the number of occurrences for each pattern;
and whichPDict returns the "who" information i.e. which patterns
in the preprocessed dictionary have at least one match.
vcountPDict is similar to countPDict but it works on
a set of reference sequences in a vectorized fashion.
This man page shows how to use these functions for exact matching of a constant width dictionary i.e. a dictionary where all the patterns have the same length (same number of nucleotides).
See ?`matchPDict-inexact` for how to use these functions
for inexact matching or when the original dictionary has a variable width.
matchPDict(pdict, subject, algorithm="auto",
max.mismatch=0, fixed=TRUE, verbose=FALSE)
countPDict(pdict, subject, algorithm="auto",
max.mismatch=0, fixed=TRUE, verbose=FALSE)
whichPDict(pdict, subject, algorithm="auto",
max.mismatch=0, fixed=TRUE, verbose=FALSE)
vcountPDict(pdict, subject, algorithm="auto",
max.mismatch=0, fixed=TRUE,
collapse=FALSE, weight=1L, verbose=FALSE)
pdict |
A PDict object containing the preprocessed dictionary. |
subject |
An XString or MaskedXString object containing the
subject sequence for matchPDict, countPDict and whichPDict.
An XStringSet object containing the subject sequences for vcountPDict.
For now, only subjects of base class DNAString are supported. |
algorithm |
Not supported yet. |
max.mismatch |
The maximum number of mismatching letters allowed (see
?isMatching for the details).
This man page focuses on exact matching of a constant width
dictionary so max.mismatch=0 in the examples below.
See ?`matchPDict-inexact` for inexact matching.
|
fixed |
If FALSE then IUPAC extended letters are interpreted as ambiguities
(see ?isMatching for the details).
This man page focuses on exact matching of a constant width
dictionary so fixed=TRUE in the examples below.
See ?`matchPDict-inexact` for inexact matching.
|
verbose |
TRUE or FALSE.
|
collapse,weight |
collapse must be FALSE, 1, or 2.
If collapse=FALSE (the default), then weight is ignored
and vcountPDict returns the full matrix of counts (M0).
If collapse=1, then M0 is collapsed "horizontally"
i.e. it is turned into a vector with length equal to
length(pdict).
If weight=1L (the default), then this vector is defined by
rowSums(M0).
If collapse=2, then M0 is collapsed "vertically"
i.e. it is turned into a vector with length equal to
length(subject).
If weight=1L (the default), then this vector is defined by
colSums(M0).
If collapse=1 or collapse=2, then the elements in
subject (collapse=1) or in pdict (collapse=2)
can be weighted thru the weight argument.
In that case, the returned vector is defined by
M0 %*% rep(weight, length.out=length(subject))
and rep(weight, length.out=length(pdict)) %*% M0,
respectively.
|
In this man page, we assume that you know how to preprocess a dictionary
of DNA patterns that can then be used with matchPDict,
countPDict, whichPDict or vcountPDict.
Please see ?PDict if you don't.
When using matchPDict, countPDict, whichPDict
or vcountPDict for exact matching of a constant width dictionary,
the standard way to preprocess the original dictionary is by calling
the PDict constructor on it with no extra arguments.
This returns the preprocessed dictionary in a PDict object
that can be used with any of the functions described here.
If M denotes the number of patterns in the pdict
argument (M <- length(pdict)), then matchPDict returns
an MIndex object of length M,
and countPDict an integer vector of length M.
whichPDict returns an integer vector made of the indices of the
patterns in the pdict argument that have at least one match.
If N denotes the number of sequences in the subject
argument (N <- length(subject)), then vcountPDict
returns an integer matrix with M rows and N columns,
unless the collapse argument is used. In that case, depending
on the type of weight, an integer or numeric vector is returned
(see above for the details).
H. Pages
Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.
PDict-class,
MIndex-class,
matchPDict-inexact,
isMatching,
coverage,MIndex-method,
matchPattern,
alphabetFrequency,
DNAString-class,
DNAStringSet-class,
XStringViews-class,
MaskedDNAString-class
## ---------------------------------------------------------------------
## A. A SIMPLE EXAMPLE OF EXACT MATCHING
## ---------------------------------------------------------------------
## Creating the pattern dictionary:
library(drosophila2probe)
dict0 <- DNAStringSet(drosophila2probe$sequence)
dict0 # The original dictionary.
length(dict0) # Hundreds of thousands of patterns.
pdict0 <- PDict(dict0) # Store the original dictionary in
# a PDict object (preprocessing).
## Using the pattern dictionary on chromosome 3R:
library(BSgenome.Dmelanogaster.UCSC.dm3)
chr3R <- Dmelanogaster$chr3R # Load chromosome 3R
chr3R
mi0 <- matchPDict(pdict0, chr3R) # Search...
## Looking at the matches:
start_index <- startIndex(mi0) # Get the start index.
length(start_index) # Same as the original dictionary.
start_index[[8220]] # Starts of the 8220th pattern.
end_index <- endIndex(mi0) # Get the end index.
end_index[[8220]] # Ends of the 8220th pattern.
count_index <- countIndex(mi0) # Get the number of matches per pattern.
count_index[[8220]]
mi0[[8220]] # Get the matches for the 8220th pattern.
start(mi0[[8220]]) # Equivalent to startIndex(mi0)[[8220]].
sum(count_index) # Total number of matches.
table(count_index)
i0 <- which(count_index == max(count_index))
pdict0[[i0]] # The pattern with most occurrences.
mi0[[i0]] # Its matches as an IRanges object.
Views(chr3R, mi0[[i0]]) # And as an XStringViews object.
## Get the coverage of the original subject:
cov3R <- as.integer(coverage(mi0, width=length(chr3R)))
max(cov3R)
mean(cov3R)
sum(cov3R != 0) / length(cov3R) # Only 2.44% of chr3R is covered.
if (interactive()) {
plotCoverage <- function(cx, start, end)
{
plot.new()
plot.window(c(start, end), c(0, 20))
axis(1)
axis(2)
axis(4)
lines(start:end, cx[start:end], type="l")
}
plotCoverage(cov3R, 27600000, 27900000)
}
## ---------------------------------------------------------------------
## B. NAMING THE PATTERNS
## ---------------------------------------------------------------------
## The names of the original patterns, if any, are propagated to the
## PDict and MIndex objects:
names(dict0) <- mkAllStrings(letters, 4)[seq_len(length(dict0))]
dict0
dict0[["abcd"]]
pdict0n <- PDict(dict0)
names(pdict0n)[1:30]
pdict0n[["abcd"]]
mi0n <- matchPDict(pdict0n, chr3R)
names(mi0n)[1:30]
mi0n[["abcd"]]
## This is particularly useful when unlisting an MIndex object:
unlist(mi0)[1:10]
unlist(mi0n)[1:10] # keep track of where the matches are coming from
## ---------------------------------------------------------------------
## C. PERFORMANCE
## ---------------------------------------------------------------------
## If getting the number of matches is what matters only (without
## regarding their positions), then countPDict() will be faster,
## especially when there is a high number of matches:
count_index0 <- countPDict(pdict0, chr3R)
identical(count_index0, count_index) # TRUE
if (interactive()) {
## What's the impact of the dictionary width on performance?
## Below is some code that can be used to figure out (will take a long
## time to run). For different widths of the original dictionary, we
## look at:
## o pptime: preprocessing time (in sec.) i.e. time needed for
## building the PDict object from the truncated input
## sequences;
## o nnodes: nb of nodes in the resulting Aho-Corasick tree;
## o nupatt: nb of unique truncated input sequences;
## o matchtime: time (in sec.) needed to find all the matches;
## o totalcount: total number of matches.
getPDictStats <- function(dict, subject)
{
ans_width <- width(dict[1])
ans_pptime <- system.time(pdict <- PDict(dict))[["elapsed"]]
pptb <- pdict@threeparts@pptb
ans_nnodes <- length(pptb@nodes) %/%
Biostrings:::.ACtree.ints_per_acnode(pptb)
ans_nupatt <- sum(!duplicated(pdict))
ans_matchtime <- system.time(
mi0 <- matchPDict(pdict, subject)
)[["elapsed"]]
ans_totalcount <- sum(countIndex(mi0))
list(
width=ans_width,
pptime=ans_pptime,
nnodes=ans_nnodes,
nupatt=ans_nupatt,
matchtime=ans_matchtime,
totalcount=ans_totalcount
)
}
stats <- lapply(6:25,
function(width)
getPDictStats(DNAStringSet(dict0, end=width), chr3R))
stats <- data.frame(do.call(rbind, stats))
stats
}
## ---------------------------------------------------------------------
## D. vcountPDict()
## ---------------------------------------------------------------------
subject <- Dmelanogaster$upstream1000[1:200]
subject
mat1 <- vcountPDict(pdict0, subject)
dim(mat1) # length(pdict0) x length(subject)
nhit_per_probe <- rowSums(mat1)
table(nhit_per_probe)
## Without vcountPDict(), 'mat1' could have been computed with:
mat2 <- sapply(unname(subject), function(x) countPDict(pdict0, x))
identical(mat1, mat2) # TRUE
## but using vcountPDict() is faster (10x or more, depending of the
## average length of the sequences in 'subject').
if (interactive()) {
## This will fail (with message "allocMatrix: too many elements
## specified") because, on most platforms, vectors and matrices in R
## are limited to 2^31 elements:
subject <- Dmelanogaster$upstream1000
vcountPDict(pdict0, subject)
length(pdict0) * length(Dmelanogaster$upstream1000)
1 * length(pdict0) * length(Dmelanogaster$upstream1000) # > 2^31
## But this will work:
nhit_per_seq <- vcountPDict(pdict0, subject, collapse=2)
sum(nhit_per_seq >= 1) # nb of subject sequences with at least 1 hit
table(nhit_per_seq)
which(nhit_per_seq == 37) # 603
sum(countPDict(pdict0, subject[[603]])) # 37
}
## ---------------------------------------------------------------------
## E. RELATIONSHIP BETWEEN vcountPDict(), countPDict() AND
## vcountPattern()
## ---------------------------------------------------------------------
dict3 <- DNAStringSet(mkAllStrings(DNA_BASES, 3)) # all trinucleotides
dict3
pdict3 <- PDict(dict3)
subject <- Dmelanogaster$upstream1000
subject
## The 3 following calls are equivalent (from faster to slower):
mat3a <- vcountPDict(pdict3, subject)
mat3b <- sapply(dict3, function(pattern) vcountPattern(pattern, subject))
mat3c <- sapply(unname(subject), function(x) countPDict(pdict3, x))
stopifnot(identical(mat3a, t(mat3b)))
stopifnot(identical(mat3a, mat3c))
## The 2 following calls are equivalent (from faster to slower):
nhitpp3a <- vcountPDict(pdict3, subject, collapse=1) # rowSums(mat3a)
nhitpp3b <- sapply(dict3, function(pattern) sum(vcountPattern(pattern, subject)))
stopifnot(identical(nhitpp3a, nhitpp3b))
## The 2 following calls are equivalent (from faster to slower):
nhitps3a <- vcountPDict(pdict3, subject, collapse=2) # colSums(mat3a)
nhitps3b <- sapply(unname(subject), function(x) sum(countPDict(pdict3, x)))
stopifnot(identical(nhitps3a, nhitps3b))