| matchPDict {Biostrings} | R Documentation |
matchPDict efficiently finds all occurences in a text (the
subject) of any pattern from a set of patterns (the dictionary).
The implementation of matchPDict is based on the Aho-Corasick
algorithm and is very fast.
Note that, for now, matchPDict only works with a dictionary
of DNA patterns.
This man page shows how to use matchPDict for exact matching
of a constant width dictionary i.e. a dictionary where all the
patterns have the same length (same number of nucleotides).
See matchPDict-inexact for inexact matching (or for using a variable width dictionary).
matchPDict(pdict, subject, algorithm="auto", max.mismatch=0, fixed=TRUE) countPDict(pdict, subject, algorithm="auto", max.mismatch=0, fixed=TRUE) ## Manipulation of the MIndex object returned by matchPDict() startIndex(x, all.names=FALSE) endIndex(x, all.names=FALSE) countIndex(x, all.names=FALSE) unlist(x, recursive=TRUE, use.names=TRUE) extractAllMatches(subject, mindex)
pdict |
The preprocessed dictionary (see below for the details). |
subject |
An XString object containing the subject string. For now, only XString subjects of subtype DNAString are supported. |
algorithm |
Not supported yet. |
max.mismatch |
The maximum number of mismatching letters allowed (see
isMatching for the details).
When used on a constant width dictionary, matchPDict (and
countPDict) only support exact matching so max.mismatch
must be zero. See matchPDict-inexact for inexact matching.
|
fixed |
If FALSE then IUPAC extended letters are interpreted as ambiguities
(see isMatching for the details).
When used on a constant width dictionary, matchPDict (and
countPDict) only support exact matching so fixed
must be TRUE. See matchPDict-inexact for inexact matching.
|
x |
A PDict or MIndex object for names.
An MIndex object for the other methods.
|
all.names |
TRUE or FALSE.
|
recursive |
ignored. |
use.names |
ignored. |
mindex |
An MIndex object returned by a previous call to matchPDict.
|
matchPDict returns an MIndex object.
countPDict returns an integer vector.
startIndex, endIndex and countIndex return vectors of the same
length as the input dictionary: startIndex and endIndex return a
list of integer vectors, and countIndex returns an integer vector.
extractAllMatches returns an XStringViews object with names.
H. Pages
PDict,
matchPDict-inexact,
isMatching,
coverage,
matchPattern,
alphabetFrequency,
XStringViews-class,
DNAString-class
## ---------------------------------------------------------------------
## A. WITH UNNAMED PATTERNS
## ---------------------------------------------------------------------
## Creating the pattern dictionary
library(drosophila2probe)
dict0 <- drosophila2probe$sequence # The input dictionary.
length(dict0) # Hundreds of thousands of patterns.
pdict0 <- PDict(dict0) # Store the input dictionary into a
# PDict object (preprocessing).
## Using the pattern dictionary on chromosome 3R
library(BSgenome.Dmelanogaster.UCSC.dm3)
chr3R <- Dmelanogaster$chr3R # Load chromosome 3R
chr3R
mindex <- matchPDict(pdict0, chr3R) # Search...
## Looking at the matches
start_index <- startIndex(mindex) # Get the start index.
length(start_index) # Same as the input dictionary.
start_index[[8220]] # Starts of the 8220th pattern.
end_index <- endIndex(mindex) # Get the end index.
end_index[[8220]] # Ends of the 8220th pattern.
count_index <- countIndex(mindex) # Get the number of matches per pattern.
count_index[[8220]]
mindex[[8220]] # Get the matches for the 8220th pattern.
start(mindex[[8220]]) # Equivalent to startIndex(mindex)[[8220]].
sum(count_index) # Total number of matches.
table(count_index)
i0 <- which(count_index == max(count_index))
dict0[i0] # The pattern with most occurences.
mindex[[i0]] # Its matches as a Views object.
views(chr3R, start_index[[i0]], end_index[[i0]]) # And as an XStringViews object.
## Get the coverage of the original subject
cov3R <- coverage(mindex, 1, length(chr3R))
max(cov3R)
mean(cov3R)
sum(cov3R != 0) / length(cov3R) # Only 2.44
if (interactive()) {
plotCoverage <- function(coverage, start, end)
{
plot.new()
plot.window(c(start, end), c(0, 20))
axis(1)
axis(2)
axis(4)
lines(start:end, coverage[start:end], type="l")
}
plotCoverage(cov3R, 27600000, 27900000)
}
## ---------------------------------------------------------------------
## B. WITH NAMED PATTERNS
## ---------------------------------------------------------------------
dict1 <- dict0[8211:8236] # The input dictionary.
names(dict1) <- LETTERS
dict1[1:5]
pdict1 <- PDict(dict1)
length(pdict1) # Same as the input dictionary.
names(pdict1) # Same as names(dict1).
mindex <- matchPDict(pdict1, chr3R) # Search...
## Looking at the matches
names(mindex) # Same as names(dict1).
start_index <- startIndex(mindex)
start_index
length(start_index) # NOT the same as the input dictionary.
start_index <- startIndex(mindex, all.names=TRUE)
length(start_index) # Same as the input dictionary.
countIndex(mindex)
unlist(mindex)
length(unlist(mindex)) # Total number of matches.
all_matches <- extractAllMatches(chr3R, mindex)
all_matches
names(all_matches)
## ---------------------------------------------------------------------
## C. PERFORMANCE
## ---------------------------------------------------------------------
## If getting the number of matches is what matters only (without regarding
## their positions), then countPDict() will be faster, especially when there
## is a high number of matches
count_index1 <- countPDict(pdict1, chr3R)
identical(count_index1, count_index) # TRUE
if (interactive()) {
## What's the impact of the dictionary width on performance?
## Below is some code that can be used to figure out (will take a long
## time to run). For different widths of the input dictionary, we look at:
## o pptime: preprocessing time (in sec.) i.e. time needed for building
## the PDict object from the truncated input sequences
## o nnodes: nb of nodes in the resulting Aho-Corasick tree
## o nupatt: nb of unique truncated input sequences
## o matchtime: time (in sec.) needed to find all the matches
## o totalcount: total number of matches
getPDictStats <- function(dict0, width, subject)
{
list(
width=width,
pptime=system.time(pdict <- PDict(dict0, tb.end=width, drop.tail=TRUE))[["elapsed"]],
nnodes=length(pdict@actree),
nupatt=sum(pdict@dups == 0),
matchtime=system.time(mindex <- matchPDict(pdict, subject))[["elapsed"]],
totalcount=sum(countIndex(mindex))
)
}
stats <- lapply(6:25, function(width) getPDictStats(dict0, width, chr3R))
stats <- data.frame(do.call(rbind, stats))
stats
}