| getSeq {BSgenome} | R Documentation |
A convenience function for extracting a set of sequences (or subsequences) from a BSgenome object.
getSeq(bsgenome, names, start=NA, end=NA, width=NA, as.character=TRUE)
bsgenome |
A BSgenome object.
See the available.genomes function for how to install
a genome.
|
names |
The names of the sequences to extract from bsgenome.
If missing, then seqnames(bsgenome) is used.
See ?seqnames and ?mseqnames to get
the list of single sequences and multiple sequences (respectively)
contained in bsgenome.
The names passed to the names argument are matched against the
sequences in bsgenome as follows.
For each name in names:
(1) if bsgenome contains a single sequence with that name,
then this sequence is returned;
(2) otherwise the names of all the elements in all the multiple
sequences are searched, with name treated as a regular
expression and passed to grep.
If this search finds exactly one sequence, it is returned; otherwise an
error is raised.
|
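The two-step lookup described for the names argument can be sketched in base R. This is only an illustration: single_seqs, multi_seq_elements and lookup_seq are hypothetical stand-ins, not BSgenome objects or functions, and the toy sequences are made up.

```r
# Hypothetical stand-ins for the contents of a genome (not real BSgenome objects).
single_seqs <- c(chrI = "ACGTACGT", chrV = "TTGACCA")
multi_seq_elements <- c(NM_058280_up_1000 = "GATTACA",
                        NM_058280_up_2000 = "CCGATTACA",
                        NM_058260_up_1000 = "TTTT")

# Hypothetical helper mimicking the lookup rules (1) and (2).
lookup_seq <- function(name) {
  # Rule (1): exact match among the single sequences.
  if (name %in% names(single_seqs)) return(single_seqs[[name]])
  # Rule (2): treat the name as a regular expression over the element names.
  hits <- grep(name, names(multi_seq_elements))
  if (length(hits) != 1L)
    stop("'", name, "' matches ", length(hits),
         " sequences; exactly one is required")
  multi_seq_elements[[hits]]
}

lookup_seq("chrV")               # exact single-sequence match
lookup_seq("NM_058280_up_1000")  # unique regular-expression hit
try(lookup_seq("NM_058280_up"))  # two hits, so an error is raised
```

Because the name is used as a regular expression, a short or generic name can match several elements and trigger the error, so passing the full element name is the safer choice.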
start, end, width |
Specify these arguments only if you want to extract subsequences
rather than the entire sequences.
In that case, the subsequences specified by start, end
and width (each a single integer or NA) are extracted
by a call to subseq before they are
returned by getSeq.
|
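As a rough illustration of how start, end and width select a subsequence, here is a base R sketch on a plain character string. resolve_range and extract are hypothetical helpers that approximate the behaviour implied by the examples on this page (negative positions count back from the end of the sequence). They are not the subseq implementation, and the handling of a lone width is an assumption.

```r
# Hypothetical helper (not part of BSgenome): turn start/end/width into
# concrete 1-based start and end positions for a sequence of known length.
resolve_range <- function(seq_length, start = NA, end = NA, width = NA) {
  # Negative positions are relative to the end: -1 is the last letter.
  from_end <- function(pos) {
    if (!is.na(pos) && pos < 0) seq_length + pos + 1 else pos
  }
  start <- from_end(start)
  end <- from_end(end)
  if (!is.na(width)) {
    if (!is.na(start) && !is.na(end))
      stop("supply at most two of start, end and width")
    # Assumption: a width given alone is anchored at position 1.
    if (is.na(start) && is.na(end)) start <- 1
    if (is.na(start)) start <- end - width + 1 else end <- start + width - 1
  }
  if (is.na(start)) start <- 1
  if (is.na(end)) end <- seq_length
  c(start = start, end = end)
}

# Hypothetical helper: apply the resolved range to a character string.
extract <- function(x, ...) {
  r <- resolve_range(nchar(x), ...)
  substr(x, r[["start"]], r[["end"]])
}

x <- "ACGTACGTAA"
extract(x, end = 4)               # first 4 letters
extract(x, start = -4)            # last 4 letters
extract(x, start = 3, width = 5)  # 5 letters starting at position 3
```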
as.character |
TRUE or FALSE. Should the extracted sequences
be returned in a standard character vector?
|
getSeq returns a standard character vector when as.character=TRUE.
Note that in this case any masks defined on top of the sequences
to extract are ignored (see ?`MaskedXString-class`
for more information about masked sequences).
It returns a DNAString or MaskedDNAString
object when as.character=FALSE.
Note that as.character=FALSE is not supported when more
than one sequence name is supplied.
Be aware that using as.character=TRUE can be very inefficient
when the returned character vector contains very long strings
(> 1 million letters) or is itself a long vector (> 10000 strings).
getSeq is much more efficient when used with
as.character=FALSE, but for now this works only when extracting
one sequence at a time.
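One way to work within the one-sequence-at-a-time restriction is to call getSeq once per name and collect the results in a list. The sketch below assumes the getSeq signature documented on this page. get_seq_list is a hypothetical helper, not part of BSgenome, and the usage part runs only if the BSgenome.Celegans.UCSC.ce2 package is installed.

```r
# Hypothetical helper: extract each named sequence separately with
# as.character=FALSE and return the results as a named list.
get_seq_list <- function(genome, seq_names, ...) {
  out <- lapply(seq_names, function(nm) {
    # One getSeq call per name; ... forwards start, end and width.
    getSeq(genome, nm, ..., as.character = FALSE)
  })
  names(out) <- seq_names
  out
}

# Usage, guarded so that nothing runs when the genome package is missing.
if (requireNamespace("BSgenome.Celegans.UCSC.ce2", quietly = TRUE)) {
  library(BSgenome.Celegans.UCSC.ce2)
  # First 20 bases of each chromosome, one object per chromosome.
  heads <- get_seq_list(Celegans, seqnames(Celegans), end = 20)
  heads[["chrV"]]
}
```

This makes one call per sequence name, which costs some overhead but avoids building a single large character vector.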
H. Pages; improvements suggested by Matt Settles
available.genomes,
BSgenome-class,
seqnames,
mseqnames,
grep,
subseq,
DNAString,
MaskedDNAString,
[[,BSgenome-method
# Load the Caenorhabditis elegans genome (UCSC Release ce2):
library(BSgenome.Celegans.UCSC.ce2)
# Look at the index of sequences:
Celegans
# Get chromosome V as a DNAString object:
getSeq(Celegans, "chrV", as.character=FALSE)
# which is in fact the same as doing:
Celegans$chrV
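# Never try this (it returns a whole chromosome as one very long
# character string, which is very inefficient):
#getSeq(Celegans, "chrV")
# or this (even worse, since it does so for every chromosome):
#getSeq(Celegans)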
# Get the first 20 bases of each chromosome:
getSeq(Celegans, end=20)
# Get the last 20 bases of each chromosome:
getSeq(Celegans, start=-20)
# Get the "NM_058280_up_1000" sequence (belongs to the upstream1000
# multiple sequence) as a character string:
s1 <- getSeq(Celegans, "NM_058280_up_1000")
# or a DNAString object (more efficient):
s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE)
# The last 1000 bases of "NM_058280_up_5000" are the same sequence:
getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1 # TRUE
getSeq(Celegans, "NM_058280_up_5000",
       start=-1000, as.character=FALSE) == s2 # TRUE