| getSeq {BSgenome} | R Documentation |
A convenience function for extracting a set of sequences (or subsequences) from a BSgenome or other object. This man page specifically documents the BSgenome method.
getSeq(x, ...) ## S4 method for signature 'BSgenome': getSeq(x, names, start=NA, end=NA, width=NA, strand="+", as.character=TRUE)
x |
A BSgenome object.
See the available.genomes function for how to install
a genome.
|
names |
The names of the sequences to extract from x.
If missing, then seqnames(x) is used.
See ?seqnames and ?mseqnames to get
the list of single sequences and multiple sequences (respectively)
contained in x.
Here is how the lookup between the names passed to the names
argument and the sequences in x is performed.
For each name in names:
(1) if x contains a single sequence with that name
then this sequence is returned;
(2) otherwise the names of all the elements in all the multiple
sequences are searched: name is treated as a regular
expression and grep is used for this search.
If exactly one sequence is found, then it's returned, otherwise an
error is raised.
|
start, end, width |
Vector of integers (eventually with NAs). |
strand |
A vector containing +s or/and -s.
|
as.character |
TRUE or FALSE. Should the extracted sequences
be returned in a standard character vector?
|
The names, start, end, width and strand
arguments are expanded cyclically to the length of the longest provided
none are of zero length.
A standard character vector when as.character=TRUE.
Note that when as.character=TRUE, then the masks that
are defined on top of the sequences to extract are ignored
(i.e. dropped) if any (see ?`MaskedXString-class`
for more information about masked sequences).
A DNAString or MaskedDNAString
object when as.character=FALSE.
Note that as.character=FALSE is not supported yet when extracting
more than one sequence.
Be aware that using as.character=TRUE can be very inefficient
when the returned character vector contains very long strings
(> 1 million letters) or is itself a long vector (> 10000 strings).
getSeq is much more efficient when used with
as.character=FALSE but this works only for extracting
one sequence at a time for now.
H. Pages; improvements suggested by Matt Settles
available.genomes,
BSgenome-class,
seqnames,
mseqnames,
grep,
subseq,
DNAString,
MaskedDNAString,
[[,BSgenome-method
# Load the Caenorhabditis elegans genome (UCSC Release ce2):
library(BSgenome.Celegans.UCSC.ce2)
# Look at the index of sequences:
Celegans
# Get chromosome V as a DNAString object:
getSeq(Celegans, "chrV", as.character=FALSE)
# which is in fact the same as doing:
Celegans$chrV
# Never try this:
#getSeq(Celegans, "chrV")
# or this (even worse):
#getSeq(Celegans)
# Get the first 20 bases of each chromosome:
getSeq(Celegans, end=20)
# Get the last 20 bases of each chromosome:
getSeq(Celegans, start=-20)
# Extracting small sequences from different chromosomes:
myseqs <- data.frame(
chr=c("chrI", "chrX", "chrM", "chrM", "chrX", "chrI", "chrM", "chrI"),
start=c(NA, -40, 8510, 301, 30001, 9220500, -2804, -30),
end=c(50, NA, 8522, 324, 30011, 9220555, -2801, -11),
strand=c("+", "-", "+", "+", "-", "-", "+", "-")
)
getSeq(Celegans, myseqs$chr,
start=myseqs$start, end=myseqs$end)
getSeq(Celegans, myseqs$chr,
start=myseqs$start, end=myseqs$end, strand=myseqs$strand)
# Get the "NM_058280_up_1000" sequence (belongs to the upstream1000
# multiple sequence) as a character string:
s1 <- getSeq(Celegans, "NM_058280_up_1000")
# or a DNAString object (more efficient):
s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE)
getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1 # TRUE
getSeq(Celegans, "NM_058280_up_5000",
start=-1000, as.character=FALSE) == s2 # TRUE