| bc_extract {CellBarcode} | R Documentation |
bc_extract identifies the barcodes (and UMIs) in the sequences using
regular expressions. The pattern and pattern_type arguments are
required: they specify the barcode (and UMI) pattern and the location of
each within the sequences.
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'data.frame'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'ShortReadQ'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'DNAStringSet'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'integer'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'character'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'list'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
x |
A single fastq file, ShortReadQ, DNAStringSet, data.frame, or named integer vector, or a list of any of these. |
pattern |
A string specifying a regular expression with capture groups. The capture groups match the barcode (and UMI). |
sample_name |
A string vector, applicable when |
metadata |
A |
maxLDist |
An integer. The mismatch threshold for barcode matching. When
maxLDist is 0, the |
pattern_type |
A vector. It defines the barcode (and UMI) capture group. See Details. |
costs |
A named list, applicable when maxLDist > 0, specifying the
weight of each type of mismatch event while extracting the barcodes. The list
element names have to be |
ordered |
A logical value. If TRUE, the returned barcodes (UMI-barcode tags) are sorted by read count. |
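To illustrate how weights like the default costs interact with a distance threshold, here is a minimal sketch using base R's adist(), which accepts a similar costs list. It is an assumption for illustration only: the text above does not say how bc_extract computes the distance internally, and max_dist is a hypothetical stand-in for maxLDist.

```r
# Weights mirroring the default `costs` argument (sub = 1, ins = 99, del = 99)
costs <- list(sub = 1, ins = 99, del = 99)

# One substituted base: weighted distance of 1
d_sub <- adist("ACGT", "ACCT", costs = costs)[1, 1]

# One missing base: requires an insertion or deletion, weighted distance of 99
d_indel <- adist("ACGT", "ACG", costs = costs)[1, 1]

# Hypothetical threshold standing in for maxLDist
max_dist <- 1
d_sub <= max_dist    # substitution is tolerated
d_indel <= max_dist  # indel is effectively ruled out
```

With weights like these, a small threshold tolerates substitutions while making insertions and deletions too expensive to accept.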
The pattern argument is a regular expression in which the capture groups,
written with (), identify the barcode or UMI. The pattern_type argument
annotates the capture groups, denoting which group is the UMI and which is
the barcode. In the example:
([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC
|---------| starts with 3 base pairs UMI.
           |----------| constant sequence in the backbone.
                       |-------| flexible barcode sequences.
                                |---------| 3' constant sequence.
In the UMI part [ACTG]{3}, [ACTG] means the base can be any one of
"A", "C", "T" and "G", and {3} means it repeats 3 times.
In the barcode pattern [ACTG]+, the + denotes
that there is at least one base, each being A, C, T or G.
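The capture groups described above can be explored with base R alone. The following is a minimal sketch using regexec() and regmatches(). The read sequence is a made-up example, and this is not necessarily how bc_extract performs the matching internally.

```r
# Hypothetical read: 3 bp UMI, constant region, barcode, 3' constant region
read <- "ACTTCGATCGATCGAAAGGATCGATCGATC"
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"

# regexec() reports the full match followed by each capture group
m <- regmatches(read, regexec(pattern, read))[[1]]

match_seq   <- m[1]  # part of the read matched by the whole pattern
umi_seq     <- m[2]  # capture group 1, corresponding to pattern_type UMI = 1
barcode_seq <- m[3]  # capture group 2, corresponding to pattern_type barcode = 2
```

Here the indices of the capture groups play the role that pattern_type = c(UMI = 1, barcode = 2) plays in bc_extract.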
This function returns a BarcodeObj object if the input is a list or a
vector of fastq files; otherwise it returns a data.frame. In
the latter case,
the data.frame has 5 columns:
reads_seq: full sequence.
match_seq: part of the full sequence matched by pattern.
umi_seq (optional): UMI sequence, applicable when a UMI is specified
in the 'pattern' and 'pattern_type' arguments.
barcode_seq: barcode sequence.
count: number of reads.
The match_seq is part of reads_seq; the umi_seq and
barcode_seq are part of match_seq. The reads_seq is the
full sequence and is the unique id of each record (row). In contrast,
match_seq, umi_seq and barcode_seq may be duplicated between
rows.
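The relationship between these columns can be sketched with base R. The reads and counts below are invented for illustration, and the code only mimics the column layout described above; it does not reproduce bc_extract itself.

```r
# Hypothetical read counts: names are full read sequences, values are counts
reads <- c(
  GGAAAAATCGTCCCCCTT = 5L,
  TTAAAAATCGTCCCCCGG = 3L,
  AAAAATTGACCCCC     = 2L
)
pattern <- "AAAAA(.*)CCCCC"

# For each read: full match first, then the captured barcode
m <- regmatches(names(reads), regexec(pattern, names(reads)))

res <- data.frame(
  reads_seq   = names(reads),
  match_seq   = vapply(m, `[`, character(1), 1),
  barcode_seq = vapply(m, `[`, character(1), 2),
  count       = as.integer(reads),
  stringsAsFactors = FALSE
)

# reads_seq is unique per row, while barcode_seq can repeat across rows
anyDuplicated(res$reads_seq) == 0
any(duplicated(res$barcode_seq))
```

The first two reads differ in their flanking bases, so they occupy separate rows even though they share the same match_seq and barcode_seq.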
library(CellBarcode)
library(ShortRead)
fq_file <- system.file("extdata", "simple.fq", package="CellBarcode")
# barcode from fastq file
bc_extract(fq_file, pattern = "AAAAA(.*)CCCCC")
# barcode from ShortReadQ object
sr <- readFastq(fq_file) # ShortReadQ
bc_extract(sr, pattern = "AAAAA(.*)CCCCC")
# barcode from DNAStringSet object
ds <- sread(sr) # DNAStringSet
bc_extract(ds, pattern = "AAAAA(.*)CCCCC")
# barcode from integer vector
iv <- tables(ds, n = Inf)$top # integer vector
bc_extract(iv, pattern = "AAAAA(.*)CCCCC")
# barcode from data.frame
df <- data.frame(seq = names(iv), freq = as.integer(iv)) # data.frame
bc_extract(df, pattern = "AAAAA(.*)CCCCC")
# barcode from list of DNAStringSet
l <- list(sample1 = ds, sample2 = ds) # list
bc_extract(l, pattern = "AAAAA(.*)CCCCC")
# Extract UMI and barcode
d1 <- data.frame(
    seq = c(
        "ACTTCGATCGATCGAAAAGATCGATCGATC",
        "AATTCGATCGATCGAAGAGATCGATCGATC",
        "CCTTCGATCGATCGAAGAAGATCGATCGATC",
        "TTTTCGATCGATCGAAAAGATCGATCGATC",
        "AAATCGATCGATCGAAGAGATCGATCGATC",
        "CCCTCGATCGATCGAAGAAGATCGATCGATC",
        "GGGTCGATCGATCGAAAAGATCGATCGATC",
        "GGATCGATCGATCGAAGAGATCGATCGATC",
        "ACTTCGATCGATCGAACAAGATCGATCGATC",
        "GGTTCGATCGATCGACGAGATCGATCGATC",
        "GCGTCCATCGATCGAAGAAGATCGATCGATC"
    ),
    freq = c(
        30, 60, 9, 10, 14, 5, 10, 30, 6, 4, 6
    )
)
# barcode backbone with UMI and barcode
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_extract(
    list(test = d1),
    pattern,
    sample_name = c("test"),
    pattern_type = c(UMI = 1, barcode = 2))