duplicateDiscordance {SeqVarTools} | R Documentation |
Find discordance rate for duplicate sample pairs
## S4 method for signature 'SeqVarData,missing'
duplicateDiscordance(gdsobj, match.samples.on="subject.id", check.phase=FALSE, verbose=TRUE)

## S4 method for signature 'SeqVarData,SeqVarData'
duplicateDiscordance(gdsobj, obj2, match.samples.on=c("subject.id", "subject.id"),
    match.variants.on=c("alleles", "position"),
    discordance.type=c("genotype", "hethom"),
    by.variant=FALSE, verbose=TRUE)
gdsobj |
A SeqVarData object. |
obj2 |
A SeqVarData object. |
match.samples.on |
Character string or vector of strings indicating which column should be used for matching samples. See details. |
match.variants.on |
Character string of length one indicating how to match variants. See details. |
discordance.type |
Character string describing how discordances should be calculated. See details. |
check.phase |
A logical indicating whether phase should be considered when calculating discordance. |
by.variant |
A logical indicating whether to calculate discordance by variant (TRUE) or by sample (FALSE). |
verbose |
A logical indicating whether to print progress messages. |
For calls that involve only one gds file, duplicate discordance is calculated by sample pair and by variant.
If there are more than two samples per subject in samples, only the first two samples are used and a warning message is printed.
If check.phase=TRUE, variants with mismatched phase are considered discordant.
If check.phase=FALSE, phase is ignored.
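The single-file calculation can be illustrated with a minimal base R sketch. It is not the package's implementation: the dosage matrix, the sample pairing, and the column name discord.rate are hypothetical, and only the num.discord / num.var and num.discord / num.pair ratios come from the description of the return value.

```r
# Hypothetical toy data: alternate-allele dosage (0, 1, 2; NA = missing).
# Rows are variants, columns are samples.
dosage <- matrix(c(0, 0, 1,  1,
                   1, 2, 0,  0,
                   2, 2, NA, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("var", 1:3), paste0("samp", 1:4)))

# Hypothetical pairing: two samples per subject.
pairs <- list(subj1 = c("samp1", "samp2"), subj2 = c("samp3", "samp4"))

# TRUE where the two calls of a pair differ, NA where either call is missing.
discord <- sapply(pairs, function(p) dosage[, p[1]] != dosage[, p[2]])

# Per sample pair: discordances over non-missing variants.
by.subject <- data.frame(num.discord = colSums(discord, na.rm = TRUE),
                         num.var = colSums(!is.na(discord)))
by.subject$discord.rate <- by.subject$num.discord / by.subject$num.var

# Per variant: discordances over pairs with non-missing data.
by.variant <- data.frame(num.discord = rowSums(discord, na.rm = TRUE),
                         num.pair = rowSums(!is.na(discord)))
by.variant$discord.rate <- by.variant$num.discord / by.variant$num.pair

# Phase illustration: "0|1" and "1|0" have the same dosage but different phase.
call1 <- "0|1"
call2 <- "1|0"
alt.count <- function(x) sum(as.integer(strsplit(x, "|", fixed = TRUE)[[1]]))
discordant.ignoring.phase <- alt.count(call1) != alt.count(call2)
discordant.checking.phase <- call1 != call2
```

In this sketch a missing call removes the variant from the denominator for that pair instead of counting it as a discordance, which matches the "non-missing" wording used for the returned counts.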
For calls that involve two gds files, duplicate discordance is calculated by matching sample pairs and variants between the two data sets.
Only biallelic SNVs are considered in the comparison.
Variants can be matched using chromosome and position only (match.variants.on="position") or by using chromosome, position, and alleles (match.variants.on="alleles").
If matching on alleles and the reference allele in the first dataset is the alternate allele in the second dataset, the genotype dosage will be recoded so the same allele is counted before making the comparison.
If a variant in one dataset maps to multiple variants in the other dataset, only the first pair is considered for the comparison.
Discordances can be calculated using either genotypes (discordance.type = "genotype") or heterozygote/homozygote status (discordance.type = "hethom").
The latter is a method to calculate discordance that does not require alleles to be measured on the same strand in both datasets, so it is probably best to also set match.variants.on = "position" if using the "hethom" option.
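As an illustration of allele matching and dosage recoding, the base R sketch below merges two hypothetical variant tables on chromosome and position, keeps pairs whose alleles agree directly or with reference and alternate swapped, and recodes the dosage for swapped pairs as 2 minus the dosage. The tables, column names, and dosage values are assumptions made for the example; this is not the package's internal code.

```r
# Hypothetical variant annotation for two datasets (biallelic SNVs only).
vars1 <- data.frame(variant.id = 1:3,
                    chr = c("1", "1", "2"),
                    pos = c(100L, 200L, 300L),
                    ref = c("A", "C", "G"),
                    alt = c("G", "T", "A"),
                    stringsAsFactors = FALSE)
vars2 <- data.frame(variant.id = 11:13,
                    chr = c("1", "1", "2"),
                    pos = c(100L, 200L, 300L),
                    ref = c("A", "T", "G"),
                    alt = c("G", "C", "T"),
                    stringsAsFactors = FALSE)

# Match on chromosome and position first.
m <- merge(vars1, vars2, by = c("chr", "pos"), suffixes = c(".1", ".2"))

# Then require the same pair of alleles, in the same or the swapped order.
m$same <- m$ref.1 == m$ref.2 & m$alt.1 == m$alt.2
m$swapped <- m$ref.1 == m$alt.2 & m$alt.1 == m$ref.2
matched <- m[m$same | m$swapped, ]

# Hypothetical dataset-2 dosages for one sample at the matched variants.
dosage2 <- c(1, 2)

# For swapped alleles, count the other allele so both datasets agree.
recoded <- ifelse(matched$swapped, 2 - dosage2, dosage2)
```

Here the variant at chr 2, position 300 is dropped because its alternate alleles differ, and the dosage of 2 at the swapped variant becomes 0.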
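The difference between the two discordance types can be sketched in base R as follows. The dosage vectors are hypothetical, and the comparison is an illustration of the idea, not the package's code: under "genotype" the dosages must be equal, while under "hethom" only heterozygote versus homozygote status is compared.

```r
# Hypothetical dosages for one sample pair. Dataset 2 counts the opposite
# allele at the first and third variants (for example, a strand difference).
d1 <- c(0, 1, 2, 1, NA)
d2 <- c(2, 1, 0, 2, 1)

# Genotype discordance: the dosages themselves must agree.
geno.discord <- d1 != d2

# Het/hom discordance: only heterozygous (dosage 1) versus homozygous
# (dosage 0 or 2) status is compared, so opposite homozygotes count as agreeing.
is.het <- function(d) d == 1
hethom.discord <- is.het(d1) != is.het(d2)

n.compared <- sum(!is.na(d1) & !is.na(d2))
geno.rate <- sum(geno.discord, na.rm = TRUE) / n.compared
hethom.rate <- sum(hethom.discord, na.rm = TRUE) / n.compared
```

With these values the genotype comparison flags three of the four non-missing variants, while the het/hom comparison flags only the one where a heterozygote is paired with a homozygote.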
The argument match.samples.on can be used to select which column in the sampleData of the input SeqVarData object should be used for matching samples.
For one gds file, match.samples.on should be a single string.
For two gds files, match.samples.on should be a length-2 vector of character strings, where the first element is the column to use for the first gds object and the second element is the column to use for the second gds file.
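A minimal base R sketch of the two-file sample matching, assuming two hypothetical sample annotation tables whose subject identifiers are stored under different column names. The tables and the use of merge are illustrative assumptions, not the package's implementation.

```r
# Hypothetical sample annotation for the two datasets.
samp1 <- data.frame(sample.id = c("A1", "A2", "A3"),
                    subject.id = c("s1", "s2", "s3"),
                    stringsAsFactors = FALSE)
samp2 <- data.frame(sample.id = c("B7", "B8", "B9"),
                    subjectID = c("s3", "s1", "s4"),
                    stringsAsFactors = FALSE)

# First element names the column in dataset 1, second the column in dataset 2.
match.samples.on <- c("subject.id", "subjectID")

# Keep only subjects present in both datasets.
pairs <- merge(samp1, samp2,
               by.x = match.samples.on[1], by.y = match.samples.on[2],
               suffixes = c(".1", ".2"))
```

The result has one row per shared subject (s1 and s3 here), with sample.id.1 and sample.id.2 identifying the samples to compare.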
To exclude certain variants or samples from the calculation, use seqSetFilter to set appropriate filters on each gds object.
For calls involving one gds file, a list with the following elements:
by.variant |
A data.frame with the number of discordances for each variant, the number of sample pairs with non-missing data, and the discordance rate (num.discord / num.pair). Row names are variant ids. |
by.subject |
A data.frame with the sample ids for each pair, the
number of discordances, the number of non-missing variants, and the
discordance rate (num.discord / num.var). Row.names are subject.id
(as given in |
For calls involving two gds files, a data frame with the following columns, depending on whether by.variant=TRUE or FALSE:
subjectID |
currently, this is the sample ID ( |
sample.id.1/variant.id.1 |
sample id or variant id in the first gds file |
sample.id.2/variant.id.2 |
sample id or variant id in the second gds file |
n.variants/n.samples |
the number of non-missing variants or samples that were compared |
n.concordant |
the number of concordant variants |
n.alt |
the number of variants involving the alternate allele in either sample |
n.alt.conc |
the number of concordant variants involving the alternate allele in either sample |
n.het.ref |
the number of mismatches where one call is a heterozygote and the other is a reference homozygote |
n.het.alt |
the number of mismatches where one call is a heterozygote and the other is an alternate homozygote |
n.ref.alt |
the number of mismatches where the calls are opposite homozygotes |
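The count columns can be illustrated for a single sample pair with the base R sketch below. The dosage vectors are hypothetical, and the reading of n.alt as "at least one of the two calls carries the alternate allele" is an assumption based on the column descriptions, not a statement about the package's code.

```r
# Hypothetical alternate-allele dosages for one sample pair (NA = missing).
d1 <- c(0, 0, 1, 1, 2, 2, 0, NA)
d2 <- c(0, 1, 1, 2, 2, 0, 0, 1)

# Compare only variants that are non-missing in both samples.
ok <- !is.na(d1) & !is.na(d2)
a <- d1[ok]
b <- d2[ok]
lo <- pmin(a, b)
hi <- pmax(a, b)

counts <- c(n.variants = length(a),
            n.concordant = sum(a == b),
            n.alt = sum(hi > 0),              # assumed: alt allele in either call
            n.alt.conc = sum(a == b & a > 0), # concordant and not both hom ref
            n.het.ref = sum(lo == 0 & hi == 1),
            n.het.alt = sum(lo == 1 & hi == 2),
            n.ref.alt = sum(lo == 0 & hi == 2))
```

Under this reading the three mismatch counts and n.concordant add up to n.variants, and a concordance rate restricted to alternate-allele calls could be formed as n.alt.conc / n.alt.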
Stephanie Gogarten, Adrienne Stilp
library(SeqVarTools)
require(Biobase)

gds <- seqOpen(seqExampleFileName("gds"))

## the example file has one sample per subject, but we
## will match the first four samples into pairs as an example
sample.id <- seqGetData(gds, "sample.id")
samples <- AnnotatedDataFrame(data.frame(subject.id=rep(c("subj1", "subj2"), times=45),
                                         sample.id=sample.id,
                                         stringsAsFactors=FALSE))
seqData <- SeqVarData(gds, sampleData=samples)

## set a filter on the first four samples
seqSetFilter(seqData, sample.id=sample.id[1:4])

disc <- duplicateDiscordance(seqData)
head(disc$by.variant)
disc$by.subject

seqClose(gds)