Author: Valerie Obenchain

Annotating Genomic Ranges

Background

Bioconductor can import diverse sequence-related file types, including fasta, fastq, BAM, VCF, gff, bed, and wig files, among others. Packages support common and advanced sequence manipulation operations such as trimming, transformation, and alignment. Domain-specific analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other approaches. Bioconductor includes an interface to the Sequence Read Archive (via the SRAdb package).

This workflow walks through the annotation of a generic set of ranges with Bioconductor packages. The ranges can be any user-defined region of interest or can be from a public file.

Data Preparation

As a first step, data are put into a GRanges object so we can take advantage of overlap operations and store identifiers as metadata columns.

The first set of ranges consists of variants from a dbSNP Variant Call Format (VCF) file. The file can be downloaded from the NCBI FTP site at ftp://ftp.ncbi.nlm.nih.gov/snp/ and imported with readVcf() from the VariantAnnotation package. Alternatively, the file is available through AnnotationHub.

library(VariantAnnotation)
library(AnnotationHub)

The Hub returns a VcfFile object with a reference to the file on disk.

hub <- AnnotationHub()
fl <- hub[['AH47004']]
fl
## class: VcfFile 
## path: /var/lib/jenkins/.AnnotationHub/52446
## index: /var/lib/jenkins/.AnnotationHub/52447
## isOpen: FALSE 
## yieldSize: NA

Read the data into a VCF object:

vcf <- readVcf(fl, "hg19")
dim(vcf)
## [1] 114699      0
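For a file downloaded to disk rather than retrieved from the Hub, the same readVcf() call can be pointed at a local path. The sketch below is only an illustration: it uses the small chr22.vcf.gz example file shipped with VariantAnnotation as a stand-in for a downloaded dbSNP file, and the object names (vcf_path, vcf_local) are made up for this example.

```r
library(VariantAnnotation)

# Stand-in for a downloaded file: a small example VCF shipped with the
# VariantAnnotation package (not the dbSNP file used in this workflow).
vcf_path <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# readVcf() takes the file and a genome label that is stored on the result.
vcf_local <- readVcf(vcf_path, "hg19")

dim(vcf_local)                  # number of variants x number of samples
head(rowRanges(vcf_local), 3)   # positions as a GRanges
```

The genome label is not validated against the file contents, so it should be chosen to match the build the variants were called against.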

Overlap operations require that seqlevels and the genome of the objects match. Here we modify the VCF seqlevels to match the TxDb.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
head(seqlevels(txdb_hg19))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
seqlevels(vcf)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "X"  "Y"  "MT"
seqlevels(vcf) <- paste0("chr", seqlevels(vcf))

Sanity check to confirm we have matching seqlevels.

intersect(seqlevels(txdb_hg19), seqlevels(vcf))
##  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8" 
##  [9] "chr9"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
## [17] "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"
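Note that only 24 seqlevels are shared: paste0() turns "MT" into "chrMT", whereas UCSC-style naming uses "chrM" for the mitochondrial sequence. An alternative to pasting a prefix is seqlevelsStyle() from GenomeInfoDb, which renames according to a known naming convention. The following is a minimal sketch on a toy GRanges (gr_toy is a made-up object, not part of the workflow); the exact mapping should be checked on real data.

```r
library(GenomicRanges)
library(GenomeInfoDb)   # provides seqlevelsStyle()

# Toy ranges using NCBI/Ensembl-style sequence names
gr_toy <- GRanges(c("1", "X", "MT"), IRanges(c(100, 200, 300), width = 1))
seqlevels(gr_toy)

# Switch to UCSC-style names; "MT" is expected to be mapped to the UCSC
# mitochondrial name rather than simply prefixed with "chr".
seqlevelsStyle(gr_toy) <- "UCSC"
seqlevels(gr_toy)
```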

The genomes already match so no change is needed.

unique(genome(txdb_hg19))
## [1] "hg19"
unique(genome(vcf))
## [1] "hg19"

We are only interested in the standard chromosomes so we drop the rest.

txdb_hg19 <- keepStandardChromosomes(txdb_hg19)
vcf <- keepStandardChromosomes(vcf)
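To see what keepStandardChromosomes() removes, it can help to try it on a small object. The sketch below builds a toy GRanges whose sequence names follow the UCSC hg19 pattern for unplaced scaffolds; the object names are made up for illustration. For objects that carry ranges, pruning.mode = "coarse" is assumed here so that ranges on the dropped sequences are removed along with the seqlevels.

```r
library(GenomicRanges)
library(GenomeInfoDb)   # provides keepStandardChromosomes()

# Three standard chromosomes plus two unplaced scaffolds
gr_all <- GRanges(c("chr1", "chr2", "chr3",
                    "chr1_gl000191_random", "chr1_gl000192_random"),
                  IRanges(1:5, width = 3))

# Drop the non-standard sequences and the ranges located on them
gr_std <- keepStandardChromosomes(gr_all, pruning.mode = "coarse")

seqlevels(gr_std)
length(gr_std)
```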

The GRanges in a VCF object is extracted with 'rowRanges()'.

gr_hg19 <- rowRanges(vcf)

The second set of ranges consists of two user-defined regions on mouse chromosome 4. The idea here is that any region, known or unknown, can be annotated with the following steps.

Load the TxDb package and keep only the standard chromosomes.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb_mm10 <- keepStandardChromosomes(TxDb.Mmusculus.UCSC.mm10.ensGene)

We are creating the GRanges from scratch and can specify the seqlevels (chromosome names) to match the TxDb.

head(seqlevels(txdb_mm10))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
gr_mm10 <- GRanges("chr4", IRanges(c(4000000, 107889000), width=1000))

Now assign the genome.

unique(genome(txdb_mm10))
## [1] "mm10"
genome(gr_mm10) <- "mm10"

Location in and Around Genes

locateVariants() in the VariantAnnotation package annotates ranges with transcript, exon, CDS, and gene IDs from a TxDb. Various extractions are performed on the TxDb (exonsBy(), transcripts(), cdsBy(), etc.) and the results are overlapped with the ranges. An appropriate GRangesList can also be supplied as the annotation. Different variant locations such as 'coding', 'fiveUTR', 'threeUTR', 'spliceSite', 'intron', 'promoter', and 'intergenic' can be searched for by passing the appropriate constructor as the 'region' argument. See ?locateVariants for details.

loc_hg19 <- locateVariants(gr_hg19, txdb_hg19, AllVariants())
table(loc_hg19$LOCATION)
## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic 
##        311     209819       1276       4933      11383      48467 
##   promoter 
##      12440
loc_mm10 <- locateVariants(gr_mm10, txdb_mm10, AllVariants()) 
table(loc_mm10$LOCATION)
## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic 
##          6          1          0          0          0          0 
##   promoter 
##         12
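AllVariants() searches every region type at once. When only one type is of interest, the corresponding constructor can be passed instead. The sketch below recreates the mouse ranges so that it runs on its own; the upstream and downstream widths given to PromoterVariants() are illustrative values, not a recommendation.

```r
library(VariantAnnotation)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)

# Same inputs as the mouse example in this workflow
txdb <- keepStandardChromosomes(TxDb.Mmusculus.UCSC.mm10.ensGene)
gr <- GRanges("chr4", IRanges(c(4000000, 107889000), width = 1000))
genome(gr) <- "mm10"

# Promoter hits only; the widths here are arbitrary illustrative choices
loc_promoter <- locateVariants(gr, txdb,
                               PromoterVariants(upstream = 2000,
                                                downstream = 200))

# Intron hits only
loc_intron <- locateVariants(gr, txdb, IntronVariants())

table(loc_promoter$LOCATION)
table(loc_intron$LOCATION)
```

Restricting the region type generally reduces the amount of work done on the TxDb, which can matter for large sets of ranges.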

Annotate by ID

The IDs returned from locateVariants() can be used in select() to map to IDs in other annotation packages.

library(org.Hs.eg.db)
cols <- c("UNIPROT", "PFAM")
keys <- na.omit(unique(loc_hg19$GENEID))
head(select(org.Hs.eg.db, keys, cols, keytype="ENTREZID"))
##   ENTREZID UNIPROT    PFAM
## 1     9636  P05161 PF00240
## 2   375790  O00468 PF00008
## 3   375790  O00468 PF00050
## 4   375790  O00468 PF00053
## 5   375790  O00468 PF00054
## 6   375790  O00468 PF01390

For the mouse ranges, the 'keytype' argument is set to "ENSEMBL" because the mouse TxDb contains Ensembl rather than Entrez gene IDs.

library(org.Mm.eg.db)
keys <- unique(loc_mm10$GENEID)
head(select(org.Mm.eg.db, keys, cols, keytype="ENSEMBL"))
##              ENSEMBL UNIPROT    PFAM
## 1 ENSMUSG00000028236  Q7TQA3 PF00106
## 2 ENSMUSG00000028608  Q8BHG2 PF05907

Annotate by Position

Files stored in the AnnotationHub have been pre-processed into range-based R objects such as GRanges, GAlignments, and VCF. The positions in our GRanges can be overlapped with the ranges in the AnnotationHub files. This allows for easy subsetting of multiple files, resulting in only the ranges of interest.

Create a 'hub' from AnnotationHub and filter the files based on organism and genome build.

hub <- AnnotationHub()
hub_hg19 <- subset(hub, 
                  (hub$species == "Homo sapiens") & (hub$genome == "hg19"))
length(hub_hg19)
## [1] 5539

Iterate over the first 3 files and extract ranges that overlap 'gr_hg19'.

ov_hg19 <- lapply(1:3, function(i) subsetByOverlaps(hub_hg19[[i]], gr_hg19))

Inspect the results.

names(ov_hg19) <- names(hub_hg19)[1:3]
lapply(ov_hg19, head, n=3)
## $AH3166
## GRanges object with 3 ranges and 5 metadata columns:
##       seqnames                 ranges strand |
##          <Rle>              <IRanges>  <Rle> |
##   [1]    chr14 [ 23388231,  23388425]      - |
##   [2]    chr11 [118436595, 118436793]      - |
##   [3]     chr7 [141438132, 141438281]      + |
##                                            name     score     level
##                                     <character> <integer> <numeric>
##   [1]   chr14:23388231:23388425:-:1.08:0.993703         0  370.1525
##   [2] chr11:118436595:118436793:-:2.21:0.999420         0  255.1222
##   [3]  chr7:141438132:141438281:+:0.37:0.999969         0  197.9053
##          signif    score2
##       <numeric> <integer>
##   [1]  5.15e-09         0
##   [2]  8.41e-09         0
##   [3]  9.31e-09         0
##   -------
##   seqinfo: 24 sequences from hg19 genome
## 
## $AH3912
## GRanges object with 3 ranges and 5 metadata columns:
##       seqnames           ranges strand |        name     score signalValue
##          <Rle>        <IRanges>  <Rle> | <character> <integer>   <numeric>
##   [1]     chr1 [948050, 950479]      * |           .         0   425.73300
##   [2]     chr1 [954433, 956440]      * |           .         0   107.74700
##   [3]     chr1 [970635, 971318]      * |           .         0     2.16979
##          pValue    qValue
##       <numeric> <numeric>
##   [1] 324.00000        -1
##   [2] 324.00000        -1
##   [3]   1.63691        -1
##   -------
##   seqinfo: 23 sequences from hg19 genome
## 
## $AH3913
## GRanges object with 3 ranges and 6 metadata columns:
##       seqnames             ranges strand |        name     score
##          <Rle>          <IRanges>  <Rle> | <character> <integer>
##   [1]     chr1 [1048860, 1049010]      * |           .         0
##   [2]     chr1 [3838920, 3839070]      * |           .         0
##   [3]     chr1 [6051680, 6051830]      * |           .         0
##       signalValue    pValue    qValue      peak
##         <numeric> <numeric> <numeric> <integer>
##   [1]          60   21.9043        -1        -1
##   [2]         451  324.0000        -1        -1
##   [3]         107  321.6620        -1        -1
##   -------
##   seqinfo: 23 sequences from hg19 genome
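To make explicit what subsetByOverlaps() does in the loop above, here is a minimal sketch on toy data. The objects track and query and their coordinates are invented for illustration; track plays the role of a Hub file and query the role of 'gr_hg19'. The second half shows one way to carry an annotation back onto the query ranges with findOverlaps().

```r
library(GenomicRanges)

# Toy "annotation track" with an identifier stored as a metadata column
track <- GRanges("chr1", IRanges(start = c(100, 500, 900), width = 100),
                 name = c("peakA", "peakB", "peakC"))

# Toy query positions
query <- GRanges("chr1", IRanges(start = c(150, 950), width = 1))

# Keep only the track ranges that are hit by at least one query range
hits_track <- subsetByOverlaps(track, query)
hits_track$name

# Pair each query with the track range it overlaps and copy the name over
hits <- findOverlaps(query, track)
query$peak <- NA_character_
query$peak[queryHits(hits)] <- track$name[subjectHits(hits)]
query
```

Note that the copy step assumes each query overlaps at most one track range; with multiple hits per query, later hits would overwrite earlier ones.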

Annotating the mouse ranges in the same fashion is left as an exercise.

Annotating Variants

Amino acid coding changes

For the set of dbSNP variants that fall in coding regions, amino acid changes can be computed. The output contains one row for each variant-transcript match, so a single variant can appear in multiple rows.

library(BSgenome.Hsapiens.UCSC.hg19)
head(predictCoding(vcf, txdb_hg19, Hsapiens), 3)
## GRanges object with 3 ranges and 17 metadata columns:
##               seqnames             ranges strand | paramRangeID
##                  <Rle>          <IRanges>  <Rle> |     <factor>
##   rs397514721     chr1 [1233203, 1233203]      - |         <NA>
##   rs397514721     chr1 [1233203, 1233203]      - |         <NA>
##   rs397514721     chr1 [1233203, 1233203]      - |         <NA>
##                          REF                ALT      QUAL      FILTER
##               <DNAStringSet> <DNAStringSetList> <numeric> <character>
##   rs397514721              T                  A      <NA>           .
##   rs397514721              T                  A      <NA>           .
##   rs397514721              T                  A      <NA>           .
##                    varAllele       CDSLOC    PROTEINLOC   QUERYID
##               <DNAStringSet>    <IRanges> <IntegerList> <integer>
##   rs397514721              T [ 317,  317]           106        49
##   rs397514721              T [1001, 1001]           334        49
##   rs397514721              T [1127, 1127]           376        49
##                      TXID         CDSID      GENEID   CONSEQUENCE
##               <character> <IntegerList> <character>      <factor>
##   rs397514721        4150         12185      116983 nonsynonymous
##   rs397514721        4151         12185      116983 nonsynonymous
##   rs397514721        4152         12185      116983 nonsynonymous
##                     REFCODON       VARCODON         REFAA         VARAA
##               <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
##   rs397514721            GAG            GTG             E             V
##   rs397514721            GAG            GTG             E             V
##   rs397514721            GAG            GTG             E             V
##   -------
##   seqinfo: 24 sequences from hg19 genome; no seqlengths
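Since a variant is repeated once per matching transcript, it is often useful to summarize the result by consequence and by variant. The sketch below is one possible way to do this. To keep it small and runnable on its own, it uses the chr22.vcf.gz example file shipped with VariantAnnotation instead of the dbSNP file, and the object names (vcf22, coding, nonsyn) are made up for this example.

```r
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Small example file from the package; its only seqlevel is "22"
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
vcf22 <- readVcf(fl, "hg19")
seqlevels(vcf22) <- "chr22"    # rename to match the TxDb

coding <- predictCoding(vcf22, txdb, seqSource = Hsapiens)

# Number of rows per consequence type (rows, not distinct variants)
table(coding$CONSEQUENCE)

# Rows per variant: values above 1 indicate multiple transcript matches
rows_per_variant <- table(coding$QUERYID)
summary(as.integer(rows_per_variant))

# Distinct variants with at least one nonsynonymous call
nonsyn <- coding[coding$CONSEQUENCE == "nonsynonymous"]
length(unique(nonsyn$QUERYID))
```

QUERYID indexes the rows of the input object, so it can be used to subset the original VCF to the variants of interest.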
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu precise (12.04.4 LTS)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
##  [2] BSgenome_1.36.1                        
##  [3] rtracklayer_1.28.5                     
##  [4] org.Mm.eg.db_3.1.2                     
##  [5] org.Hs.eg.db_3.1.2                     
##  [6] RSQLite_1.0.0                          
##  [7] DBI_0.3.1                              
##  [8] TxDb.Mmusculus.UCSC.mm10.ensGene_3.1.2 
##  [9] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
## [10] GenomicFeatures_1.20.1                 
## [11] AnnotationDbi_1.30.1                   
## [12] Biobase_2.28.0                         
## [13] AnnotationHub_2.0.2                    
## [14] VariantAnnotation_1.14.3               
## [15] Rsamtools_1.20.4                       
## [16] Biostrings_2.36.1                      
## [17] XVector_0.8.0                          
## [18] GenomicRanges_1.20.5                   
## [19] GenomeInfoDb_1.4.1                     
## [20] IRanges_2.2.4                          
## [21] S4Vectors_0.6.0                        
## [22] BiocGenerics_0.14.0                    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6                  BiocInstaller_1.18.3        
##  [3] formatR_1.2                  futile.logger_1.4.1         
##  [5] bitops_1.0-6                 futile.options_1.0.0        
##  [7] tools_3.2.0                  zlibbioc_1.14.0             
##  [9] biomaRt_2.24.0               digest_0.6.8                
## [11] evaluate_0.7                 shiny_0.12.1                
## [13] stringr_1.0.0                httr_0.6.1                  
## [15] knitr_1.10.5                 R6_2.0.1                    
## [17] XML_3.98-1.2                 BiocParallel_1.2.5          
## [19] lambda.r_1.1.7               magrittr_1.5                
## [21] htmltools_0.2.6              GenomicAlignments_1.4.1     
## [23] mime_0.3                     interactiveDisplayBase_1.6.0
## [25] xtable_1.7-4                 httpuv_1.3.2                
## [27] stringi_0.4-1                RCurl_1.95-4.6

Exercises

Exercise 1: VCF header and reading data subsets.

VCF files can be large, and often only a subset of fields or genomic positions is of interest. The scanVcfHeader() function in the VariantAnnotation package retrieves header information from a VCF file. Based on the information returned from scanVcfHeader(), a ScanVcfParam() object can be created to read in a subset of data from a VCF file.
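As a starting point for this exercise, the sketch below shows the general pattern on the chr22.vcf.gz example file shipped with VariantAnnotation. The chosen fields ("LDAF", "GT") and the genomic interval are illustrative assumptions; for another file they would be picked from its own header. Range-based subsetting needs a compressed, indexed file, so a TabixFile is used here.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# Inspect the header to see which fields the file provides
hdr <- scanVcfHeader(fl)
info(hdr)    # INFO fields
geno(hdr)    # FORMAT (genotype) fields

# Region of interest; the file uses "22" as its sequence name
rng <- GRanges("22", IRanges(start = 50301422, end = 50312106))

# Request one INFO field, one genotype field, and only the given region
param <- ScanVcfParam(info = "LDAF", geno = "GT", which = rng)

vcf_sub <- readVcf(TabixFile(fl), "hg19", param = param)
dim(vcf_sub)
names(geno(vcf_sub))
```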

Exercise 2: Annotate the mouse ranges in 'gr_mm10' with AnnotationHub files.

Exercise 3: Annotate a gene range from Saccharomyces cerevisiae.