
Visualize Categorical Signals
========================================

**Author**: Zuguang Gu ( z.gu@dkfz.de )

**Date**: `r Sys.Date()`

-------------------------------------------------------------

```{r, echo = FALSE, message = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    warning = FALSE,
    message = FALSE)
library(markdown)
options(markdown.HTML.options = setdiff(options('markdown.HTML.options')[[1]], "toc"))
options(markdown.HTML.stylesheet = "custom.css")
options(width = 100)
```

Normally genomic signals are represented as numeric (e.g. methylation rate
from WGBS or histone modification intensity from ChIP-seq) or binary values
(e.g. existance of CpG islands in a given position), however, genomic signals
sometimes can also be represented as categorical or discrete values. A very
typical example is chromatin states segmentation by
[ChromHMM](http://compbio.mit.edu/ChromHMM/). ChromHMM basically assigns a
chromatin state (e.g. active transcription state or repressive transcription
state) to a given window in the genome and the assignment of states in
different windows are always mutually exclusive (which means one window can
only have one state). In this vignette, We will demonstrate how to visualize
the enrichment of various chromatin states around certain genomic targets and
how to integerate with other epigenomic signals.

## General examples

In following example, we use [Roadmap
dataset](http://www.roadmapepigenomics.org/) as the example dataset. First we
show basic usages with using one sample (sample id: E003, embryonic stem cell,
H1 cell line).

Load packages:

```{r}
library(GenomicRanges)
library(data.table)
library(EnrichedHeatmap)
library(circlize)
```

The chromatin state segmentation is trained and applied from [five histone modifications](http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state). Following file in use is from http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E003_15_coreMarks_mnemonics.bed.gz.

```{r}
states_bed = fread("zcat ~/EnrichedHeatmap_test/E003_15_coreMarks_mnemonics.bed.gz")
states = GRanges(seqnames = states_bed[[1]], ranges = IRanges(states_bed[[2]] + 1, states_bed[[3]]), 
	states = states_bed[[4]])
unique(states_bed[[4]])
```

In the 15 states, there are some states which are similar to each other such as `1_TssA` (active TSS) and `2_TssAFlnk` (flanking active TSS).
To reduce the complexity of the analysis, we merge some of the similar states.

```{r}
map = c(
	"1_TssA"      = "TssActive",
	"2_TssAFlnk"  = "TssActive",
	"3_TxFlnk"    = "Transcript",
	"4_Tx"        = "Transcript",
	"5_TxWk"      = "Transcript",
	"6_EnhG"      = "Enhancer",
	"7_Enh"       = "Enhancer",
	"8_ZNF/Rpts"  = "Heterochromatin",
	"9_Het"       = "Heterochromatin",
	"10_TssBiv"   = "TssBivalent",
	"11_BivFlnk"  = "TssBivalent",
	"12_EnhBiv"   = "Enhancer",
	"13_ReprPC"   = "Repressive",
	"14_ReprPCWk" = "Repressive",
	"15_Quies"    = "Quiescent"
)
states$states_simplified = map[states$states]
```

Also we set the colors for the 7 merged states.

```{r}
states_col = c(
	"TssActive"       = "Red",
	"Transcript"      = "Green",
	"Enhancer"        = "Yellow",
	"Heterochromatin" = "PaleTurquoise",
	"TssBivalent"     = "Orange",
	"Repressive"      = "Grey",
	"Quiescent"       = "black"
)
states_name = names(states_col)
n_states = length(states_col)
```

In following, we demonstrate how to normalize the chromatin states to TSS. First we load 
the transcriptome and extract TSS regions. The transcriptome annotation is from http://egg2.wustl.edu/roadmap/data/byDataType/rna/annotations/gen10.long.gtf.gz and we only use protein coding genes. The database file for transcriptome
was generated by **GenomicFeatures** package (the `makeTxDbFromGFF()` function).


```{r}
library(GenomicFeatures)
txdb = loadDb("~/EnrichedHeatmap_test/gencode19_protein_coding_txdb.sqlite")

g = genes(txdb)
tss = promoters(g, upstream = 0, downstream = 1)
```

To reduce the running time, here we only use chromosome 1. Normalizing categorical signals
is basically as same as numeric signals. Here we only need to specify the column name for
the categorical signals.

```{r}
tss_chr1 = tss[seqnames(tss) == "chr1"]
mat_states = normalizeToMatrix(states, tss_chr1, value_column = "states_simplified")
mat_states
```

In the last line of the message, it clarifies it is categorical signals with 7
levels.

The implementation of normalizing categorical signals is actually very simple.
Internally, `n_states` normalized matrices are generated where each one
corresponds to one chromatin state and the value in each window is the
fraction how much the window is overlapped to the state (with `w0` or
`weighted` mean mode). Since the i^th row and the j^th column in all matrices
correspond to a same window around a same target, if there are multiple states
overlap to this window, when summarizing from all states, the state with
largest overlap fraction is assigned to this window.

There is some special visualization designed for categorical signals where
each state is summarized and visualized separatedly in the top annotation.
Here `states_col` must be a named vector where the names should correspond to
the levels of the categorical signals.

```{r categorical_default, fig.width = 4}
EnrichedHeatmap(mat_states, name = "states", col = states_col)
```

You might feel the row ordering is a little bit in a mess. Although the signals are 
categorical, internally, there are coded as integer numbers. Just similar as how **factor** in R
is stored, each categorical level corresponds to an integer number. If the signal column
is simply a character vector, the assignment of integers are based on the natural ordering
of this character vector. Thus the numeric coding for signals in `states` object is:

```{r}
data.frame(states = unique(states$states_simplified), value = 1:7)
```

Zero is assigned to a window if none of the states overlap to it and it is drawn with white.

The numeric coding can be controlled by setting the signals as a factor variable and the level order of the factor
controls the corresponding coding. In following code, we convert `states_simplified` as a 
factor with specifying the level order (note here the order of `states_name` also reflects the closeness
of different states.).


```{r categorical_factor, fig.width = 4}
states$states_simplified = factor(states$states_simplified, levels = states_name)
mat_states = normalizeToMatrix(states, tss_chr1, value_column = "states_simplified")
EnrichedHeatmap(mat_states, name = "states", col = states_col)
```

We suggest to apply hierarchical clustering to reorder rows. Here the numeric coding is espeically
important because it affects the calculation of distance between rows.

```{r categorical_clustered, fig.width = 4}
EnrichedHeatmap(mat_states, name = "states", col = states_col, cluster_rows = TRUE)
```

Since the normalized matrix is numeric, k-means clustering can also be applied.

```{r categorical_kmeans, fig.width = 4}
EnrichedHeatmap(mat_states, name = "states", col = states_col, cluster_rows = TRUE, km = 2)
```

After closely looking at above heatmap, we found the TSS states are consistently enriched around TSS
while in the flanking regions, the states are a little bit diverse. Actually this suggests that in order to enhance
the pattern of the TSS-related states, we can apply k-means clustering only for the states near TSS regions.
In following we only clustering the matrix which is upstrean and downstream 1kb of TSS (The default
extension of TSS is 5kb upstream and downstream, the number of columns in the normalized matrix is 100,
thus from 40^th column to 60^th column are the states in 1kb upstream and downstream of TSS).

Of couse we need to calculate this partitioning in advance.

```{r categorical_partial_kmeans, fig.width = 4}
split = kmeans(mat_states[, 40:60], centers = 2)$cluster
EnrichedHeatmap(mat_states, name = "states", col = states_col, cluster_rows = TRUE, split = split)
```

Now it is quite nice to see genes with active TSS are all clustered in cluster 2 while in cluster 1, genes are
either have no function or have bivalent TSS function.

Similarly, we can visualize how the chromatin states enriched at gene bodies. Since the gene bodies have
unequal widths, we add an additional point plot to show the width of genes.

```{r categorical_gene_body, fig.width = 5}
g_chr1 = g[seqnames(g) == "chr1"]
mat_states_2 = normalizeToMatrix(states, g_chr1, value_column = "states_simplified")
EnrichedHeatmap(mat_states_2, name = "states", col = states_col, cluster_rows = TRUE) +
rowAnnotation(gene_len = row_anno_points(log10(width(g_chr1) + 1), size = unit(1, "mm"),
	gp = gpar(col = "#00000040"), axis = TRUE), width = unit(4, "cm"))
```

Next we add methylation around TSS as well as expression for the corresponding genes.
Expression data is from http://egg2.wustl.edu/roadmap/data/byDataType/rna/expression/57epigenomes.RPKM.pc.gz
and methylation data is from http://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/FractionalMethylation.tar.gz

In following, the raw methylation data has been smoothed by **bsseq** package. The rds file was generated
by the old version of **bsseq** package that is why we used a strange way to extract the positions of CpG
sites as well as smoothed methylation data which can be replaced by using `granges()` and `getMeth()` functions if
the smoothing is applied with newest version of **bsseq** package.


```{r}
expr = read.table("~/EnrichedHeatmap_test/57epigenomes.RPKM.pc.gz",
	row.names = 1, header = TRUE)

obj = readRDS("~/EnrichedHeatmap_test/chr1_roadmap_merged_bsseq.rds")
meth = obj@rowData
meth_mat = obj@trans(obj@assays$data$coef)
mcols(meth) = meth_mat
```

We take the genes which exist both in `tss_chr1` and `expr`.

```{r}
names(tss_chr1) = gsub("\\.\\d+$", "", names(tss_chr1))
cn = intersect(names(tss_chr1), rownames(expr))
tss_chr1 = tss_chr1[cn]
expr = expr[cn, ]
```

We normalize chromatin states as well methylation to TSS on chromosome 1.

```{r}
mat_states = normalizeToMatrix(states, tss_chr1, value_column = "states_simplified")
mat_meth = normalizeToMatrix(meth, tss_chr1, value_column = "E003", mean_mode = "absolute",
	smooth = TRUE)
```

Similarly, k-means is applied on chromatin states 1kb upstream and downstream of TSS.

```{r categorical_with_meth , fig.width = 6}
split = kmeans(mat_states[, 40:60], centers = 2)$cluster
meth_col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
EnrichedHeatmap(mat_states, name = "states", col = states_col, cluster_rows = TRUE, split = split,
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2)))) +
EnrichedHeatmap(mat_meth, name = "meth", col = meth_col_fun,
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2)))) +
Heatmap(log2(expr[names(tss_chr1), "E003"] + 1), name = "expr", 
	show_row_names = FALSE, width = unit(5, "mm"))
```

With corresponding to more epigenomic signals, we can see promoters with active TSS states are always
associated with low methylation and high gene expression while promoters annotated as bivalent states
are generally more lowly expressed.

## Bivalent TSS

It is believed that in embryonic stem cell, there are a huge number of genes
showing bivalency states on promoters. They are called bivalent states
because both H3K4me3 which is a histone mark for active transcription and H3K27me3
which is for repressive transcription exist. The loss/gain of active histone
mark or repressive histone mark which transite into mature cells is essential for tissue development and differentiation.

To demonstate this, we anlaysis such transitions for one mature tissue (lung tissue,
roadmap sample id: E096).

```{r}
states_bed = fread("zcat ~/EnrichedHeatmap_test/E096_15_coreMarks_mnemonics.bed.gz")
states_lung = GRanges(seqnames = states_bed[[1]], ranges = IRanges(states_bed[[2]] + 1, states_bed[[3]]), 
	states = states_bed[[4]])
states_lung$states_simplified = factor(map[states_lung$states], levels = states_name)
```

ChromHMM is applied with 200bp window, thus we split whole genome by 200bp window and assign corresponding
states to each window, both for ESC and lung.

```{r}
window = makeWindows(states, w = 200)
mtch = as.matrix(findOverlaps(window, states))
window$ESC_states = states$states_simplified[mtch[, 2]]
mtch = as.matrix(findOverlaps(window, states_lung))
window$lung_states = states_lung$states_simplified[mtch[, 2]]
```

We only use the windows which are annotated with `TssBivalent` state either in ESC or in lung.

```{r}
window = window[window$ESC_states == "TssBivalent" | window$lung_states == "TssBivalent"]
```

We first make a Chord diagram to show how the transsition happens.

```{r}
transition_mat = table(mcols(window)[, c("ESC_states", "lung_states")]) * 200
class(transition_mat) = "matrix"
transition_mat = transition_mat[states_name, states_name]
transition_mat
```

The values in the transition matrix mean how many base pairs change from
one chromatin state to the other state.

Rows and columns are from different cells, we add prefix to row and column names.

```{r}
rownames(transition_mat) = paste0("ESC_", rownames(transition_mat))
colnames(transition_mat) = paste0("lung_", colnames(transition_mat))
```

Now we make the chord diagram.

```{r tssbiv_chord_diagram, fig.width = 10, fig.height = 6}
circos.par(gap.after = c(rep(1, n_states - 1), 5, rep(1, n_states - 1), 5))
grid.col = c(states_col, states_col)
names(grid.col) = c(rownames(transition_mat), colnames(transition_mat))
chordDiagram(transition_mat, grid.col = grid.col, annotationTrack = c("grid", "axis"),
	directional = TRUE)
circos.clear()
text(0.5, -1, "ESC")
text(0.5, 1, "lung")
legend("left", pch = 15, col = states_col, legend = names(states_col))
```

As we can see from the Chord diagram, a lot of regions with bivalent tss states have been
tansite to active or repressive states in lung.

To get more deep of how the transistion looks like, we can make enriched heatmaps to see
the associations between different epigenomic signals.

Since `TssBivalent` states basically are states for TSS related regions,
we only keep genes for which in 1kb upstream and downstream of TSS there must be a window with TssBivalent state.
Again, we only use chromosome 1 as demonstration.

```{r}
mat_bivtss = normalizeToMatrix(states[states$states_simplified == "TssBivalent"], tss_chr1)
l = rowSums(mat_bivtss[, 40:60]) > 0 # 1kb upstream and downstream
tss_biv = tss_chr1[l]
tss_biv
```

We normalize chromatin states and methylation to `tss_biv`.

```{r}
mat_states_ESC = normalizeToMatrix(states, tss_biv, value_column = "states_simplified")
mat_states_lung = normalizeToMatrix(states_lung, tss_biv, value_column = "states_simplified")

mat_meth_ESC = normalizeToMatrix(meth, tss_biv, value_column = "E003", mean_mode = "absolute",
	smooth = TRUE)
mat_meth_lung = normalizeToMatrix(meth, tss_biv, value_column = "E096", mean_mode = "absolute",
	smooth = TRUE)
```

Normally, methylation changes happen at the border of low methylation region. To enhance the effect
of methylation change, we directly visualize the methyaltion difference.

```{r}
mat_meth_diff = mat_meth_ESC - mat_meth_lung
meth_diff_col_fun = colorRamp2(c(-0.25, 0, 0.25), c("#3794bf", "#FFFFFF", "#df8640"))
```

Now we construct the heatmap list.

```{r}
ht_list = EnrichedHeatmap(mat_states_ESC, name = "states_ESC", col = states_col,
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2), ylim = c(0, 1))),
	column_title = "States ESC") +
EnrichedHeatmap(mat_states_lung, name = "states_lung", col = states_col,
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2), ylim = c(0, 1))),
	show_heatmap_legend = FALSE, column_title = "States lung")

ht_list = ht_list + EnrichedHeatmap(mat_meth_ESC, name = "meth_ESC", col = meth_col_fun,
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2), ylim = c(0, 1))),
	column_title = "Meth ESC") +
EnrichedHeatmap(mat_meth_lung, name = "meth_lung", col = meth_col_fun,
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2), 
		ylim = c(0, 1))),
	show_heatmap_legend = FALSE, column_title = "Meth lung") +
EnrichedHeatmap(mat_meth_diff, name = "meth_diff", col = meth_diff_col_fun,
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2, pos_col = "#df8640", neg_col = "#3794bf"))),
	column_title = "Meth ESC - lung")
```

Actually we can discretize the numeric `mat_meth_diff`. Here we assign `hyper` to windows with methylation 
difference larger than 0.2 and `hypo` to windows with methylation difference less than -0.2. Here `discretize()` converts
a matrix with continuous signals to categorical signals by prividing a list of intervals.

```{r}
mat_meth_diff_discrete = discretize(mat_meth_diff,
	rule = list(
		"hypo" = c(-Inf, -0.2),
		"hyper" = c(0.2, Inf)
	)
)
mat_meth_diff_discrete
```

We continue to add more heatmaps.


```{r}
ht_list = ht_list + EnrichedHeatmap(mat_meth_diff_discrete, name = "meth_diff_discrete", 
	col = c("hyper" = "#df8640", hypo = "#3794bf"),
	top_annotation = HeatmapAnnotation(enrich = anno_enriched(gp = gpar(lty = 1:2))))
```

Finally the heatmap for gene expression.

```{r}
e = log2(expr[names(tss_biv), c("E003", "E096")] + 1)
ht_list = ht_list + Heatmap(e, name = "expr", 
	show_row_names = FALSE, width = unit(10, "mm"), cluster_columns = FALSE)
```

The row ordering for all heatmaps is from hierarchical clustering on the merged matrix from states in ESC and lung, 1kb upstream and downstream of TSS.

```{r categorical_with_lung, fig.width = 12}
row_order = hclust(dist(cbind(mat_states_ESC[, 40:60], mat_states_lung[, 40:60])))$order
split = ifelse(e[, "E096"] > e[, "E003"], "activation", "repression")
draw(ht_list, row_order = row_order, split = split)
```

## Other examples

There are some other examples where the genomic signals are categorical.

1. repeats where different repeat family can be different categories
2. genome segmentation based on methylation. E.g. hign methylated regions (HMRs),
   partially methylated regions (PMDs), low methylated regions (LMRs), unmethylated regions (UMRs)
3. for gene-related regions, genic annotation is also categorical.

## Session Info

```{r}
sessionInfo()
```

