---
title: "Clustering of Hi-C contact maps"
author: "Shubham Chaturvedi, Pierre Neuvial, Nathalie Vialaneix"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: yes
vignette: >
  %\VignetteIndexEntry{Clustering of Hi-C contact maps}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r skipNoHITC}
# IMPORTANT: this vignette can not be created if HiTC is not installed
if (!require("HiTC", quietly = TRUE)) {
  knitr::opts_chunk$set(eval = FALSE)
}
```

# Introduction

Hi-C is a sequencing-based molecular assay designed to measure intra and 
inter-chromosomal interactions between the DNA molecule.  In particular, the 
identification of Topologically-Associated Domains (TADs), that is, of regions 
of the genome in which physical interactions are frequent, provides insight 
into the three-dimensional organization of a genome [2]. 

Hi-C data are in the form of two-dimensional *contact maps*, *i.e.*, matrices 
whose $i,j$ entry quantifies the intensity of the physical interaction between 
two genome regions $i$ and $j$ at the DNA level. In this vignette, we 
demonstrate the use of `adjclust::hicClust` to perform adjacency-constrained 
hierarchical agglomerative clustering (HAC) of Hi-C contact maps. The output of 
this function is a dendrogram, which can be cut to identify TADs. The algorithm 
used for adjacency-constrained (HAC) is described in [3,4].

```{r loadLib, message=FALSE}
library("adjclust")
```

# Loading and displaying a sample Hi-C contact map

The data set `hic_imr90_40_XX` is an object of class `HTCexp` which has been 
obtained from the `HiTC` package [4]. It is a contact map corresponding to the 
first 500 x 500 bins on chromosome X vs chromosome X. 


```{r loadData}
load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust"))
```

The script used to create this map can be found by executing the following command:

```{r create-data-script}
system.file("system/create_hic_chrXchrX.R", package="adjclust")
```

Now we have a look at the data.

```{r mapHiC, message=FALSE}
HiTC::mapC(hic_imr90_40_XX)
```

# Using `hicClust`

`hicClust` operates directly on objects of class `HTCexp`

```{r hicClust-HTCexp, message=FALSE}
fit <- hicClust(hic_imr90_40_XX)
```

It is also possible to work on binned data. Below we choose a bin size of 
$5 \times 10^5$:

```{r binning, message=FALSE}
binned <- HiTC::binningC(hic_imr90_40_XX, binsize = 1e5)
fitB <- hicClust(binned)
fitB
```

```{r plotBinned, message=FALSE}
HiTC::mapC(binned)
```

The output is of class `chac`. In particular, it can be plotted as a dendrogram
silently relying on the function `plot.dendrogram`:

```{r dendro}
plot(fitB, mode = "corrected")
```

Moreover, the output contains an element named `merge` which describes the 
successive merges of the clustering, and an element `gains` which gives the 
improvement in the criterion optimized by the clustering at each successive 
merge.

```{r objectDesc}
head(cbind(fitB$merge, fitB$gains))
```

# Other types of input     

Contacts maps can also be stored as objects of class `Matrix::dsCMatrix`, or as
plain text files. These types of input are also accepted as first argument to 
`hicClust`. 

# References

[1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N. (2019). 
Adjacency-constrained hierarchical clustering of a band similarity matrix with 
application to genomics. *Algorithms for Molecular Biology*, **14**, 22.

[2] Dixon J.R., *et al* (2012). Topological domains in mammalian genomes 
identified by analysis of chromatin interactions. *Nature*, **485**(7398), 376.

[3] Randriamihamison N., Vialaneix N., and Neuvial P. (2021). Applicability and
interpretability of Ward's hierarchical agglomerative clustering with or without
contiguity constraints. *Journal of Classification*, **38**, 363–389.

[4] Servant N., *et al* (2012). HiTC: Exploration of High-Throughput 'C' 
experiments. *Bioinformatics*, **28**(21), 2843-2844.

# Session information

```{r session}
sessionInfo()
```