Tutorial
This tutorial shows how to use Diachromatic for processing and quality control of Capture Hi-C reads. Before proceeding with the tutorial, please follow the program setup instructions to build Diachromatic and get bowtie2 as well as the hg19 prebuilt index.
Test dataset
To get the test data, download the files from the FTP server, for example with wget:
wget ftp://ftp.jax.org/robinp/Diachromatic/test_dataset/test_1.fastq
wget ftp://ftp.jax.org/robinp/Diachromatic/test_dataset/test_2.fastq
wget ftp://ftp.jax.org/robinp/Diachromatic/test_dataset/hg19_HinDIII_DigestedGenome.txt.gz
Then decompress the digest file:
gunzip hg19_HinDIII_DigestedGenome.txt.gz
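Before moving on, it can be worth confirming that both FASTQ files downloaded completely and contain the same number of reads. The following is a minimal sketch and not part of Diachromatic: the helper names count_fastq_reads and check_paired_fastq are hypothetical, and the count assumes uncompressed FASTQ with exactly four lines per record.

```shell
# Hypothetical helper (not part of Diachromatic): count the reads in an
# uncompressed FASTQ file, assuming exactly four lines per record.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Hypothetical helper: succeed only if both files hold the same number of reads.
check_paired_fastq() {
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "read count mismatch: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
    echo "$n1 read pairs"
}
```

With the test dataset, this would be called as check_paired_fastq test_1.fastq test_2.fastq.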
Truncation
The first step of processing raw FASTQ files with Diachromatic is to recognize and truncate reads that contain filled-in ligation junctions, that is, reads that span the junction of a chimeric Capture Hi-C (CHC) fragment. This is performed with the truncate subcommand:
$ java -jar Diachromatic.jar truncate \
-q test_1.fastq \
-r test_2.fastq \
-e HinDIII \
-x prefix \
-o outdir
The command will create files called <prefix>.truncated_R1.fastq.gz and <prefix>.truncated_R2.fastq.gz in a subdirectory called outdir (which will be created if not already present).
See Truncation of chimeric reads for details.
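A quick way to confirm that the truncation step finished is to check that both output files exist and are intact gzip archives. This is a minimal sketch under stated assumptions: check_truncate_outputs is a hypothetical helper, not a Diachromatic command, and it relies only on the file names given above and on gzip -t to test archive integrity.

```shell
# Hypothetical helper (not part of Diachromatic): verify that the truncate
# step wrote both expected files and that each is a readable gzip archive.
check_truncate_outputs() {
    prefix=$1
    outdir=$2
    for mate in R1 R2; do
        f="$outdir/$prefix.truncated_$mate.fastq.gz"
        # -s: file exists and is not empty; gzip -t: archive decompresses cleanly.
        if [ ! -s "$f" ] || ! gzip -t "$f" 2>/dev/null; then
            echo "missing, empty or corrupt: $f" >&2
            return 1
        fi
    done
    echo "truncate outputs look intact in $outdir"
}
```

For the command above, the call would be check_truncate_outputs prefix outdir.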
Mapping
The second step of the pipeline is to map the truncated read pairs to the target genome. You also need a file that shows the locations of restriction digests across the genome. This file is included in the test dataset. You can use GOPHER to create probes and the digest file. Diachromatic uses bowtie2 to perform the mapping, and then creates a BAM file containing the valid read pairs.
Use the align subcommand to run the alignment step:
$ java -jar Diachromatic.jar align \
-b /usr/bin/bowtie2 \
-i /path/to/bowtie2index/hg19 \
-q outdir/prefix.truncated_R1.fastq.gz \
-r outdir/prefix.truncated_R2.fastq.gz \
-d hg19_HinDIII_DigestedGenome.txt \
-x prefix \
-o outdir
Note that the bowtie2 index can be downloaded from the Bowtie 2 website. The indices are available as zip archives, e.g., GRCh37.zip. Once you unzip it, the resulting folder will contain multiple files (GRCh37.1.bt2, GRCh37.2.bt2, GRCh37.3.bt2, GRCh37.4.bt2, GRCh37.rev.1.bt2, GRCh37.rev.2.bt2). You need to pass the path to these files without the file suffix, i.e., the basename they all share. Assuming the directory is located at /some/path/GRCh37, you would therefore pass -i /some/path/GRCh37/GRCh37.
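The index basename can also be derived from the directory rather than typed by hand. The sketch below is an illustration only: bowtie2_index_basename is a hypothetical helper, not part of bowtie2 or Diachromatic, and it assumes the directory holds a single index whose files end in .bt2 as in the listing above.

```shell
# Hypothetical helper (not part of Diachromatic or bowtie2): print the value
# to pass to -i for a directory that holds one unpacked bowtie2 index.
bowtie2_index_basename() {
    dir=${1%/}
    for f in "$dir"/*.rev.1.bt2; do
        # If the glob matched nothing, $f is still the literal pattern.
        [ -e "$f" ] || { echo "no bowtie2 index found in $dir" >&2; return 1; }
        # Strip the .rev.1.bt2 suffix to obtain the shared basename.
        echo "${f%.rev.1.bt2}"
        return 0
    done
}
```

For the example directory, bowtie2_index_basename /some/path/GRCh37 would print /some/path/GRCh37/GRCh37, which is the value expected by -i.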
The align command above will create a file called <prefix>.valid_pairs.aligned.bam in the outdir subdirectory.
See Mapping and categorization of Hi-C reads for details.
Counting
Use the following command to run the counting step:
$ java -jar Diachromatic.jar count \
-v outdir/prefix.valid_pairs.aligned.bam \
-d hg19_HinDIII_DigestedGenome.txt \
-x prefix \
-o outdir
The count subcommand produces output files intended for downstream analysis, including <prefix>.count.stats.txt and <prefix>.frag.sizes.counts.script.R. See Counting of valid read pairs between pairs of restriction fragments for details.
Summarize
To run the summarize command on the statistics files written by the truncate, align, and count steps, use the following command:
$ java -jar Diachromatic.jar summarize \
-t outdir/prefix.truncation.stats.txt \
-a outdir/prefix.align.stats.txt \
-c outdir/prefix.count.stats.txt \
-x prefix \
-o outdir
The summarize subcommand generates an HTML file called outdir/prefix.summary.stats.html.
For comparison, the summary results file for the test dataset can also be downloaded from the FTP server, for example with wget:
wget ftp://ftp.jax.org/robinp/Diachromatic/test_dataset/test_dataset.summary.stats.html
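The four steps of this tutorial can be chained in a small wrapper. The following is a sketch under stated assumptions rather than a script shipped with Diachromatic: the run and run_tutorial functions, the DRY_RUN switch, and the JAR, BOWTIE2, INDEX and DIGEST variables are all hypothetical, while the Diachromatic flags are the ones used above. It also assumes that each step reads its inputs from outdir, where the previous step wrote them.

```shell
# Sketch of a wrapper around the four tutorial steps. The function names, the
# DRY_RUN switch and the JAR/BOWTIE2/INDEX/DIGEST variables are hypothetical;
# the Diachromatic flags are the ones used in this tutorial.
JAR=${JAR:-Diachromatic.jar}
BOWTIE2=${BOWTIE2:-/usr/bin/bowtie2}
INDEX=${INDEX:-/path/to/bowtie2index/hg19}
DIGEST=${DIGEST:-hg19_HinDIII_DigestedGenome.txt}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run truncate, align, count and summarize in order; stop at the first failure.
run_tutorial() {
    prefix=$1
    outdir=$2
    run java -jar "$JAR" truncate -q test_1.fastq -r test_2.fastq \
        -e HinDIII -x "$prefix" -o "$outdir" || return 1
    run java -jar "$JAR" align -b "$BOWTIE2" -i "$INDEX" \
        -q "$outdir/$prefix.truncated_R1.fastq.gz" \
        -r "$outdir/$prefix.truncated_R2.fastq.gz" \
        -d "$DIGEST" -x "$prefix" -o "$outdir" || return 1
    run java -jar "$JAR" count -v "$outdir/$prefix.valid_pairs.aligned.bam" \
        -d "$DIGEST" -x "$prefix" -o "$outdir" || return 1
    run java -jar "$JAR" summarize \
        -t "$outdir/$prefix.truncation.stats.txt" \
        -a "$outdir/$prefix.align.stats.txt" \
        -c "$outdir/$prefix.count.stats.txt" \
        -x "$prefix" -o "$outdir" || return 1
}
```

Setting DRY_RUN=1 before calling run_tutorial prefix outdir prints the four commands without executing them, which makes it easy to review the paths before starting a real run.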