Exporting BED file
Examing GOPHER’s output in the UCSC genome browser
By clicking on Export > Save BED filed as ...
, GOPHER outputs a series of BED files.
Here are the first three lines of a typical file:
chr11 3814753 3815000 NUP98
chr11 113929420 113929670 ZBTB16
chr17 66507972 66508222 PRKAR1A
Each line represents one region for which we would like to generate a probe. The regions are located on the edges of the restriction fragments chosen by the user with GOPHER.
One way to validate the results of this analysis are to compare the chosen fragments with the locations of restriction enzyme cutting sites in the genome. Users can generate a BED file for any enzyme and load both BED files as custom tracks to UCSC. Following this,the user can copy URLs from the GOPHER browser itself and visualize the viewpoints and the fragments they contain together with the information from the BED files.
There are many ways to generate BED files representing restriction sites. We will show one method that uses several Bioconductor libraries (see http://bioconductor.org/ for information on how to install Bioconductor and the required packages). The following code extracts all DpnII sites in the GRCh37 (also known as hg19) build of the human genome. This sites have a zero overhang.
library(HiTC)
library(rtracklayer)
library(BSgenome.Hsapiens.UCSC.hg19)
human_chr <- seqlevels(BSgenome.Hsapiens.UCSC.hg19)[1:25]
resFrag <- getRestrictionFragmentsPerChromosome(resSite="GATC", chromosomes=human_chr, overhangs5=0, genomePack="BSgenome.Hsapiens.UCSC.hg19")
allRF <- do.call("c",resFrag)
names(allRF) <- unlist(sapply(resFrag, function(x){paste0("HIC_", seqlevels(x), "_", 1:length(x))}))
export(allRF, format="bed", con="DpnII_resfrag_hg19.bed")
The BED file generated by this script will contain lines that represent restriction fragments located between individual cutting sites. Note that the BED format is zero-based and half closed (see https://genome.ucsc.edu/FAQ/FAQformat.html#format1).
$ head DpnII_resfrag_hg19.bed
chr1 0 11159 HIC_chr1_1 0 +
chr1 11159 12410 HIC_chr1_2 0 +
chr1 12410 12460 HIC_chr1_3 0 +
chr1 12460 12685 HIC_chr1_4 0 +
chr1 12685 12828 HIC_chr1_5 0 +
chr1 12828 13314 HIC_chr1_6 0 +
chr1 13314 13419 HIC_chr1_7 0 +
chr1 13419 13565 HIC_chr1_8 0 +
chr1 13565 13697 HIC_chr1_9 0 +
chr1 13697 13914 HIC_chr1_10 0 +
We will add the following line to the top of this file in order to display the fragments in blue for better visibility.
track name=spacer description="DpnII sites" color=0,0,255,
This can be done with sed
as follows.
sed -i '1i track name=spacer description="DpnII sites" color=0,0,255,' DpnII_resfrag_hg19.bed
We remove all mitochondrial fragments, which are not relevant for CHC (one of the entries also seems to cause an import error at the UCSC site).:
sed -i '/^chrM/d' DpnII_resfrag_hg19.bed
Note that the fragments above tile the entire genome. We would like to extract a list of fragments that are near to transcription start sites and then to extract the borders of these fragments, since these are the regions that are enriched for capture Hi-C. To do so, we will first extract a list of transcription start sites from gencode (https://www.gencodegenes.org/).
We will use standard unix tools as well as BEDtools (http://bedtools.readthedocs.io/en/latest/).
Our gene data file will be taken from gencode release 19 (GRCh37/hg19), the comprehensive gene annotation file (GTF): gencode.v19.annotation.gtf_withproteinids
.
We first filter for functional, non-mitochondrial transcripts.
grep -v '^#' gencode.v19.annotation.gtf |
awk '/\t(HAVANA|ENSEMBL)\t(transcript)\t/ {print}'|
grep -w "gene_status \"KNOWN\"" |
grep -w "transcript_status \"KNOWN\"" |
grep -v pseudogene |
grep -v nonsense_mediated_decay |
grep -v retained_intron |
grep -v chrM >
gencode.v19.annotation.filtered_transcript.gtf
Now, we extract the 50 nucleotides upstream of each TSS.
cat gencode.v19.annotation.filtered_transcript.gtf |
awk '{if ($7 =="-") {print $1"\t"$5"\t"$5+50"\t"$10"|"$12"\t0\t"$7} \
else {print $1"\t"$4-50"\t"$4"\t"$10"|"$12"\t0\t"$7}}' |
sed -e 's/"//g' -e 's/;//g' |
grep chr |
bedtools sort -i stdin > gencode.v19.annotation.filtered_tss_50bp_upstream.bed
This produces a file with the 50 nucleotide segments that have an overlap with the TSS.
$ head gencode.v19.annotation.filtered_tss_50bp_upstream.bed
chr1 36073 36123 ENSG00000237613.2|ENST00000461467.1 0 -
chr1 36081 36131 ENSG00000237613.2|ENST00000417324.1 0 -
chr1 69041 69091 ENSG00000186092.4|ENST00000335137.3 0 +
For instance, the first entry is derived from ENSG00000237613.2 on the negative strand of chromosome 1 at positions 34554-36081, and overlaps that entry between 36073 and 36081.
We now intersect the DpnII fragments with the 50 nucleotide TSS fragments, and filter for sizes between 146 and 20kb.
intersectBed -wa -wb -a DpnII_resfrag_hg19.bed \
-b ../annotation/gencode.v19.annotation.filtered_tss_50bp_upstream.bed |
awk '{if ($3-$2 > 146 && $3-$2 < 20000) print $0}'|
bedtools merge -d 1 -i stdin |
intersectBed -wa -wb -a stdin \
-b ../annotation/gencode.v19.annotation.filtered_tss_50bp_upstream.bed |
awk '{ if ($9=="-") {print $1"\t"$3-240"\t"$3"\t"$7"\t"$8"\t"$9} \
else {print $1"\t"$2"\t"$2+240"\t"$7"\t"$8"\t"$9}}'|
sed 's/|/\t/g' |
cut -f1-4,6-7|
sort -k1,1 -k2,2n |
uniq -> all_known_promoters_test_probe_design.bed
This produces a BED file
chr1 36431 36671 ENSG00000237613.2 0 -
chr1 68902 69142 ENSG00000186092.4 0 +
chr1 139632 139872 ENSG00000237683.5 0 -
chr1 158665 158905 ENSG00000222623.1 0 -
One now uses bedtools to make windows around the initial fragments of 120 bp.
bedtools makewindows \
-b all_known_promoters_test_probe_design.bed -w 120 -i src |
bedtools sort -i stdin > all_known_promoters_windows.sorted.bed
This produces a file of overlapping, 120 nt windows.
chr1 36431 36551 ENSG00000237613.2
chr1 36551 36671 ENSG00000237613.2
chr1 68902 69022 ENSG00000186092.4
chr1 69022 69142 ENSG00000186092.4
chr1 139632 139752 ENSG00000237683.5
Optionally, one can also use the bedtools nuc tool to remove initial fragments that contain repeats and also filter for GC content between 25% and 65%.
bedtools makewindows \
-b all_known_promoters_test_probe_design.bed -w 120 -i src |
bedtools nuc -fi hg19_nh.fa.masked -bed stdin -C -seq |
awk '{if ($6 >= 0.25 && $6 <= 0.65 && $13 ==120 ) print $0}' |
grep -v "NNN" |
bedtools sort -i stdin > all_known_promoters_windows_filtered.sorted.bed
Finally, we grab windows passing above criteria closest to the edge of DpnII fragment end.
intersectBed -wa -u -a DpnII_resfrag_hg19.bed \
-b all_known_promoters_windows.sorted.bed |
bedtools sort -i stdin |
closestBed -t first -a stdin \
-b all_known_promoters_windows.sorted.bed |
cut -f1-6 > Promoter_Capture_HiC_Gencode_V19_DpnII.bed
This produces the BED file that we can view in UCSC.
chr1 35641 36671 HIC_chr1_56 0 +
chr1 68902 69276 HIC_chr1_144 0 +
chr1 136170 139872 HIC_chr1_306 0 +
chr1 157269 158905 HIC_chr1_367 0 +
Now go to the My Data > Custom Tracks
page at the UCSC Genome Browser, click on the Custom Tracks button, and upload the targeted DpnII BED file.
Now you can view the DpnII sites, the chosen fragments,and the highlighted viewpoint regions (use the copy URL to clipboard to obtain a URL for a region of interest).
An analogous script can be used to check results for other restriction enzymes.