Exporting BED file

Examing GOPHER’s output in the UCSC genome browser

By clicking on Export > Save BED filed as ..., GOPHER outputs a series of BED files. Here are the first three lines of a typical file:

chr11  3814753    3815000    NUP98
chr11  113929420  113929670  ZBTB16
chr17  66507972   66508222   PRKAR1A

Each line represents one region for which we would like to generate a probe. The regions are located on the edges of the restriction fragments chosen by the user with GOPHER.

One way to validate the results of this analysis are to compare the chosen fragments with the locations of restriction enzyme cutting sites in the genome. Users can generate a BED file for any enzyme and load both BED files as custom tracks to UCSC. Following this,the user can copy URLs from the GOPHER browser itself and visualize the viewpoints and the fragments they contain together with the information from the BED files.

There are many ways to generate BED files representing restriction sites. We will show one method that uses several Bioconductor libraries (see http://bioconductor.org/ for information on how to install Bioconductor and the required packages). The following code extracts all DpnII sites in the GRCh37 (also known as hg19) build of the human genome. This sites have a zero overhang.

library(HiTC)
library(rtracklayer)
library(BSgenome.Hsapiens.UCSC.hg19)

human_chr <- seqlevels(BSgenome.Hsapiens.UCSC.hg19)[1:25]
resFrag <- getRestrictionFragmentsPerChromosome(resSite="GATC", chromosomes=human_chr, overhangs5=0,    genomePack="BSgenome.Hsapiens.UCSC.hg19")
allRF <- do.call("c",resFrag)
names(allRF) <- unlist(sapply(resFrag, function(x){paste0("HIC_", seqlevels(x), "_", 1:length(x))}))
export(allRF, format="bed", con="DpnII_resfrag_hg19.bed")

The BED file generated by this script will contain lines that represent restriction fragments located between individual cutting sites. Note that the BED format is zero-based and half closed (see https://genome.ucsc.edu/FAQ/FAQformat.html#format1).

$ head DpnII_resfrag_hg19.bed
chr1    0       11159   HIC_chr1_1      0       +
chr1    11159   12410   HIC_chr1_2      0       +
chr1    12410   12460   HIC_chr1_3      0       +
chr1    12460   12685   HIC_chr1_4      0       +
chr1    12685   12828   HIC_chr1_5      0       +
chr1    12828   13314   HIC_chr1_6      0       +
chr1    13314   13419   HIC_chr1_7      0       +
chr1    13419   13565   HIC_chr1_8      0       +
chr1    13565   13697   HIC_chr1_9      0       +
chr1    13697   13914   HIC_chr1_10     0       +

We will add the following line to the top of this file in order to display the fragments in blue for better visibility.

track name=spacer description="DpnII sites" color=0,0,255,

This can be done with sed as follows.

sed  -i '1i track name=spacer description="DpnII sites" color=0,0,255,' DpnII_resfrag_hg19.bed

We remove all mitochondrial fragments, which are not relevant for CHC (one of the entries also seems to cause an import error at the UCSC site).:

sed -i '/^chrM/d' DpnII_resfrag_hg19.bed

Note that the fragments above tile the entire genome. We would like to extract a list of fragments that are near to transcription start sites and then to extract the borders of these fragments, since these are the regions that are enriched for capture Hi-C. To do so, we will first extract a list of transcription start sites from gencode (https://www.gencodegenes.org/).

We will use standard unix tools as well as BEDtools (http://bedtools.readthedocs.io/en/latest/).

Our gene data file will be taken from gencode release 19 (GRCh37/hg19), the comprehensive gene annotation file (GTF): gencode.v19.annotation.gtf_withproteinids. We first filter for functional, non-mitochondrial transcripts.

grep -v '^#' gencode.v19.annotation.gtf |
  awk '/\t(HAVANA|ENSEMBL)\t(transcript)\t/ {print}'|
  grep -w "gene_status \"KNOWN\"" |
  grep -w "transcript_status \"KNOWN\"" |
  grep -v pseudogene |
  grep -v nonsense_mediated_decay |
  grep -v retained_intron  |
  grep -v chrM >
  gencode.v19.annotation.filtered_transcript.gtf

Now, we extract the 50 nucleotides upstream of each TSS.

cat gencode.v19.annotation.filtered_transcript.gtf |
awk '{if ($7 =="-") {print $1"\t"$5"\t"$5+50"\t"$10"|"$12"\t0\t"$7} \
  else {print $1"\t"$4-50"\t"$4"\t"$10"|"$12"\t0\t"$7}}' |
sed -e 's/"//g' -e 's/;//g' |
grep chr |
bedtools sort -i stdin > gencode.v19.annotation.filtered_tss_50bp_upstream.bed

This produces a file with the 50 nucleotide segments that have an overlap with the TSS.

$ head gencode.v19.annotation.filtered_tss_50bp_upstream.bed
chr1        36073   36123   ENSG00000237613.2|ENST00000461467.1     0       -
chr1        36081   36131   ENSG00000237613.2|ENST00000417324.1     0       -
chr1        69041   69091   ENSG00000186092.4|ENST00000335137.3     0       +

For instance, the first entry is derived from ENSG00000237613.2 on the negative strand of chromosome 1 at positions 34554-36081, and overlaps that entry between 36073 and 36081.

We now intersect the DpnII fragments with the 50 nucleotide TSS fragments, and filter for sizes between 146 and 20kb.

intersectBed -wa -wb -a DpnII_resfrag_hg19.bed \
  -b ../annotation/gencode.v19.annotation.filtered_tss_50bp_upstream.bed |
  awk '{if ($3-$2 > 146 && $3-$2 < 20000)  print $0}'|
  bedtools merge -d 1 -i stdin |
  intersectBed -wa -wb -a stdin \
  -b ../annotation/gencode.v19.annotation.filtered_tss_50bp_upstream.bed |
  awk '{ if ($9=="-") {print $1"\t"$3-240"\t"$3"\t"$7"\t"$8"\t"$9} \
    else {print $1"\t"$2"\t"$2+240"\t"$7"\t"$8"\t"$9}}'|
  sed 's/|/\t/g' |
  cut -f1-4,6-7|
  sort -k1,1 -k2,2n |
  uniq -> all_known_promoters_test_probe_design.bed

  This produces a BED file

  chr1        36431   36671   ENSG00000237613.2       0       -
  chr1        68902   69142   ENSG00000186092.4       0       +
  chr1        139632  139872  ENSG00000237683.5       0       -
  chr1        158665  158905  ENSG00000222623.1       0       -

One now uses bedtools to make windows around the initial fragments of 120 bp.

bedtools makewindows \
-b all_known_promoters_test_probe_design.bed -w 120 -i src |
bedtools sort -i stdin > all_known_promoters_windows.sorted.bed

This produces a file of overlapping, 120 nt windows.

chr1  36431   36551   ENSG00000237613.2
chr1  36551   36671   ENSG00000237613.2
chr1  68902   69022   ENSG00000186092.4
chr1  69022   69142   ENSG00000186092.4
chr1  139632  139752  ENSG00000237683.5

Optionally, one can also use the bedtools nuc tool to remove initial fragments that contain repeats and also filter for GC content between 25% and 65%.

bedtools makewindows \
-b all_known_promoters_test_probe_design.bed -w 120  -i src |
bedtools nuc -fi hg19_nh.fa.masked -bed stdin -C -seq |
awk '{if ($6 >= 0.25 && $6 <= 0.65 && $13 ==120 ) print $0}' |
grep -v "NNN" |
bedtools sort -i stdin > all_known_promoters_windows_filtered.sorted.bed

Finally, we grab windows passing above criteria closest to the edge of DpnII fragment end.

intersectBed -wa -u -a DpnII_resfrag_hg19.bed \
-b all_known_promoters_windows.sorted.bed  |
bedtools sort -i stdin |
closestBed -t first -a stdin \
-b all_known_promoters_windows.sorted.bed |
cut -f1-6 > Promoter_Capture_HiC_Gencode_V19_DpnII.bed

This produces the BED file that we can view in UCSC.

chr1  35641   36671   HIC_chr1_56     0       +
chr1  68902   69276   HIC_chr1_144    0       +
chr1  136170  139872  HIC_chr1_306    0       +
chr1  157269  158905  HIC_chr1_367    0       +

Now go to the My Data > Custom Tracks page at the UCSC Genome Browser, click on the Custom Tracks button, and upload the targeted DpnII BED file. Now you can view the DpnII sites, the chosen fragments,and the highlighted viewpoint regions (use the copy URL to clipboard to obtain a URL for a region of interest). An analogous script can be used to check results for other restriction enzymes.