GUI – Data preparation
Overview
Sequence and annotation data for various genome builds for mouse and human can be downloaded and preprocessed via mouse clicks. Lists of target gene symbols can be uploaded from a text file or from the clipboard. Alternatively, a shortcut allows to select all protein-coding genes as capture targets. GOPHER’s functionality is not restricted to transcription start sites. Arbitrary genomic position can be uploaded in BED6 format, for instance GWAS hits.
Choosing a genome build
GOPHER currently supports hg19 (i.e., GRCh37), GRCh38, mm9 and mm10. Choose the appropriate genome build from the pulldown menu.
Click on the Download
button and choose a directory for the genome.
GOPHER will download the corresponding genome file from the UCSC Genome Browser unless it finds the correspding files in the directory, in which case it will merely show the Done mark.
Note that the full download may take many minutes or longer depending on your network bandwidth.
The UCSC Genome files are provided as compressed (gzip) tar archives.
Clicking on the Start
button after Decompress genome will decompress the files if necessary.
If GOPHER finds the unpacked files in the genome directory, it will show the Done mark immediately.
The genome FASTA files (one for each chromosome) need to be indexed.
GOPHER indexes the files to produce .fai
index files that are equivalent to those produced by samtools.
If desired you can use samtools yourself.
If GOPHER finds the .fai
files in the directory, it will show the Done mark immediately.
GOPHER additionally requires a transcript definition file.
It will automatically download the correct file from UCSC if the user clicks on the Download
button after Transcripts.
It is recommended to store the file in the same directory as the genome file.
If GOPHER finds the file in the direcotry (refGene.txt.gz
) it will show the Done mark immediately.
Finally, GOPHER needs to download the alignabilty map from UCSC.
For a given kmer size \(k\), the alignabilty map \(p \mapsto a_k(p)\) indicates how often the kmer starting at position \(p\) occurs within in the entire genome.
GOPHER will automatically download the it correct file from UCSC if the user clicks on the Download
button after Alignabilty map.
Choosing enrichment targets
Following the steps described above, the user specifies the desired enrichment targets.
Clicking on the Enter gene list
button will open a dialog to enter a gene list.
Currently, GOPHER expects a list of valid (HGNC) gene symbols.
For promoter CHC, gene symbols can be uploaded from a text file with the Target Gene List
button.
Use the All protein-coding genes
button to extract all protein-coding gene symbols from the transcript file.
Next, click the Validate
button.
If gene symbols are used that do not occur in the downloaded annotation data, GOPHER will issue a warning and report a list of unmappable symbols that can be used to search for the current correct symbols.
Finally, click the Accept
button.
Alternatively, the All protein-coding genes
button allows promoters of all-protein coding genes to be selected as targets.
GOPHER also accepts a BED file with genomic positions.
For instance, the coordinates of GWAS hits can be uploaded in BED6 format.
Click the Enter BED file
button and select the file.
GOPHER also has two sample gene lists for human and mouse genomes that can be used as enrichment targets.
To access the lists, click on Help > Example gene sets
.
Select human
or mouse
and click Validate
and Accept
.