The Unix Shell

Connecting to clusters (ssh, scp)

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What is a cluster?

  • What HPC resources are available at JAX?

  • How do I connect to the cluster?

  • How do I bring data in and out of the cluster?

Objectives
  • This lesson explains how a computer cluster (also called a supercomputer or high performance computer) differs from a standard desktop or laptop, summarizes the HPC resources at The Jackson Laboratory and the standard file system layout on them, and describes how to connect to these systems and copy data from one computer to another.

What is a cluster?

A typical modern personal computer has 2-8 processing cores, 2-16 GB of RAM, and 1-5 TB of hard drive storage. These resources are generally sufficient for day-to-day work tasks (word processing, spreadsheets, e-mail, web browsing, etc.), but they are typically not enough to act as a multiuser server or to perform complex scientific analysis.

Server hardware is generally more powerful than that in personal computers. A server can have 2-64 processing cores and 1 GB-12 TB of RAM, and generally combines 2-8 TB of local storage with access to network attached storage whose capacity ranges into the petabytes and exabytes.

High performance computing links multiple servers together into high performance computers (also called computer clusters or supercomputers) and serves the computational needs of multiple users through a queue system. Users submit requests, known as jobs, to the queue system. Jobs can have varying priority, and the queue’s scheduling algorithm determines when and where each job runs.
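The ordering idea can be illustrated with a toy model. This is only a sketch: schedule_order is a hypothetical helper, not part of any real scheduler, and real queue systems also weigh requested cores, memory, and run time when deciding where a job runs. Each input line is a priority followed by a job name; higher numbers run first, and jobs with equal priority keep their submission order.

```shell
# Toy model of a priority queue (hypothetical; not the JAX scheduler).
# Input: one job per line, "PRIORITY NAME". Higher priority runs first;
# sort -s is a stable sort, so ties keep their submission order.
schedule_order() {
    sort -s -k1,1nr
}

printf '%s\n' '2 align_reads' '5 urgent_qc' '2 assemble' | schedule_order
# prints:
# 5 urgent_qc
# 2 align_reads
# 2 assemble
```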

HPC Resources at JAX

The Information Technology (IT) platforms at JAX enable computationally intensive, high-throughput, data-rich research to be conducted, with the goal of developing personalized genomic medicine methodologies and practices. The IT team supports our infrastructure and technology platforms, and includes several dedicated Research IT personnel. Key IT platforms are summarized below:

High-Performance Computing (HPC) Platform:

Helix (Farmington, CT): Helix includes 48 HP Proliant SL Series servers of 2.2GHz-16Core-256GB, and 48 2.5GHz-20Core-256GB, combined into a 1,728-core high-performance cluster. The platform also includes a high-memory HP Proliant DL900 Series server of 2.2GHz-48Core-2TB.
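The quoted Helix core count can be checked with shell arithmetic. The sketch below multiplies out the two groups of SL Series servers; the variable name is only illustrative, and the total matches the 1,728 figure without counting the high-memory server.

```shell
# 48 servers x 16 cores plus 48 servers x 20 cores
helix_cores=$(( 48 * 16 + 48 * 20 ))
echo "$helix_cores"   # prints 1728
```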

Cadillac (Bar Harbor, ME): Cadillac includes 32 HP Proliant SL Series servers of 2.2GHz-16Core-256GB, and 10 2.3GHz-28Core-512GB, combined into a 960-core high-performance cluster. The platform also includes a GPU node with 4 K20x GPGPUs and 2 high-memory HP Proliant DL580 Series servers of 2.2GHz-20Core-1TB.

The JAX HPC infrastructure enables analysis and inference in genomic medicine projects with a wide range of complexity and computational intensity, and the high-memory nodes on both Helix and Cadillac allow for memory-intensive genomic computations, such as de novo transcriptome assembly. The operating system deployed for all servers on these platforms is CentOS. These resources will be expanded as the needs grow.

Storage Platform:

HPC data storage at JAX is supported using two Isilon scale-out NAS systems, with 2PB of storage available on both Helix and Cadillac for the processing and analysis of scientific data. A geographically dispersed archiving system is being implemented for securing raw instrument data. These resources will be expanded as the needs grow.

Applications Platform:

Our applications platform supports all of the standard software necessary for investigators to process and analyze their data, as well as providing the entire community with the basic blocks for them to build their own custom workflows. Database development and deployment is supported by MySQL and other database management systems.

Network Infrastructure:

Our network platform supports both scientific and enterprise business systems, using 40Gb core switches in our server farm that deliver at least 1Gb to user devices. The environment includes wired and wireless network service and a redundant voice over IP (VOIP) system, and is protected by application firewalls. The network infrastructure for the HPC environment consists of a 40Gb backbone with 10Gb to each server in the cluster. Internet service is delivered by a commercial service provider that can scale beyond 1Gb as demand for data transfer increases.

SSH

Secure Shell (SSH) is a network protocol for connecting securely to computers and devices over an unsecured network. You’ve already been using it today to connect from your personal computer to helix-dev.jax.org.

Exercise: SSH to HELIX

From your bash prompt, use SSH to connect to helix.jax.org, and from there connect to cadillac.jax.org.

ssh ssander@helix.jax.org
ssh ssander@cadillac.jax.org
exit
exit

How many active jobs are running on Cadillac?
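The two hops can also be combined into one command. This is a sketch that assumes OpenSSH 7.3 or newer on your own computer (the release that added the -J jump-host option) and the same userid on both clusters; two_hop_login is a hypothetical helper name.

```shell
# Hypothetical helper: log in to Cadillac by jumping through Helix.
# -J (ProxyJump) requires OpenSSH 7.3 or newer on the local machine.
two_hop_login() {
    ssh -J "$1@helix.jax.org" "$1@cadillac.jax.org"
}

# Usage (prompts for credentials as usual):
#   two_hop_login ssander
```

With this form a single exit closes the whole session, because only one interactive shell is opened.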

Overview of Common Directories

Now that we can connect to Helix through SSH, let’s take a minute to examine the common directory structure of the system.

/home is where every user’s home directory is kept, and it is accessible by all cluster nodes. When you log in, you should be in your /home/userid directory. Let’s check that by using ssh to connect back to Helix and then running the pwd command we learned earlier.

ssh ssander@helix.jax.org
pwd

Home directories have a quota of 50GB each, meaning that the maximum storage available to a single user’s /home/userid folder is 50GB.

/projects is the primary active storage directory for users and lab groups. It is accessible by all cluster nodes. Each user has a /projects/userid folder with a quota of 5TB, and each lab group has a /projects/PIname-lab folder with a quota of 50TB. Lab groups and services that require more space can work with the IT department on the acquisition and provisioning of additional storage capacity.
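To see how close a folder is to its quota, you can measure it with du. The following is a minimal sketch: usage_kb and percent_of_quota are hypothetical helpers, and the calculation assumes 1 GB is counted as 1024 x 1024 KB, which may differ from how IT measures the quota.

```shell
# usage_kb DIR: print the disk usage of DIR in kilobytes (du -sk is POSIX).
usage_kb() {
    du -sk "$1" | cut -f1
}

# percent_of_quota USED_KB QUOTA_GB: integer percentage of the quota in use.
# Assumes 1 GB = 1024 * 1024 KB.
percent_of_quota() {
    echo $(( $1 * 100 / ($2 * 1024 * 1024) ))
}

# Worked example: 25 GB used (26214400 KB) against the 50GB /home quota.
percent_of_quota 26214400 50   # prints 50
```

On the cluster, percent_of_quota "$(usage_kb /home/userid)" 50 would report on your own home directory; for a /projects/userid folder, the 5TB quota corresponds to 5120 in the second argument.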

/data is a legacy directory that is being retired, with its contents being migrated to /projects. It is accessible by all cluster nodes. It is at full capacity most of the time, which can cause numerous performance-impacting issues. New users and lab groups will not be given a /data/userid or /data/PIname-lab folder. This space did not have a quota.

/fastscratch is a temporary directory pinned to the fastest in-house storage we have available. It is accessible by all cluster nodes. It has a total capacity of 100TB, and is meant to serve as ‘scratch paper’ for computational analysis. Users can specify that their jobs output to /fastscratch, and then copy their important files back into their /home or /projects directories. Bioinformatics software can generate a large number of temporary and intermediate files that are often no longer needed after the analysis completes and can accidentally consume large quantities of storage. Files in the /fastscratch directory that have not been accessed in over 14 days are erased by IT, and IT reserves the right to erase any and all data on /fastscratch (but we will provide appropriate e-mail notification that this is going to happen).
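Before files age out, it can help to list the ones that are at risk. The sketch below uses find to do that; stale_files is a hypothetical helper, and it only approximates the 14-day rule, since how IT measures file age is not specified here.

```shell
# stale_files DIR [DAYS]: list regular files under DIR whose last access
# time is more than DAYS days old (default 14). find's -atime +N ignores
# fractions of a day, so +14 matches files last accessed 15 or more days ago.
stale_files() {
    find "$1" -type f -atime +"${2:-14}"
}

# Usage, assuming your scratch files live under /fastscratch/userid:
#   stale_files /fastscratch/userid
```

Anything the listing shows that you still need should be copied to /home or /projects before it is erased.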

/gt_delivery is the delivery directory for sequence data from JAX’s Genome Technologies group. It is only available on the head nodes of Helix and Cadillac (helix.jax.org and cadillac.jax.org). Data is accessible by group (that is, by PIname-lab groups), and is automatically archived to long-term storage by IT.

SCP

Secure Copy (SCP) is like the cp (copy) command you’ve already learned today, but it uses the SSH network protocol to securely copy data between two computers.

The general form is scp [source] [destination], where [source] and [destination] can each be either local or remote. For a remote file, use the form userid@servername.com:/dir/to/file; for a local file, give the path just as you would with cp.

scp ssander@cadillac.jax.org:/home/ssander/cadillac_file.txt /home/ssander
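Both copy directions follow the same pattern, with the remote side written as userid@servername:/path. The sketch below wraps each direction in a small function; pull_from_cadillac, push_to_helix, and the USERID variable are hypothetical names introduced for illustration, and the paths assume your files sit directly in /home/userid.

```shell
# Hypothetical helpers showing both copy directions. USERID is an assumed
# variable holding your cluster login (defaults to your local user name).
USERID="${USERID:-$(id -un)}"

# Remote -> local: copy a file from your Cadillac home into the current directory.
pull_from_cadillac() {
    scp "$USERID@cadillac.jax.org:/home/$USERID/$1" .
}

# Local -> remote: copy a local file into your Helix home directory.
push_to_helix() {
    scp "$1" "$USERID@helix.jax.org:/home/$USERID"
}

# Usage:
#   pull_from_cadillac cadillac_file.txt
#   push_to_helix notes.txt
```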

Exercise: SCP to HELIX from CADILLAC

SSH to Cadillac, create a file using nano in your /home/userid directory on Cadillac, and then use secure copy to copy that file to your /home/userid directory on Helix.

ssh ssander@cadillac.jax.org
nano myfile_on_cadillac.txt
[type text and save the file]
scp myfile_on_cadillac.txt ssander@helix.jax.org:/home/ssander

The wget command

The commands wget and curl let you download files from the web directly from the command line.

wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
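Only wget is shown above, so here is a sketch of the curl equivalent. The fetch function is a hypothetical helper that uses wget when it is installed and falls back to curl otherwise; with curl, -L follows redirects and -O saves the file under its remote name, which matches what wget does by default.

```shell
# Hypothetical helper: download URL into the current directory using
# whichever of wget or curl is available.
fetch() {
    if command -v wget >/dev/null 2>&1; then
        wget "$1"
    elif command -v curl >/dev/null 2>&1; then
        curl -L -O "$1"    # -L: follow redirects, -O: keep the remote file name
    else
        echo "neither wget nor curl found" >&2
        return 1
    fi
}

# Usage:
#   fetch http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
```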

Exercise: wget

Use the wget command to download the Trimmomatic application. (Hint: You’ll need to go to the web and get the URL for the file to download).

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.36.zip

Key Points