01 Welcome and Introductions

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Who is providing this training?

Objectives
  • Introduction to the instructors and helpers.

Welcome and Introductions

Welcome to this introduction to HPC training.
Our HPC environment is like your digital lab space. Depending on what you are doing and who you’re doing it with, we have a few different lab spaces (or “clusters”) for you to work in.

Cluster | Use case
Sumhpc  | General Purpose Research
Winhpc  | GPU-based Research
Other   | Scientific Services Use Only

This training will cover:

What about when I have questions later?

We are going to direct you to lots of places to gather information.

Knowledge Checks

Throughout the training, we will ask you a few knowledge questions. Don't think of them as a pop quiz; instead, think of them like this:

Key Points

  • Welcome

  • Overview of training

  • Resources Available


02 Connecting to (and navigating in) the JAX HPC environment

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I connect to a remote system?

  • What do I do on the command line?

Objectives
  • Connect to the cluster via terminal and SSH

SSH Basics

The take-home message from this lesson.

The details on how ssh works.

Example ssh connection for this class

[user@edu-vm-63d1410f-2 ~]$ ssh user@35.227.70.171
The authenticity of host '35.227.70.171 (35.227.70.171)' can't be established.
ECDSA key fingerprint is SHA256:UTHv5IOvrF9uvxuh9Fo8uW2bx0BwCRLyrwHhONoiIj8.
ECDSA key fingerprint is MD5:15:bb:25:2a:3a:45:f4:c7:df:21:26:37:12:66:79:77.
Are you sure you want to continue connecting (yes/no)?

SFTP

SFTP stands for Secure File Transfer Protocol and is used to transfer files from one computer to another.

MobaXterm will automatically establish an SFTP session when SSH'ing into a remote system. It is visible as a yellow globe on the left-hand side of the MobaXterm terminal.
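
From the command line, a minimal SFTP session looks like the sketch below (the host is the class example above; the file names are placeholders):

$ sftp user@35.227.70.171
sftp> put local_file.txt        # upload a file to the remote system
sftp> get remote_results.csv    # download a file to your local machine
sftp> exit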

Key Points

  • ssh is the preferred way to connect to Linux servers

  • ssh is secure and protects information sent

  • sftp is available as a command-line application and in GUI-based applications for file transfers


03 From Computer to HPC

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What are the components of a computer?

  • What is the difference between a regular computer and an HPC system?

Objectives
  • Overview of General Computing

[Image: computer parts example]

Computer Component Review

Consumer Computer vs Servers vs HPC vs Sumhpc

Component        | Home/Business Computer | Server   | Typical HPC Node | Typical Total HPC System | Sumhpc Node | Total Sumhpc System
CPU (cores)      | 4 - 8                  | 12 - 128 | 32 - 128         | 1000s                    | 70*         | 7,000
RAM (GB)         | 8 - 16                 | 64 - 960 | 240 - 3000       | 64,000                   | 754 - 3 TB  | 76.8 TB
Disk (TB)        | 0.5 - 1                | 8 - 100  | None - 1         | 100s (networked)         | NA          | 2.7 PB
Networking (GbE) | 0.1 - 1                | 1 - 10   | 40 - 100         | 40 - 100                 | 40          | 40+

* Note: Sumhpc high_mem nodes contain 142 cores each and ~3 TB of RAM.

HPCs are servers networked together and managed with a scheduler

[Image: cluster example]

Key Points

  • HPCs are typically many large servers networked together

  • HPCs utilize networked disk space instead of local disk space


04 HPC Architecture Overview

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How is an HPC machine configured?

Objectives
  • Overview of HPC architecture

HPC architecture overview

General

[Image: HPC architecture example]

Cluster

A cluster is a group of computers networked together.

Login Node

The login node is where you connect to the cluster and submit jobs.

Administrative Nodes

The administrative nodes manage cluster scheduling and other admin tasks; users do not log in to these.

Compute Nodes

The compute nodes are where the computational tasks are carried out. Sumhpc includes 100 Supermicro X11DPT-B Series servers, each with 70 Intel Xeon Gold 6150 computable cores at 2.7GHz and 768 GB RAM. Sumhpc also includes 2 high_mem nodes with 142 Intel Xeon Gold 6150 cores at 2.7GHz and 3 TB RAM available per node for workloads that exceed 768 GB RAM.

Cluster Accessible Storage

Working storage (Tier 0, /fastscratch) on Sumhpc is provided by a Data Direct Networks Gridscalar GS7k GPFS storage appliance with 522 TB usable storage capacity.

Primary storage (Tier 1) on Sumhpc is provided by 27 Dell EMC Isilon scale-out NAS nodes, combined for a total of 2.7 PB raw storage capacity. Your home directory (50 GB capacity) is mounted on Tier 1, as are lab project folders.

Other storage (Tier 2) provides 2 PB of non-computable capacity; it is non-HPC storage and is not accessible to HPC resources. Access to Tier 2 is through the Globus application (see lesson 14).

Archival Storage:

Archival storage (Tier 3) is provided at both Bar Harbor, ME and Farmington, CT by 2 Quantum Artico StorNext appliances with 72 TB front-end disk capacity backed by a 4 PB tape library at each site. Data storage at this tier is replicated across both geographic sites.

Network Infrastructure:

Our network platform supports scientific and enterprise business systems, using 40Gb core switches in our server farm that deliver at least 1Gb to user devices. The environment includes wired and wireless network service and a redundant voice over IP (VoIP) system, and is protected by application firewalls. The network infrastructure for the HPC environment consists of a dedicated 100Gb backbone with 50Gb to each server in the cluster and 40Gb to each storage node. Internet service is delivered by a commercial service provider that can scale beyond 10Gb as demand for data transfer increases.

Fall Cluster

The Fall cluster includes 8 Supermicro X11DGQ Series servers, each with 46 Intel Xeon Gold 6136 cores at 3.00GHz and 192 GB RAM. Each server includes 4 Nvidia Tesla V100 32 GB GPU cards. This translates into 249.6 TFLOPS of double-precision, 502.4 TFLOPS of single-precision, and 4,000 Tensor TFLOPS of combined peak performance.

Key Points

  • Log in to the login nodes and submit job requests to the scheduler

  • HPC utilizes various tiers of storage optimized to different usage types


05 What is a Job

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What is a job?

Objectives
  • Understand the difference between running jobs locally and on an HPC system.

  • Understand what function a job serves in the HPC environment.

A job
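
One way to see the difference: on your own computer you run a program directly, while on the HPC you hand the same work to the scheduler as a job. A minimal sketch (the script name is hypothetical):

# On your laptop: the program starts immediately and uses local CPU/RAM
./my_analysis.sh

# On the HPC: you describe the work (a job) and the scheduler runs it
# on a compute node once the requested resources are available
sbatch my_analysis.sh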

Key Points

  • Understand the difference between running jobs locally and on an HPC system.

  • Understand what function a job serves in the HPC environment.


06 Directories

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What does the directory structure look like in the Sumhpc environment?

  • Where do I have space to store files?

  • What is /fastscratch and why is it erased?

  • How do I transfer files into and out of Sumhpc?

  • Are there additional storage locations?

Objectives
  • Know the 3 writeable locations on Sumhpc: /home/${USER}, /fastscratch, /projects

  • Know files are removed from /fastscratch after 10 days.

  • Know that additional storage is located on Tier2 and Tier3 (archive), but these are not directly accessible on the HPC system.

  • Know Globus is used to transfer data into and out of the HPC environment

What you will need to know today

We utilize 4 tiers of storage, each optimized for a specific role.

Tier  | Mounted as         | Backups? | Notes
Tier0 | /fastscratch       | None     | Shared space for temporary computational use
Tier1 | /home & /projects  | No*      |
Tier2 | Not mounted to HPC | Yes**    | Used for storing "cooler" data; not meant for HPC computations
Tier3 | Not mounted to HPC | Yes**    | Not accessible via Globus; Service Desk ticket required for archival and retrieval

* Tier1 is not backed up but does have a hidden .snapshot directory for recovery of deleted files within 7 days (this is not a true backup).
** Tiers 2 & 3 have cross-site replication between BH and CT.

We have 3 general locations (directories) that we use on the HPC.

Details and Suggestions for Directory Management for Your Review.

These are some suggestions for directory management

Directory Structure Best Practice

Time spent at the beginning of a project defining the folder hierarchy and file naming conventions will make it much easier to keep things organized and findable, both throughout the project and after project completion. Adhering to well-thought-out naming conventions:

  • helps prevent accidental overwrites or deletion
  • makes it easier to locate specific data files
  • makes collaborating on the same files less confusing

File naming best practices

Include a few pieces of descriptive information in the filename, in a standard order, to make it clear what the file contains. For example, filenames could include:

  • experiment name or acronym
  • researcher initials
  • date data collected
  • type of data
  • conditions
  • file version
  • file extension for application-specific files

Consider sort order:

If it is useful for files to stay in chronological order, a good convention is to start file names with YYYYMMDD or YYMMDD.

If you are using a sequential numbering system, use leading zeros to maintain sort order: 007 sorts before 010 and 700, whereas without leading zeros 10 would sort before 7.

Do not use special (i.e. non-alphanumeric) characters in names such as:
" / \ : * ? ‘ < > [ ] [ ] { } ( ) & $ ~ ! @ # % ^ , '

These could be interpreted by programs or operating systems in unexpected ways.

Do not use spaces in file or folder names, as some operating systems and programs handle them poorly, and you will need to enclose the names in quotation marks to reference them in scripts and programs. Alternatives to spaces in filenames:

  • Underscores, e.g. file_name.xxx
  • Dashes, e.g. file-name.xxx
  • No separation, e.g. filename.xxx
  • Camel case, where the first letter of each section of text is capitalized, e.g. FileName.xxx
  • Keep names short, no more than 25 characters.

File Versioning Best Practices

File versioning ensures that you always understand what version of a file you are working with, and what are the working and final versions of files. Recommended file versioning practices:

  • Include a version number at the end of the file name such as v01. Change this version number each time the file is saved.
  • For the final version, substitute the word FINAL for the version number.
  • Take advantage of the versioning capabilities available in collaborative workspaces such as GitHub, OSF, Google Drive, and Box.
  • Track versions of computer code with versioning software such as Git, Subversion, or CVS.

Directory Structure Best Practices

Directories can be organized in many different ways. Consider what makes sense for your project and research team, and how people new to the project might look for files.

Once you determine how you want your directories to be organized, it is a good idea to stub out an empty directory structure to hold future data, and to document the contents of each directory in a readme file.
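
For example, a minimal sketch of stubbing out such a structure from the command line (the project and directory names are hypothetical):

# Create an empty hierarchy for a new project in your /projects space
mkdir -p /projects/my_lab/my_project/{data/raw,data/processed,code,results,docs}

# Add a readme documenting what belongs in each directory
touch /projects/my_lab/my_project/README.md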

Directory Best Practices

  • Organize directories hierarchically, with broader topics at the top level of the hierarchy and more specific topics lower in the structure.
  • Group files of similar information together in a single directory.
  • Name directories after aspects of the project rather than after the names of individual researchers.
  • Once you have decided on a directory structure, follow it consistently and audit it periodically.
  • Separate ongoing and completed work.

Sources

http://guides.lib.umich.edu/datamanagement/files

Key Points

  • The 3 writeable locations on Sumhpc: /home/${USER}, /fastscratch, /projects

  • Files are removed from /fastscratch after 10 days.

  • Faculty must approve access to their /projects directory.

  • Good directory management can be helpful for organization.

  • Globus is used to transfer data into and out of the HPC environment (see lesson 14)


07 Slurm Basics

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is Slurm?

Objectives
  • Introduction to basic Slurm commands and parameters

General Slurm Commands and Parameters

Usage                          | Slurm Commands
submit job                     | srun, sbatch, salloc
view cluster status            | squeue, sinfo
view cluster config            | scontrol show partition, sacctmgr
view job information (history) | sacct, seff, sstat
cancel job                     | scancel <JobID>

Slurm documentation can be found here: https://slurm.schedmd.com/documentation.html

Usage: Submitting Jobs to Slurm
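
A hedged sketch of each submission command (the resource values and script name are examples, not requirements):

# srun: run a command, or an interactive shell, through the scheduler
srun -p compute -q batch -c 1 --mem=4G --time=1:00:00 --pty bash

# sbatch: submit a batch script that runs when resources become available
sbatch my_job.sh

# salloc: request an allocation, then run commands inside it
salloc -p compute -q batch -c 1 --mem=4G --time=1:00:00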

Usage: Viewing the current status of cluster
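
For example:

# Show all pending and running jobs
squeue

# Show only your own jobs ($USER expands to your username)
squeue -u $USER

# Show the state of the nodes and partitions
sinfo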

Usage: View cluster config

The commands below have multiple parts; execute the whole command as shown.
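
For example:

# Show the configuration of all partitions (or name one, e.g. compute)
scontrol show partition
scontrol show partition compute

# Show the QoS definitions and limits
sacctmgr show qos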

Usage: View job information
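
For example (<JobID> is a placeholder for a real job ID):

# Accounting information for a specific job
sacct -j <JobID>

# Resource usage of a job that is still running
sstat -j <JobID>

# Efficiency summary for a completed job
seff <JobID>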

Usage: Cancel job
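
For example:

# Cancel a single job by its job ID
scancel <JobID>

# Cancel all of your own jobs
scancel -u $USER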

Common Slurm Terms and Parameters for srun and sbatch

Common Name              | -flag | --flag          | Definition/notes
Account                  | -A    | --account       |
Array (sbatch only)      | -a    | --array         | Execute multiple jobs with one job request
CPUsPerTask              | -c    | --cpus-per-task | Usually set to number of CPUs/threads required
Dependency               | -d    | --dependency    | Make job execution dependent on a previous job
Pass Variable            |       | --export        | Pass BASH variable to job
Print help               | -h    | --help          |
Hold                     | -H    | --hold          | Queue job but do not run it
JobName                  | -J    | --job-name      | Set job name
Memory                   |       | --mem           | Set memory (RAM) requirement for job
Request specific nodes   | -w    | --nodelist      |
Number of nodes per job  | -N    | --nodes         | Usually set to 1 (may vary w/ advanced usage)
Number of tasks per node | -n    | --ntasks        | Usually set to 1 (may vary w/ advanced usage)
Output                   | -o    | --output        |
Partition                | -p    | --partition     |
QOS                      | -q    | --qos           |
Time Limit               | -t    | --time          | Set limit on job run time

Slurm Terms and Parameters for sacct

Common Name              | -flag | --flag       | Definition/notes
JobID                    | -j    | --jobs       | Display information about a job or jobs
UserID                   | -u    | --uid        |
Output Format            | -o    | --format     |
Output Format help       | -e    | --helpformat | Print a list of fields for the --format option
Output Format long style | -l    | --long       | Long format option
Jobs included after      | -S    | --starttime  |
Jobs included before     | -E    | --endtime    |
State of Jobs            | -s    | --state      |
Nodes                    | -N    | --nodelist   | Print jobs that ran on these nodes

Useful sacct format fields

Common Name | Description
AllocCPUS   |
AllocNodes  |
CPUTime     |
Elapsed     |
JobID       | The job's JobID
JobName     | Job name
MaxRSS      | Max memory used by the job
UserID      |
NCPUS       |
Partition   | Partition requested
QOS         | QOS requested
ReqMem      | Memory requested
Start       | Start time of the job
End         | End time of the job
State       | State of the job
ExitCode    | Exit code from the job

Key Points

  • srun can start an interactive job

  • sbatch executes batch jobs

  • srun and sbatch have similar parameters

  • scontrol and sacctmgr can show the Sumhpc partition and QoS configurations

  • sacct, seff, and sstat can provide access to my job history and resource utilization


08 Sumhpc

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is Sumhpc?

Objectives
  • Review of Sumhpc nodes

Sumhpc Basics

Partition               | compute    | high_mem    | dev
# of nodes              | 100        | 2           | 20
Usable memory per node  | 754GB      | 3022GB*     | 180GB
Usable cores per node   | 70         | 142         | 30
CPU core speed          | 2.70GHz    | 2.70GHz     | 3.60GHz
Total partition MEM     | 76T        | 6T          | 3.6T
Total partition cores   | 7000       | 284         | 600
How to use?             | -p compute | -p high_mem | -p dev & -q dev

*Note: The max available memory on the high_mem nodes is 3022GB, just shy of 3TB (3072GB in binary units). Also, the '--mem' value must be an integer (no decimals).

**Note: To use the dev partition you must specify both -p dev and -q dev when submitting jobs. See QoS.
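
For example, a dev-partition interactive session could be requested like this (the resource values are illustrative):

srun -p dev -q dev --cpus-per-task=1 --mem=4G --time=1:00:00 --pty bash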

Key Points

  • Sumhpc has 3 compute node types: compute, high_mem, and dev.

  • The compute node types refer to partitions in the Slurm configuration.


09 Sumhpc Slurm Configuration

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How is Sumhpc configured with Slurm?

  • How do partitions, QoS, and user set parameters work together?

Objectives
  • Understand how partitions, QoS, and user set parameters work together

A Layer Model of the Slurm Configuration (built from the bottom up)

Layer   | Slurm Term          | Example(s)                | Notes
Layer 4 | user-set parameters | -N, -n, -c, --mem, --time | User defines what they need for a particular job
Layer 3 | GRES                | -                         | Optional; GPU cluster only for now, not needed for intro-to-hpc
Layer 2 | QoS                 | batch, long, dev          | User limits set by cluster administration
Layer 1 | Partition           | compute, dev, high_mem    | The hardware (based upon hardware properties)

Sumhpc’s Partitions

Partition               | compute     | high_mem    | dev
# of nodes              | 100         | 2           | 20
Usable memory per node  | 754GB       | 3022GB*     | 180GB
Usable cores per node   | 70          | 142         | 30
CPU core speed          | 2.70GHz     | 2.70GHz     | 3.60GHz
Total partition MEM     | 76T         | 6T          | 3.6T
Total partition cores   | 7000        | 284         | 600
MaxTime                 | 14-00:00:00 | 3-00:00:00  | 8:00:00
How to use?             | -p compute  | -p high_mem | -p dev & -q dev

Sumhpc’s QoS.

QoS                                 | batch             | long      | dev
Max Walltime per job                | 3 days            | 14 days   | 8 hours
Default Walltime (if not specified) | 1 hour            | 1 hour    | 1 hour
Max CPU per user                    | 700 cores         | 140 cores | 60 cores
Default CPU (if not specified)      | 1 core            | 1 core    | 1 core
Max Memory per user                 | 7.6TB             | 1TB       | 360GB
Default Memory (if not specified)   | 1GB               | 1GB       | 1GB
Max running jobs per user           | 700               | 10        | 60
How to use?                         | -q batch          | -q long   | -p dev & -q dev
Partitions allowed on               | compute, high_mem | compute*  | dev

* long QoS is currently allowed on high_mem, but the high_mem partition only allows jobs to run for 3-00:00:00 (72 hr), so just use batch QoS. Also, long QoS is limited to 1TB RAM.
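
Putting the layers together, a single job request names a partition (Layer 1), a QoS (Layer 2), and the user-set parameters (Layer 4). A hedged example (the script name and resource values are illustrative):

sbatch -p high_mem -q batch -c 16 --mem=1500G --time=2-00:00:00 my_bigmem_job.sh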

Important srun and sbatch parameters; these are set per job.

Common Name              | -flag | --flag          | Definition/notes
CPUsPerTask              | -c    | --cpus-per-task | Usually set to number of CPUs/threads required
Memory                   |       | --mem           | Set memory (RAM) requirement for job
Request specific nodes   | -w    | --nodelist      |
Partition                | -p    | --partition     |
QOS                      | -q    | --qos           |
Number of nodes per job  | -N    | --nodes         | Usually set to 1 (may vary w/ advanced usage)
Number of tasks per node | -n    | --ntasks        | Usually set to 1 (may vary w/ advanced usage)
Time Limit               | -t    | --time          | Set limit on job run time

Key Points

  • Partitions are sets of nodes with specific hardware configurations.

  • QoS sets limits on resources (it is a shared system).

  • Users also set specific parameters so the resources allocated by the scheduler match what is needed.

  • Job queue wait times are optimized when Slurm configuration parameters match the job's needs.


10 Interactive SRUN Jobs

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I run an interactive job with srun?

Objectives
  • Introduction to srun interactive jobs.

  • Introduction to setting slurm parameters.

  • Highlight how program resource parameters still need to be set.

Slurm srun Basics

To get an interactive session on a compute node with srun (no MPI):
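
A minimal sketch (the partition, QoS, and resource values are examples; adjust them to your job):

srun -p compute -q batch -N 1 -n 1 -c 4 --mem=16G --time=4:00:00 --pty bash

When the allocation is granted, the prompt changes to a shell running on a compute node; type exit to end the session and release the resources.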

Hands on srun example

Look at the node CPU information

Let's take a look at the CPU information on the node using the lscpu command:

$ lscpu 

Result

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:              4
CPU MHz:               1199.871
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              5400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d

Look at the Node Available RAM (Memory)

Let's take a look at the available memory on the node using the free -h command:

$ free -h 

Result

             total        used        free      shared  buff/cache   available
Mem:           754G         16G        292G        789M        445G        734G
Swap:           63G        319M         63G

Key Points

  • Set slurm run parameters correctly.

  • Set your script/program’s core and mem parameters correctly.


11 Software on Sumhpc

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do we access software on Sumhpc?

Objectives
  • Identify where software can be installed

  • How to access low-level software such as libraries, compilers, Singularity, and programming language binaries with Environment Modules.

  • Identify how and when to use Singularity.

Software Usage on the SumHPC Cluster

On Sumhpc, we have overhauled how researchers install and access the software they need. Users are still able to download and install software into their userspace (/projects or /home directories). Additionally, low-level development tools such as gcc, openMPI, openJDK, and basic libraries are still available via the module system. The new and exciting way we provide software on Sumhpc, however, is through software containers. Singularity containers are a cutting-edge way of creating your own custom software modules. Containers not only provide you with the software you need, but also a contained environment which ensures that your software runs exactly the same way whether it is on your laptop, on the HPC resources, with a collaborator, or in the cloud.


Userspace Installation

The simplest method of installing software for use with Sumhpc is to install it directly into your userspace: any directory you have permission to write to and modify, such as your group's /projects directory or your personal /home directory. You also have write permissions on /fastscratch and /tmp, but these are not recommended due to their ephemeral nature. Most software installations will default to writing under a /usr/ or /lib/ path. Since these locations are shared by everyone, these "global" trees are not writable by the typical HPC user. Instead, many software installations allow you to perform a "local" install in a directory of your choosing. Please consult your software's documentation to find out how to change its install path.

Environment Modules

Now, only low-level software such as libraries, compilers, Singularity, and programming language binaries is available through modules. To see what modules are available, run the command module avail. For more information about how to use the module command, see the documentation.
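
For example:

# List every module available on the cluster
module avail

# Load one (Singularity is shown because it is used below)
module load singularity

# See what is currently loaded
module list

# Unload all loaded modules
module purge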

Singularity Containers

The newest and most liberating change to software installations on HPC is the introduction of Singularity. Containerization allows you to install whatever software you want inside of a singular, stand-alone Singularity image file (.sif). This file contains your software, custom environment, and metadata all in one.

To use Singularity, all you have to do is load the module by running module load singularity in a running job session on Sumhpc. Singularity can download new containers (singularity pull) and run existing ones (singularity run/exec/shell).
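
A hedged example of the typical workflow (the Ubuntu image is only an illustration; pull whichever container your analysis needs):

# Inside a running job session on Sumhpc
module load singularity

# Download a container image from a registry (creates ubuntu_20.04.sif)
singularity pull docker://ubuntu:20.04

# Run a single command inside the container
singularity exec ubuntu_20.04.sif cat /etc/os-release

# Or open an interactive shell inside the container
singularity shell ubuntu_20.04.sif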

Software Hands On Example.

Key Points

  • Users are able to download and install software into their userspace (/projects or /home directories)

  • Low-level software such as libraries, compilers, Singularity, and programming language binaries is available through Environment Modules.

  • Singularity is a new and liberating change to software installations on HPC.


12 Batch Jobs with sbatch

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What are the advantages of sbatch over srun?

  • How do I run an sbatch job?

  • Where can I find example headers?

  • How do I format sbatch headers?

  • How do I capture the jobid?

Objectives
  • Know how to format an sbatch script.

  • Know how to run an sbatch job.

  • Know how to capture the jobid.

  • Understand the usefulness of slurm arrays.

sbatch basics

The Slurm sbatch header

#!/bin/bash

#SBATCH --job-name=MY_JOB    # Set job name
#SBATCH --partition=dev      # Set the partition 
#SBATCH --qos=dev            # Set the QoS
#SBATCH --nodes=1            # Sets the number of nodes; do not change for non-MPI jobs unless you know what you're doing
#SBATCH --ntasks=1           # Sets the number of tasks; do not change for non-MPI jobs unless you know what you're doing
#SBATCH --cpus-per-task=4    # Set the number of CPUs for task (change to number of CPU/threads utilized) [-p dev -q dev limited to 30 CPUs per node]
#SBATCH --mem=24GB           # Set to a value ~10-20% greater than max amount of memory the job will use (or ~6 GB per core, for dev) (limited to 180 GB per node on dev partition)
#SBATCH --time=8:00:00       # Set the max time limit (dev partition/QoS has 8 hr limit)

###-----load modules if needed-------###
module load singularity

###-----run script below this line-------###



Example

Arrays are a useful extension to sbatch jobs. For example, each array task can pull its own line of input from a list:

NEWVAR=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" /home/${USER}/my_important_list.txt)

Array examples
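
A minimal sketch of an array script and its submission (the script and list file names are hypothetical; the resource values are examples):

#!/bin/bash
#SBATCH --job-name=MY_ARRAY_JOB
#SBATCH --partition=compute
#SBATCH --qos=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=1:00:00
#SBATCH --array=1-10         # Run 10 tasks; SLURM_ARRAY_TASK_ID takes the values 1..10

# Each task pulls its own input line from the list file, as shown above
SAMPLE=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" /home/${USER}/my_important_list.txt)
echo "Processing ${SAMPLE}"

Submitting the array and capturing the job ID in a file with output redirection:

# --parsable prints only the job ID, which we redirect into a file
sbatch --parsable my_array_job.sh >> submitted_jobs.txt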

Key Points

  • sbatch allows you to schedule jobs that will execute as soon as the requested resources are available.

  • Example headers can be found at https://github.com/TheJacksonLaboratory/slurm-templates

  • Common slurm batch file extensions include .sh and .slurm

  • Save the job ID into a file with output redirection


13 Job Statistics and Histories

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What types of historical information are available for my jobs?

Objectives
  • Understand how to use seff and sacct

Use seff to view information about a particular job.
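
For example (<JobID> is a placeholder):

seff <JobID>

seff summarizes the CPU and memory efficiency of a completed job, which helps you right-size the --cpus-per-task and --mem requests of future jobs.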

Use sacct to view more information for more jobs:

sacct -u ${USER} -S 2022-06-17 -oUser,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,CPUTime,nodelist

Key Points

  • sacct shows details on memory utilization (MaxRSS), but only for completed jobs.


14 Globus for Data Transfers

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What is Globus?

  • When do I need to use Globus?

Objectives
  • Understand what Globus is used for.

Globus

Key Points

  • Use Globus to transfer files to and from HPC.

  • Use Globus to transfer files to and from Tier2.