01 Welcome and Introductions

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Who is providing this training?

Objectives
  • Introduction to the instructors and helpers.

Welcome and Introductions

Welcome to this introduction to HPC training.
Our HPC environment is like your digital lab space. Depending on what you are doing and who you’re doing it with, we have a few different lab spaces (or “clusters”) for you to work in.

Cluster | Use case
Sumhpc  | General Purpose Research
Winhpc  | GPU-based Research
Other   | Scientific Services Use Only

This training will cover:

What about when I have questions later?

We are going to direct you to lots of places to gather information.

Knowledge Checks

Throughout the training, we will ask you a few knowledge questions. Don't think of them as a pop quiz; instead, think of them like this:

Key Points

  • Welcome

  • Overview of training

  • Resources Available


02 Connecting to (and navigating in) the JAX HPC environment

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I connect to a remote system?

  • What do I do on the command line?

Objectives
  • Connect to the cluster via terminal and SSH

SSH Basics

The take-home message from this lesson.

The details on how ssh works.

Example ssh connection for this class

[user@edu-vm-63d1410f-2 ~]$ ssh user@35.227.70.171
The authenticity of host '35.227.70.171 (35.227.70.171)' can't be established.
ECDSA key fingerprint is SHA256:UTHv5IOvrF9uvxuh9Fo8uW2bx0BwCRLyrwHhONoiIj8.
ECDSA key fingerprint is MD5:15:bb:25:2a:3a:45:f4:c7:df:21:26:37:12:66:79:77.
Are you sure you want to continue connecting (yes/no)?

SFTP

SFTP stands for Secure File Transfer Protocol and is used to transfer files from one computer to another.

MobaXterm will automatically establish an SFTP session when SSH'ing into a remote system. It is visible as a yellow globe on the left-hand side of the MobaXterm terminal.
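
From the command line, a minimal SFTP session looks like the sketch below (the host is the class example above; the file names are placeholders):

$ sftp user@35.227.70.171
sftp> put local_file.txt        # upload a file to the remote system
sftp> get remote_results.csv    # download a file to your local machine
sftp> exit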

Key Points

  • ssh is the preferred way to connect to Linux servers

  • ssh is secure and protects information sent

  • sftp is available as a command-line application and in GUI-based applications for file transfers


03 From Computer to HPC

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What are the components of a computer?

  • What is the difference between a regular computer and an HPC system?

Objectives
  • Overview of General Computing

[Image: computer parts example]

Computer Component Review

Consumer Computer vs Servers vs HPC vs Sumhpc

Component        | Home/Business Computer | Server   | Typical HPC Node | Typical Total HPC System | Sumhpc Node | Total Sumhpc System
CPU (cores)      | 4 - 8                  | 12 - 128 | 32 - 128         | 1000s                    | 70*         | 7,000
RAM (GB)         | 8 - 16                 | 64 - 960 | 240 - 3000       | 64,000                   | 754 - 3 TB  | 76.8 TB
Disk (TB)        | 0.5 - 1                | 8 - 100  | None - 1         | 100s (networked)         | NA          | 2.7 PB
Networking (GbE) | 0.1 - 1                | 1 - 10   | 40 - 100         | 40 - 100                 | 40          | 40+

* Note: Sumhpc high_mem nodes contain 142 cores each and ~3 TB of RAM.

HPCs are servers networked together and managed with a scheduler

[Image: cluster example]

Key Points

  • HPCs are typically many large servers networked together

  • HPCs utilize networked disk space instead of local disk space


04 HPC Architecture Overview

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How is an HPC machine configured?

Objectives
  • Overview of HPC architecture

HPC architecture overview

General

[Image: HPC architecture example]

Cluster

A cluster is a group of computers networked together.

Login Node

The login node is where you connect to the cluster and submit jobs.

Administrative Nodes

The administrative nodes manage cluster scheduling and other admin tasks; users do not log in to these.

Compute Nodes

The compute nodes are where the computational tasks are carried out. Sumhpc includes 100 Supermicro X11DPT-B Series servers, each with 70 Intel Xeon Gold 6150 computable cores at 2.7GHz and 768 GB RAM. Sumhpc also includes 2 high_mem nodes with 142 Intel Xeon Gold 6150 cores at 2.7GHz and 3 TB RAM available per node for workloads that exceed 768 GB RAM.

Cluster Accessible Storage

Working storage (Tier 0, /fastscratch) on Sumhpc is provided by a Data Direct Networks Gridscalar GS7k GPFS storage appliance with 522 TB usable storage capacity.

Primary storage (Tier 1) on Sumhpc is provided by 27 Dell EMC Isilon scale-out NAS nodes, combined for a total of 2.7 PB raw storage capacity. Your home directory (50 GB capacity) is mounted on Tier 1, as are lab project folders.

Other storage (Tier 2) provides 2 PB of non-computable capacity; it is non-HPC storage and is not accessible to HPC resources. Access to Tier 2 is through the Globus application (see lesson 14).

Archival Storage:

Archival storage (Tier 3) is provided at both Bar Harbor, ME and Farmington, CT by 2 Quantum Artico StorNext appliances with 72 TB front-end disk capacity backed by a 4 PB tape library at each site. Data storage at this tier is replicated across both geographic sites.

Network Infrastructure:

Our network platform supports scientific and enterprise business systems, using 40Gb core switches in our server farm that deliver at least 1Gb to user devices. The environment includes wired and wireless network service and a redundant voice over IP (VoIP) system, and is protected by application firewalls. The network infrastructure for the HPC environment consists of a dedicated 100Gb backbone with 50Gb to each server in the cluster and 40Gb to each storage node. Internet service is delivered by a commercial service provider that can scale beyond 10Gb as demand for data transfer increases.

Fall Cluster

The Fall cluster includes 8 Supermicro X11DGQ Series servers, each with 46 Intel Xeon Gold 6136 cores at 3.00GHz and 192 GB RAM. Each server includes 4 Nvidia Tesla V100 32 GB GPU cards. This translates into 249.6 TFLOPS of double-precision, 502.4 TFLOPS of single-precision, and 4,000 Tensor TFLOPS of combined peak performance.

Key Points

  • Log in to the login nodes and submit job requests to the scheduler

  • HPC utilizes various tiers of storage optimized to different usage types


05 What is a Job

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What is a job?

Objectives
  • Understand the difference between running jobs locally and on an HPC system.

  • Understand what function a job serves in the HPC environment.

A job
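
One way to see the difference: on your own computer you run a program directly, while on the HPC you hand the same work to the scheduler as a job. A minimal sketch (the script name is hypothetical):

# On your laptop: the program starts immediately and uses local CPU/RAM
./my_analysis.sh

# On the HPC: you describe the work (a job) and the scheduler runs it
# on a compute node once the requested resources are available
sbatch my_analysis.sh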

Key Points

  • Understand the difference between running jobs locally and on an HPC system.

  • Understand what function a job serves in the HPC environment.


06 Directories

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What does the directory structure look like in the Sumhpc environment?

  • Where do I have space to store files?

  • What is /fastscratch and why is it erased?

  • How do I transfer files into and out of Sumhpc?

  • Are there additional storage locations?

Objectives
  • Know the 3 writeable locations on Sumhpc: /home/${USER}, /fastscratch, /projects

  • Know files are removed from /fastscratch after 10 days.

  • Know that additional storage is located on Tier2 and Tier3 (archive), but these are not directly accessible on the HPC system.

  • Know Globus is used to transfer data into and out of the HPC environment

What you will need to know today

We utilize 4 tiers of storage, each optimized for a specific role.

Tier  | Mounted as         | Backups? | Notes
Tier0 | /fastscratch       | None     | Shared space for temporary computational use
Tier1 | /home & /projects  | No*      |
Tier2 | Not mounted to HPC | Yes**    | Used for storing "cooler" data; not meant for HPC computations
Tier3 | Not mounted to HPC | Yes**    | Not accessible via Globus; Service Desk ticket required for archival and retrieval

* Tier1 is not backed up but does have a hidden .snapshot directory for recovery of deleted files within 7 days (this is not a true backup).
** Tiers 2 & 3 have cross-site replication between BH and CT.

We have 3 general locations (directories) that we use on the HPC.

Details and Suggestions for Directory Management for Your Review.

These are some suggestions for directory management

Directory Structure Best Practice

Time spent at the beginning of a project defining the folder hierarchy and file naming conventions will make it much easier to keep things organized and findable, both throughout the project and after project completion. Adhering to well-thought-out naming conventions:

  • helps prevent accidental overwrites or deletion
  • makes it easier to locate specific data files
  • makes collaborating on the same files less confusing

File naming best practices

Include a few pieces of descriptive information in the filename, in a standard order, to make it clear what the file contains. For example, filenames could include:

  • experiment name or acronym
  • researcher initials
  • date data collected
  • type of data
  • conditions
  • file version
  • file extension for application-specific files

Consider sort order:

If it is useful for files to stay in chronological order, a good convention is to start file names with YYYYMMDD or YYMMDD.

If you are using a sequential numbering system, use leading zeros to maintain sort order: 007 sorts before 010 and 700, whereas without leading zeros 10 would sort before 7.

Do not use special (i.e. non-alphanumeric) characters in names such as:
" / \ : * ? ‘ < > [ ] [ ] { } ( ) & $ ~ ! @ # % ^ , '

These could be interpreted by programs or operating systems in unexpected ways.

Do not use spaces in file or folder names, as some operating systems and programs handle them poorly, and you will need to enclose the names in quotation marks to reference them in scripts and programs. Alternatives to spaces in filenames:

  • Underscores, e.g. file_name.xxx
  • Dashes, e.g. file-name.xxx
  • No separation, e.g. filename.xxx
  • Camel case, where the first letter of each section of text is capitalized, e.g. FileName.xxx
  • Keep names short, no more than 25 characters.

File Versioning Best Practices

File versioning ensures that you always understand what version of a file you are working with, and what are the working and final versions of files. Recommended file versioning practices:

  • Include a version number at the end of the file name such as v01. Change this version number each time the file is saved.
  • For the final version, substitute the word FINAL for the version number.
  • Take advantage of the versioning capabilities available in collaborative workspaces such as GitHub, OSF, Google Drive, and Box.
  • Track versions of computer code with versioning software such as Git, Subversion, or CVS.

Directory Structure Best Practices

Directories can be organized in many different ways. Consider what makes sense for your project and research team, and how people new to the project might look for files.

Once you determine how you want your directories to be organized, it is a good idea to stub out an empty directory structure to hold future data, and to document the contents of each directory in a readme file.
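
For example, a minimal sketch of stubbing out such a structure from the command line (the project and directory names are hypothetical):

# Create an empty hierarchy for a new project in your /projects space
mkdir -p /projects/my_lab/my_project/{data/raw,data/processed,code,results,docs}

# Add a readme documenting what belongs in each directory
touch /projects/my_lab/my_project/README.md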

Directory Best Practices

  • Organize directories hierarchically, with broader topics at the top level of the hierarchy and more specific topics lower in the structure.
  • Group files of similar information together in a single directory.
  • Name directories after aspects of the project rather than after the names of individual researchers.
  • Once you have decided on a directory structure, follow it consistently and audit it periodically.
  • Separate ongoing and completed work.

Sources

http://guides.lib.umich.edu/datamanagement/files

Key Points

  • The 3 writeable locations on Sumhpc: /home/${USER}, /fastscratch, /projects

  • Files are removed from /fastscratch after 10 days.

  • Faculty must approve access to their /projects directory.

  • Good directory management can be helpful for organization.

  • Globus is used to transfer data into and out of the HPC environment (see lesson 14)


07 Slurm Basics

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is Slurm?

Objectives
  • Introduction to basic Slurm commands and parameters

General Slurm Commands and Parameters

Usage                          | Slurm Commands
submit job                     | srun, sbatch, salloc
view cluster status            | squeue, sinfo
view cluster config            | scontrol show partition, sacctmgr
view job information (history) | sacct, seff, sstat
cancel job                     | scancel <JobID>

Slurm documentation can be found here: https://slurm.schedmd.com/documentation.html

Usage: Submitting Jobs to Slurm
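
A hedged sketch of each submission command (the resource values and script name are examples, not requirements):

# srun: run a command, or an interactive shell, through the scheduler
srun -p compute -q batch -c 1 --mem=4G --time=1:00:00 --pty bash

# sbatch: submit a batch script that runs when resources become available
sbatch my_job.sh

# salloc: request an allocation, then run commands inside it
salloc -p compute -q batch -c 1 --mem=4G --time=1:00:00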

Usage: Viewing the current status of cluster
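
For example:

# Show all pending and running jobs
squeue

# Show only your own jobs ($USER expands to your username)
squeue -u $USER

# Show the state of the nodes and partitions
sinfo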

Usage: View cluster config

The commands below have multiple parts; execute the whole command as shown.
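
For example:

# Show the configuration of all partitions (or name one, e.g. compute)
scontrol show partition
scontrol show partition compute

# Show the QoS definitions and limits
sacctmgr show qos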

Usage: View job information
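
For example (<JobID> is a placeholder for a real job ID):

# Accounting information for a specific job
sacct -j <JobID>

# Resource usage of a job that is still running
sstat -j <JobID>

# Efficiency summary for a completed job
seff <JobID>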

Usage: Cancel job
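
For example:

# Cancel a single job by its job ID
scancel <JobID>

# Cancel all of your own jobs
scancel -u $USER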

Common Slurm Terms and Parameters for srun and sbatch

Common Name              | -flag | --flag          | Definition/notes
Account                  | -A    | --account       |
Array (sbatch only)      | -a    | --array         | Execute multiple jobs with one job request
CPUsPerTask              | -c    | --cpus-per-task | Usually set to number of CPUs/threads required
Dependency               | -d    | --dependency    | Make job execution dependent on a previous job
Pass Variable            |       | --export        | Pass BASH variable to job
Print help               | -h    | --help          |
Hold                     | -H    | --hold          | Queue job but do not run it
JobName                  | -J    | --job-name      | Set job name
Memory                   |       | --mem           | Set memory (RAM) requirement for job
Request specific nodes   | -w    | --nodelist      |
Number of nodes per job  | -N    | --nodes         | Usually set to 1 (may vary w/ advanced usage)
Number of tasks per node | -n    | --ntasks        | Usually set to 1 (may vary w/ advanced usage)
Output                   | -o    | --output        |
Partition                | -p    | --partition     |
QOS                      | -q    | --qos           |
Time Limit               | -t    | --time          | Set limit on job run time

Slurm Terms and Parameters for sacct

Common Name              | -flag | --flag       | Definition/notes
JobID                    | -j    | --jobs       | Display information about a job or jobs
UserID                   | -u    | --uid        |
Output Format            | -o    | --format     |
Output Format help       | -e    | --helpformat | Print a list of fields for the --format option
Output Format long style | -l    | --long       | Long format option
Jobs included after      | -S    | --starttime  |
Jobs included before     | -E    | --endtime    |
State of Jobs            | -s    | --state      |
Nodes                    | -N    | --nodelist   | Print jobs that ran on these nodes

Useful sacct format fields

Common Name | Description
AllocCPUS   |
AllocNodes  |
CPUTime     |
Elapsed     |
JobID       | The job's JobID
JobName     | Job name
MaxRSS      | Max memory used by the job
UserID      |
NCPUS       |
Partition   | Partition requested
QOS         | QOS requested
ReqMem      | Memory requested
Start       | Start time of the job
End         | End time of the job
State       | State of the job
ExitCode    | Exit code from the job

Key Points

  • srun can start an interactive job

  • sbatch executes batch jobs

  • srun and sbatch have similar parameters

  • scontrol and sacctmgr can show the Sumhpc partition and QoS configurations

  • sacct, seff, and sstat can provide access to my job history and resource utilization


08 Sumhpc

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is Sumhpc?

Objectives
  • Review of Sumhpc nodes

Sumhpc Basics

Partition               | compute    | high_mem    | dev
# of nodes              | 100        | 2           | 20
Usable memory per node  | 754GB      | 3022GB*     | 180GB
Usable cores per node   | 70         | 142         | 30
CPU core speed          | 2.70GHz    | 2.70GHz     | 3.60GHz
Total partition MEM     | 76T        | 6T          | 3.6T
Total partition cores   | 7000       | 284         | 600
How to use?             | -p compute | -p high_mem | -p dev & -q dev

*Note: The max available memory on the high_mem nodes is 3022GB, just shy of 3TB (3072GB in binary units). Also, the '--mem' value must be an integer (no decimals).

**Note: To use the dev partition you must specify both -p dev and -q dev when submitting jobs. See QoS.
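
For example, a dev-partition interactive session could be requested like this (the resource values are illustrative):

srun -p dev -q dev --cpus-per-task=1 --mem=4G --time=1:00:00 --pty bash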

Key Points

  • Sumhpc has 3 compute node types: compute, high_mem, and dev.

  • The compute node types refer to partitions in the Slurm configuration.


09 Sumhpc Slurm Configuration

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How is Sumhpc configured with Slurm?

  • How do partitions, QoS, and user set parameters work together?

Objectives
  • Understand how partitions, QoS, and user set parameters work together

A Layer Model of the Slurm Configuration (built from the bottom up)

Layer   | Slurm Term          | Example(s)                | Notes
Layer 4 | user-set parameters | -N, -n, -c, --mem, --time | User defines what they need for a particular job
Layer 3 | GRES                | -                         | Optional; GPU cluster only for now, not needed for intro-to-hpc
Layer 2 | QoS                 | batch, long, dev          | User limits set by cluster administration
Layer 1 | Partition           | compute, dev, high_mem    | The hardware (based upon hardware properties)

Sumhpc’s Partitions

Partition               | compute     | high_mem    | dev
# of nodes              | 100         | 2           | 20
Usable memory per node  | 754GB       | 3022GB*     | 180GB
Usable cores per node   | 70          | 142         | 30
CPU core speed          | 2.70GHz     | 2.70GHz     | 3.60GHz
Total partition MEM     | 76T         | 6T          | 3.6T
Total partition cores   | 7000        | 284         | 600
MaxTime                 | 14-00:00:00 | 3-00:00:00  | 8:00:00
How to use?             | -p compute  | -p high_mem | -p dev & -q dev

Sumhpc’s QoS.

QoS                                 | batch             | long      | dev
Max Walltime per job                | 3 days            | 14 days   | 8 hours
Default Walltime (if not specified) | 1 hour            | 1 hour    | 1 hour
Max CPU per user                    | 700 cores         | 140 cores | 60 cores
Default CPU (if not specified)      | 1 core            | 1 core    | 1 core
Max Memory per user                 | 7.6TB             | 1TB       | 360GB
Default Memory (if not specified)   | 1GB               | 1GB       | 1GB
Max running jobs per user           | 700               | 10        | 60
How to use?                         | -q batch          | -q long   | -p dev & -q dev
Partitions allowed on               | compute, high_mem | compute*  | dev

* long QoS is currently allowed on high_mem, but the high_mem partition only allows jobs to run for 3-00:00:00 (72 hr), so just use batch QoS. Also, long QoS is limited to 1TB RAM.
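
Putting the layers together, a single job request names a partition (Layer 1), a QoS (Layer 2), and the user-set parameters (Layer 4). A hedged example (the script name and resource values are illustrative):

sbatch -p high_mem -q batch -c 16 --mem=1500G --time=2-00:00:00 my_bigmem_job.sh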

Important srun and sbatch parameters; these are set per job.

Common Name              | -flag | --flag          | Definition/notes
CPUsPerTask              | -c    | --cpus-per-task | Usually set to number of CPUs/threads required
Memory                   |       | --mem           | Set memory (RAM) requirement for job
Request specific nodes   | -w    | --nodelist      |
Partition                | -p    | --partition     |
QOS                      | -q    | --qos           |
Number of nodes per job  | -N    | --nodes         | Usually set to 1 (may vary w/ advanced usage)
Number of tasks per node | -n    | --ntasks        | Usually set to 1 (may vary w/ advanced usage)
Time Limit               | -t    | --time          | Set limit on job run time

Key Points

  • Partitions are sets of nodes with specific hardware configurations.

  • QoS sets limits on resources (it is a shared system).

  • Users also set specific parameters so the resources allocated by the scheduler match what is needed.

  • Job queue wait times are optimized when Slurm configuration parameters match the job's needs.


10 Interactive SRUN Jobs

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I run an interactive job with srun?

Objectives
  • Introduction to srun interactive jobs.

  • Introduction to setting slurm parameters.

  • Highlight how program resource parameters still need to be set.

Slurm srun Basics

To get an interactive session on a compute node with srun (no MPI):
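
A minimal sketch (the partition, QoS, and resource values are examples; adjust them to your job):

srun -p compute -q batch -N 1 -n 1 -c 4 --mem=16G --time=4:00:00 --pty bash

When the allocation is granted, the prompt changes to a shell running on a compute node; type exit to end the session and release the resources.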

Hands on srun example

Look at the node CPU information

Let's take a look at the CPU information on the node using the lscpu command:

$ lscpu 

Result

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:              4
CPU MHz:               1199.871
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              5400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d

Look at the Node Available RAM (Memory)

Let's take a look at the available memory on the node using the free -h command:

$ free -h 

Result

             total        used        free      shared  buff/cache   available
Mem:           754G         16G        292G        789M        445G        734G
Swap:           63G        319M         63G

Key Points

  • Set slurm run parameters correctly.

  • Set your script/program’s core and mem parameters correctly.


11 Software on Sumhpc

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do we access software on Sumhpc?

Objectives
  • Identify where software can be installed

  • How to access low-level software such as libraries, compilers, Singularity, and programming language binaries with Environment Modules.

  • Identify how and when to use Singularity.

Software Usage on the SumHPC Cluster

On Sumhpc, we have overhauled how researchers install and access the software they need. Users are still able to download and install software into their userspace (/projects or /home directories). Additionally, low-level development tools such as gcc, openMPI, openJDK, and basic libraries are still available via the module system. The new and exciting way we provide software on Sumhpc, however, is through software containers. Singularity containers are a cutting-edge way of creating your own custom software modules. Containers not only provide you with the software you need, but also a contained environment which ensures that your software runs exactly the same way whether it is on your laptop, on the HPC resources, with a collaborator, or in the cloud.


Userspace Installation

The simplest method of installing software for use with Sumhpc is to install it directly into your userspace: any directory you have permission to write to and modify, such as your group's /projects directory or your personal /home directory. You also have write permissions on /fastscratch and /tmp, but these are not recommended due to their ephemeral nature. Most software installations will default to writing under a /usr/ or /lib/ path. Since these locations are shared by everyone, these "global" trees are not writable by the typical HPC user. Instead, many software installations allow you to perform a "local" install in a directory of your choosing. Please consult your software's documentation to find out how to change its install path.

Environment Modules

Now, only low-level software such as libraries, compilers, Singularity, and programming language binaries is available through modules. To see what modules are available, run the command module avail. For more information about how to use the module command, see the documentation.
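
For example:

# List every module available on the cluster
module avail

# Load one (Singularity is shown because it is used below)
module load singularity

# See what is currently loaded
module list

# Unload all loaded modules
module purge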

Singularity Containers

The newest and most liberating change to software installations on HPC is the introduction of Singularity. Containerization allows you to install whatever software you want inside of a singular, stand-alone Singularity image file (.sif). This file contains your software, custom environment, and metadata all in one.

To use Singularity, all you have to do is load the module by running module load singularity in a running job session on Sumhpc. Singularity can download new containers (singularity pull) and run existing ones (singularity run/exec/shell).
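
A hedged example of the typical workflow (the Ubuntu image is only an illustration; pull whichever container your analysis needs):

# Inside a running job session on Sumhpc
module load singularity

# Download a container image from a registry (creates ubuntu_20.04.sif)
singularity pull docker://ubuntu:20.04

# Run a single command inside the container
singularity exec ubuntu_20.04.sif cat /etc/os-release

# Or open an interactive shell inside the container
singularity shell ubuntu_20.04.sif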

Software Hands On Example.

Key Points

  • Users are able to download and install software into their userspace (/projects or /home directories)

  • Low-level software such as libraries, compilers, Singularity, and programming language binaries is available through Environment Modules.

  • Singularity is a new and liberating change to software installations on HPC.


12 Batch Jobs with sbatch

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What are the advantages of sbatch over srun?

  • How do I run an sbatch job?

  • Where can I find example headers?

  • How do I format sbatch headers?

  • How do I capture the jobid?

Objectives
  • Know how to format an sbatch script.

  • Know how to run an sbatch job.

  • Know how to capture the jobid.

  • Understand the usefulness of slurm arrays.

sbatch basics

The Slurm sbatch header

#!/bin/bash

#SBATCH --job-name=MY_JOB    # Set job name
#SBATCH --partition=dev      # Set the partition 
#SBATCH --qos=dev            # Set the QoS
#SBATCH --nodes=1            # Sets the number of nodes; do not change for non-MPI jobs unless you know what you're doing
#SBATCH --ntasks=1           # Sets the number of tasks; do not change for non-MPI jobs unless you know what you're doing
#SBATCH --cpus-per-task=4    # Set the number of CPUs for task (change to number of CPU/threads utilized) [-p dev -q dev limited to 30 CPUs per node]
#SBATCH --mem=24GB           # Set to a value ~10-20% greater than max amount of memory the job will use (or ~6 GB per core, for dev) (limited to 180 GB per node on dev partition)
#SBATCH --time=8:00:00       # Set the max time limit (dev partition/QoS has 8 hr limit)

###-----load modules if needed-------###
module load singularity

###-----run script below this line-------###



Example

Arrays are a useful extension to sbatch jobs. For example, each array task can pull its own line of input from a list:

NEWVAR=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" /home/${USER}/my_important_list.txt)

Array examples
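
A minimal sketch of an array script and its submission (the script and list file names are hypothetical; the resource values are examples):

#!/bin/bash
#SBATCH --job-name=MY_ARRAY_JOB
#SBATCH --partition=compute
#SBATCH --qos=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=1:00:00
#SBATCH --array=1-10         # Run 10 tasks; SLURM_ARRAY_TASK_ID takes the values 1..10

# Each task pulls its own input line from the list file, as shown above
SAMPLE=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" /home/${USER}/my_important_list.txt)
echo "Processing ${SAMPLE}"

Submitting the array and capturing the job ID in a file with output redirection:

# --parsable prints only the job ID, which we redirect into a file
sbatch --parsable my_array_job.sh >> submitted_jobs.txt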

Key Points

  • sbatch allows you to schedule jobs that will execute as soon as the requested resources are available.

  • Example headers can be found at https://github.com/TheJacksonLaboratory/slurm-templates

  • Common slurm batch file extensions include .sh and .slurm

  • Save the job ID into a file with output redirection


13 Job Statistics and Histories

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What types of historical information are available for my jobs?

Objectives
  • Understand how to use seff and sacct

Use seff to view information about a particular job.
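
For example (<JobID> is a placeholder):

seff <JobID>

seff summarizes the CPU and memory efficiency of a completed job, which helps you right-size the --cpus-per-task and --mem requests of future jobs.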

Use sacct to view more information for more jobs:

sacct -u ${USER} -S 2022-06-17 -oUser,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,CPUTime,nodelist

Key Points

  • sacct shows details on memory utilization (MaxRSS), but only for completed jobs.


14 Globus for Data Transfers

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What is Globus?

  • When do I need to use Globus?

Objectives
  • Understand what Globus is used for.

Globus

Key Points

  • Use Globus to transfer files to and from HPC.

  • Use Globus to transfer files to and from Tier2.