01 Welcome and Introductions
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Who is providing this training?
Objectives
Introduction to the instructors and helpers.
Welcome and Introductions
Welcome to this introduction to HPC training.
Our HPC environment is like your digital lab space. Depending on what you are doing and who you’re doing it with, we have a few different lab spaces (or “clusters”) for you to work in.
Cluster | Use case |
---|---|
Sumhpc | General Purpose Research |
Winhpc | GPU-based Research |
Other | Scientific Services Use Only |
This training will cover:
- How to connect to our HPC environment
- HPC architecture
- Types of software used
- Slurm basics
- Initial job profiling
What about when I have questions later?
We are going to direct you to lots of places to gather information.
- Tutorials
- Manpages
- Community channels
- Cheat sheets
Knowledge Checks
Throughout the training, we will ask you a few knowledge questions. Don't think of them as a pop quiz; instead, think of them like this:
- They highlight the take-home message
- In-the-moment feedback for you on what you do or don't understand
- In-the-moment feedback for us on whether we're being clear
- An opportunity to be interactive
Key Points
Welcome
Overview of training
Resources Available
02 Connecting to (and navigating in) the JAX HPC environment
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How do I connect to a remote system?
What do I do on the command line?
Objectives
Connect to the cluster via terminal and SSH
SSH Basics
The take-home from this lesson:
- Use the command
ssh <username>@<remote server IP address|domain name>
to connect to a remote server.
The details of how ssh works:
- The (local) client asks the remote server (sumhpc login node) to establish a connection.
- The server and the client work together to encrypt the contents of data transferred through the connection.
Example ssh connection for this class
- Open a terminal.
- Type
ssh <username>@<login node name>
where the user name is your username on the remote system and the login node name is provided by the instructor.
- Notice that the first time you connect to a new machine you may get asked the following question (example listed below).
[user@edu-vm-63d1410f-2 ~]$ ssh user@35.227.70.171
The authenticity of host '35.227.70.171 (35.227.70.171)' can't be established.
ECDSA key fingerprint is SHA256:UTHv5IOvrF9uvxuh9Fo8uW2bx0BwCRLyrwHhONoiIj8.
ECDSA key fingerprint is MD5:15:bb:25:2a:3a:45:f4:c7:df:21:26:37:12:66:79:77.
Are you sure you want to continue connecting (yes/no)?
- Now you're connected to the remote system; commands issued at the prompt are executed on the remote system.
- You are now on the login node; notice the change in the prompt.
- Do not run programs on the login nodes. We use the login node to access the compute nodes with srun and to submit batch jobs with sbatch.
- Let's take a look around the directories.
- Take a look at the home directory:
ls ~
- Take a look at the fastscratch directory:
ls /fastscratch
- Take a look at the projects directory:
ls /projects
- You can remain logged in for the rest of the tutorial, but when you're ready to exit, just issue the exit command at the prompt.
SFTP
SFTP stands for Secure File Transfer Protocol and is used to transfer files from one computer to another.
MobaXterm will automatically establish an SFTP session when you SSH into a remote system. It is visible as a yellow globe on the left-hand side of the MobaXterm window.
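If you are not using MobaXterm, a command-line SFTP session works as sketched below (the host name and file names are placeholders, not real endpoints):
sftp <username>@<login node name>      # open an SFTP session to the login node
sftp> put local_file.txt               # upload a file from your computer
sftp> get remote_file.txt              # download a file from the cluster
sftp> ls                               # list files on the remote side
sftp> exit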
Key Points
ssh is the preferred way to connect to Linux servers
ssh is secure and protects information sent
sftp is available as a command-line application and in GUI-based applications for file transfers
02 From Computer to HPC
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What are the components of a computer?
What is the difference between a regular computer and an HPC?
Objectives
Overview of General Computing
Computer Component Review
- CPU: CPUs are the data processing units; they are composed of multiple cores. For legacy reasons, software often refers to the number of cores as the number of CPUs, which can be confusing.
- RAM (a.k.a. MEMORY): RAM is fast digital storage. Most programs use RAM for access to data needed more than once. RAM is generally non-persistent: when the computer is powered off, the contents of RAM are lost.
- DISK: Disk is persistent digital storage that is not as fast as RAM. Disk storage can be made up of one or more drives, such as hard disk drives (HDD) and/or solid state drives (SSD). Multiple disks can be configured together for increased performance and protection against drive failure.
- NETWORKING: Switches, together with the network interface cards inside computers, allow computers to be networked together.
- GPU: A Graphics Processing Unit (GPU) is a computer component capable of rendering graphics; GPUs are also useful for certain mathematical calculations.
Consumer Computer vs Servers vs HPC vs Sumhpc
Component | Home/Business Computer | Server | Typical Individual Node in HPC | Typical Total HPC System | Individual Node on Sumhpc | Total Sumhpc System |
---|---|---|---|---|---|---|
CPU (cores) | 4 - 8 | 12 - 128 | 32 - 128 | 1000s | 70* | 7,000 |
RAM(GB) | 8 -16 | 64 - 960 | 240 - 3000 | 64,000 | 754 - 3TB | 76.8 TB |
DISK (TB) | .5 - 1 TB | 8 - 100 | None - 1 TB | 100s (Networked) | NA | 2.7 PB |
Networking (Gbe) | .1 - 1 | 1 - 10 | 40 - 100 | 40 - 100 | 40 | 40 + |
* Note: Sumhpc high-mem nodes contain 142 cores each and ~3 TB of RAM.
HPCs are servers networked together and managed with a scheduler
Key Points
HPCs are typically many large servers networked together
HPCs utilize networked disk space instead of local disk space
03 HPC Architecture Overview
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How is an HPC machine configured?
Objectives
Overview of HPC architecture
HPC architecture overview
General
Cluster
A cluster is a group of computers networked together.
Login Node
The login node is where you connect to the cluster and submit jobs.
Administrative Nodes
The administrative nodes manage cluster scheduling and other admin tasks; users do not log in to these.
Compute Nodes
The compute nodes are where the computational tasks are carried out. Sumhpc includes 100 Supermicro X11DPT-B series servers, each with 70 computable Intel Xeon Gold 6150 cores at 2.7 GHz and 768 GB RAM. Sumhpc also includes 2 high-mem nodes with 142 Intel Xeon Gold 6150 cores at 2.7 GHz and 3 TB RAM available per node for workloads that exceed 768 GB RAM.
Cluster Accessible Storage
Working storage (Tier 0, /fastscratch) on Sumhpc is provided by a Data Direct Networks Gridscalar GS7k GPFS storage appliance with 522 TB usable storage capacity.
Primary storage (Tier 1) on Sumhpc is provided by 27 Dell EMC Isilon scale-out NAS nodes, combined for a total of 2.7 PB raw storage capacity. Your home directory (50 GB capacity) is mounted on Tier 1, as are lab project folders.
Other storage (Tier 2) provides 2 PB of non-computable capacity; it is non-HPC storage and is not accessible from HPC resources. Access to Tier 2 is through the Globus application.
Archival Storage:
Archival storage (Tier 3) is provided at both Bar Harbor, ME and Farmington, CT by 2 Quantum Artico StorNext appliances with 72 TB front-end disk capacity backed by a 4 PB tape library at each site. Data storage at this tier is replicated across both geographic sites.
Network Infrastructure:
Our network platform supports scientific and enterprise business systems, using 40 Gb core switches in our server farm and delivering at least 1 Gb to user devices. The environment includes wired and wireless network service and a redundant voice over IP (VoIP) system, and is protected by application firewalls. The network infrastructure for the HPC environment comprises a dedicated 100 Gb backbone with 50 Gb to each server in the cluster and 40 Gb to each storage node. Internet service is delivered by a commercial service provider and can scale beyond 10 Gb as demand for data transfer increases.
Fall Cluster
The Fall cluster includes 8 Supermicro X11DGQ series servers with 46 Intel Xeon Gold 6136 cores at 3.00 GHz and 192 GB RAM each. Each server includes 4 NVIDIA Tesla V100 32 GB GPU NVMe cards. This translates into 249.6 TFLOPS of double-precision floating-point, 502.4 TFLOPS of single-precision, and 4,000 Tensor TFLOPS of combined peak performance.
Key Points
Log in to the login nodes and submit job requests to the scheduler
HPC utilizes various tiers of storage optimized for different usage types
04 What is a Job
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What is a job?
Objectives
Understand the difference between running jobs locally and on an HPC system.
Understand what function a job serves in the HPC environment.
A job
- Accessing HPC cluster resources requires requesting them from a scheduler.
- Our scheduler is Slurm; other institutions may use a different scheduler.
- A successful request (via srun or sbatch) to the scheduler creates a job (a minimal sketch of both request styles follows this list).
- This job is appended to the queue and is executed once resources become available (sometimes immediately).
- The job is then run on the allocated resources (a shell script for sbatch, or an interactive terminal for srun).
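As a minimal sketch (the script name below is a placeholder for your own batch script), the two request styles look like this:
# interactive job: the scheduler gives you a shell on a compute node
srun --pty bash

# batch job: the scheduler runs the script for you once resources are available
sbatch my_script.sh   # my_script.sh is a placeholder name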
Key Points
Understand the difference between running jobs locally and on an HPC system.
Understand what function a job serves in the HPC environment.
06 Directories
Overview
Teaching: 20 min
Exercises: 0 min
Questions
What does the directory structure look like in the Sumhpc environment?
Where do I have space to store files?
What is the fastscratch and why is it erased?
How do I transfer files into and out of Sumhpc?
Are there additional storage locations?
Objectives
Know the 3 writeable locations on Sumhpc: /home/${USER}, /fastscratch, /projects
Know files are removed from /fastscratch after 10 days.
Know that additional storage is located on Tier2 and Tier3 (archive), but these are not directly accessible on the HPC system.
Know Globus is used to transfer data into and out of the HPC environment.
What you will need to know today
- 3 main directories for working on Sumhpc: /home/${USER}, /fastscratch, /projects
- 4 Tiers are utilized for data management (2 of these are accessible via HPC)
We utilize 4 Tiers of storage, each optimized for a specific role.
Tier | Mounted as | Backups? | Notes |
---|---|---|---|
Tier0 | /fastscratch | None | shared space for temporary computational use |
Tier1 | /home & /projects | No* | |
Tier2 | Not Mounted to HPC | Yes** | Used for storing “cooler” data, not meant for HPC computations |
Tier3 | Not Mounted to HPC | Yes** | Not accessible via Globus. Service desk ticket required for archival and retrieval. |
* Tier1 is not backed up but does have a hidden .snapshot directory for recovery of deleted files within 7 days (this is not a true backup).
** Tiers 2 & 3 have cross-site replication between BH and CT.
- Globus is used to transfer data between Tier2 and Tier1.
We have 3 general locations (directories) we use for the HPC
- /home/${USER} is the user's home directory; every user is given 50 GB for their home directory.
- /fastscratch is a general working area for HPC data processing; data in this area is removed after 10 days. Total capacity is 150 TB.
- /projects is the location where faculty store the data and programs they use in their research programs; access to a PI's folders in /projects requires PI approval.
Details and Suggestions for Directory Management for Your Review.
These are some suggestions for directory management
Directory Structure Best Practice
Time spent at the beginning of the project to define folder hierarchy and file naming conventions will make it much easier to keep things organized and findable, both throughout the project and after project completion. Adhering to well-thought-out naming conventions:
- helps prevent accidental overwrites or deletion
- makes it easier to locate specific data files
- makes collaborating on the same files less confusing
File naming best practices
Include a few pieces of descriptive information in the filename, in a standard order, to make it clear what the file contains. For example, filenames could include:
- experiment name or acronym
- researcher initials
- date data collected
- type of data
- conditions
- file version
- file extension for application-specific files
Consider sort order:
If it is useful for files to stay in chronological order, a good convention is to start file names with YYYYMMDD or YYMMDD.
If you are using a sequential numbering system, use leading zeros to maintain sort order, e.g. 007 will sort before 700.
Do not use special (i.e. non-alphanumeric) characters in names, such as:
" / \ : * ? ' < > [ ] { } ( ) & $ ~ ! @ # % ^ , '
These could be interpreted by programs or operating systems in unexpected ways.
Do not use spaces in file or folder names, as some operating systems will not recognize them and you will need to enclose them in quotation marks to reference them in scripts and programs. Alternatives to spaces in filenames:
- Underscores, e.g. file_name.xxx
- Dashes, e.g. file-name.xxx
- No separation, e.g. filename.xxx
- Camel case, where the first letter of each section of text is capitalized, e.g. FileName.xxx
- Keep names short, no more than 25 characters.
File Versioning Best Practices
File versioning ensures that you always understand what version of a file you are working with, and what are the working and final versions of files. Recommended file versioning practices:
- Include a version number at the end of the file name such as v01. Change this version number each time the file is saved.
- For the final version, substitute the word FINAL for the version number.
- Take advantage of the versioning capabilities available in collaborative workspaces such as GitHub, OSF, Google Drive, and Box.
- Track versions of computer code with versioning software such as Git, Subversion, or CVS.
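For code, a minimal Git workflow looks like the sketch below (the file name is a placeholder, purely for illustration):
git init                      # start tracking the current directory
git add analysis_script.R     # stage a file (placeholder name)
git commit -m "Initial version of analysis script"
git log --oneline             # view the version history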
Directory Structure Best Practices
Directories can be organized in many different ways. Consider what makes sense for your project and research team, and how people new to the project might look for files.
Once you determine how you want your directories to be organized, it is a good idea to stub out an empty directory structure to hold future data, and to document the contents of each directory in a readme file.
Directory Best Practices
- Organize directories hierarchically, with broader topics at the top level of the hierarchy and more specific topics lower in the structure.
- Group files of similar information together in a single directory.
- Name directories after aspects of the project rather than after the names of individual researchers.
- Once you have decided on a directory structure, follow it consistently and audit it periodically.
- Separate ongoing and completed work.
Sources
http://guides.lib.umich.edu/datamanagement/files
Key Points
The 3 writeable locations on Sumhpc: /home/${USER}, /fastscratch, /projects
Files are removed from /fastscratch after 10 days.
Faculty must approve access to their /projects directory.
Good directory management can be helpful for organization.
Globus is used to transfer data into and out of the HPC environment (see lesson 14)
07 Slurm Basics
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What is Slurm?
Objectives
Introduction to basic Slurm commands and parameters
General Slurm Commands and Parameters
Usage | Slurm Commands |
---|---|
submit job | srun, sbatch, salloc |
view cluster status | squeue, sinfo |
view cluster config | scontrol show partition, sacctmgr |
view job information (history) | sacct, seff, sstat |
cancel job | scancel <JobID> |
Slurm documentation can be found here:
https://slurm.schedmd.com/documentation.html
Usage: Submitting Jobs to Slurm
- srun is the command for requesting an interactive job.
- sbatch is the command for requesting a batch job.
- salloc is another command for requesting an interactive job.
Usage: Viewing the current status of cluster
- squeue shows the whole queue:
squeue
- squeue -u shows your jobs in the queue:
squeue -u $USER
- sinfo shows the cluster partitions and node utilization:
sinfo
Usage: View cluster config
The commands below have multiple parts; the whole command within the quotes should be executed.
- 'scontrol show partition' shows the partitions:
scontrol show partition
- 'sacctmgr show qos format="name%-20,maxjobspu%-12,maxtrespu%-30,maxwall%-20"' shows the QoS (quality of service) settings:
sacctmgr show qos format="name%-20,maxjobspu%-12,maxtrespu%-30,maxwall%-20"
Usage: View job information
- sacct
- seff
- sstat
Usage: Cancel job
- scancel <JobID> cancels a job
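For example (the job ID below is a placeholder):
scancel 123456      # cancel one job by its job ID
scancel -u $USER    # cancel all of your own pending and running jobs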
Common Slurm Terms and Parameters for srun and sbatch
Common Name | - flag | --flag | Definition/notes |
---|---|---|---|
Account | -A | --account | |
Array (sbatch only) | -a | --array | execute multiple jobs with one job request |
CPUsPerTask | -c | --cpus-per-task | Usually set to number of CPUs/threads req. |
Dependency | -d | --dependency | Make job execution dependent on previous job |
Pass Variable | --export | Pass BASH variable to job | |
Print help | -h | --help | |
Hold | -H | --hold | Queue job but do not run |
JobName | -J | --job-name | Set job name |
Memory | --mem | Set memory (RAM) requirement for job | |
Request specific nodes | -w | --nodelist | |
Number of nodes per job | -N | --nodes | Usually set to 1 (may vary w/advanced usage) |
Number tasks per node | -n | --ntasks | Usually set to 1 (may vary w/advanced usage) |
Output | -o | --output | |
Partition | -p | --partition | |
QOS | -q | --qos | |
Time Limit | -t | --time | Set limit on job run time |
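Putting several of these flags together, a hypothetical interactive request might look like this (all values are examples only, not recommendations):
srun -p compute -q batch -N 1 -n 1 -c 4 --mem 16GB --time 2:00:00 --job-name test_job --pty bash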
Slurm Terms and Parameters for sacct
Common Name | - flag | --flag | Definition/notes |
---|---|---|---|
JobID | -j | --jobs | Display information about a job or jobs |
Output Format | -o | --format | |
Output Format help | -e | --helpformat | Print a list of fields for the --format option |
Output Format long style | -l | --long | Long format option |
Jobs included after | -S | --starttime | |
Jobs included before | -E | --endtime | |
UserID | -u | --uid | |
State of Jobs | -s | --state | |
Nodes | -N | --nodelist | Print jobs run on these nodes |
Useful sacct format fields
Common Name | Description |
---|---|
AllocCPUS | |
AllocNodes | |
CPUTime | |
Elapsed | |
JobID | The job's JobID |
JobName | Job Name |
MaxRSS | Max Memory Used for job |
UserID | |
NCPUS | |
Partition | Partition requested |
QOS | QOS requested |
ReqMem | MEM requested |
Start | Start time of job |
End | End time of job |
State | State of job |
ExitCode | Exit code from job |
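These field names plug directly into sacct's --format (-o) option. For example, to review run time and memory use for your own jobs since a given date (the date below is a placeholder):
sacct -u ${USER} -S 2022-06-01 -o JobID,JobName,Partition,QOS,ReqMem,MaxRSS,Elapsed,State,ExitCode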
Key Points
srun starts an interactive job
sbatch submits a batch job
srun and sbatch have similar parameters
scontrol and sacctmgr can show the Sumhpc partition and QoS configurations
sacct, seff, and sstat provide access to your job history and resource utilization
08 Sumhpc
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What is Sumhpc?
Objectives
Review of Sumhpc nodes
Sumhpc Basics
- Sets of nodes are called partitions within Slurm.
- We have 3 sets of nodes (a.k.a. partitions), each with distinct properties:
- compute: general compute nodes with 70 cores and 754 GB of RAM each.
- high_mem: for jobs with large RAM requirements (jobs requiring more than 754 GB of RAM); these nodes have 142 cores and almost 3 TB of RAM each.
- dev: dev nodes have faster processors but less memory than the compute and high_mem nodes; they have 30 cores and 180 GB of RAM each.
compute | high_mem | dev | |
---|---|---|---|
# of nodes | 100 | 2 | 20 |
Usable memory per node | 754GB | 3022GB* | 180GB |
Usable cores per node | 70 | 142 | 30 |
CPU core speed | 2.70GHz | 2.70GHz | 3.60GHz |
Total partition MEM | 76T | 6T | 3.6T |
Total partition cores | 7000 | 284 | 600 |
how to use? | -p compute | -p high_mem | -p dev & -q dev |
*Note: The max available memory on the high_mem nodes is 3022 GB, just shy of 3 TB (3072 GB binary unit); also, the --mem value is required to be an integer (no decimals).
**Note: To use the dev partition you must use both -p dev and -q dev when submitting jobs. See QoS.
Key Points
Sumhpc has 3 compute node types: compute, high_mem, and dev.
The compute node types refer to partitions in the Slurm configuration.
09 Sumhpc Slurm Configuration
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How is Sumhpc configured with Slurm?
How do partitions, QoS, and user set parameters work together?
Objectives
Understand how partitions, QoS, and user set parameters work together
A Layer Model of the Slurm Configuration (built from the bottom up)
Layer | SLURM TERM | Example(s) | Notes |
---|---|---|---|
Layer 4 | user-set parameters | -N, -n, -c, --mem, --time | the user defines what they need for a particular job |
Layer 3 | GRES | - | optional, GPU cluster only for now, not needed for intro-to-hpc |
Layer 2 | QoS | batch, long, dev | user limits set by cluster administration |
Layer 1 | Partition | compute, dev, high_mem | the hardware (based upon hardware properties) |
- Partition is the set of nodes you would like to run on.
- There is a partition named dev and a QoS named dev.
- QoS is a set of administrative limits on the job; you can pick which QoS (set of limits) works best for your job.
- GRES (optional, GPU cluster only for now, not needed for intro-to-hpc)
- User-provided configuration (-N/--nodes, -c)
Sumhpc’s Partitions
compute | high_mem | dev | |
---|---|---|---|
# of nodes | 100 | 2 | 20 |
Usable memory per node | 754GB | 3022GB* | 180GB |
Usable cores per node | 70 | 142 | 30 |
CPU core speed | 2.70GHz | 2.70GHz | 3.60GHz |
Total partition MEM | 76T | 6T | 3.6T |
Total partition cores | 7000 | 284 | 600 |
MaxTime | 14-00:00:00 | 3-00:00:00 | 8:00:00 |
how to use? | -p compute | -p high_mem | -p dev & -q dev |
Sumhpc’s QoS.
batch | long | dev | |
---|---|---|---|
Max Walltime per job | 3 days | 14 days | 8 hours |
Default Walltime (if not specified) | 1 hour | 1 hour | 1 hour |
Max CPU per user | 700 cores | 140 cores | 60 cores |
Default CPU (if not specified) | 1 core | 1 core | 1 core |
Max Memory per user | 7.6TB | 1TB | 360GB |
Default Memory (if not specified) | 1GB | 1GB | 1GB |
Max running jobs per user | 700 | 10 | 60 |
how to use? | -q batch | -q long | -p dev & -q dev |
Partitions Allowed On | compute, high_mem | compute, * | dev |
* The long QoS is currently allowed on high_mem, but the high_mem partition only allows jobs to run for 3-00:00:00 (72 hr), so just use the batch QoS there. Also, the long QoS is limited to 1 TB RAM.
- Only certain QoS's are allowed on specific partitions: the compute partition only allows the long and batch QoS, and the dev partition only allows the dev QoS. The examples below show valid partition/QoS flag pairs.
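In practice, those combinations translate into flag pairs like the following (core, memory, and time values here are placeholders; choose what your job actually needs):
srun -p compute -q batch -c 4 --mem 16GB --time 1-00:00:00 --pty bash   # batch QoS: up to 3 days
srun -p compute -q long -c 4 --mem 16GB --time 7-00:00:00 --pty bash    # long QoS: up to 14 days
srun -p dev -q dev -c 2 --mem 8GB --time 4:00:00 --pty bash             # dev QoS: up to 8 hours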
Important srun and sbatch parameters (these are set per job).
Common Name | - flag | --flag | Definition/notes |
---|---|---|---|
CPUsPerTask | -c | --cpus-per-task | Usually set to number of CPUs/threads req. |
Memory | --mem | Set memory (RAM) requirement for job | |
Request specific nodes | -w | --nodelist | |
Partition | -p | --partition | |
QOS | -q | --qos | |
Number of nodes per job | -N | --nodes | Usually set to 1 (may vary w/advanced usage) |
Number tasks per node | -n | --ntasks | Usually set to 1 (may vary w/advanced usage) |
Time Limit | -t | --time | Set limit on job run time |
- We usually set the number of nodes to 1 for each job, but we can run multiple jobs at a time.
Key Points
Partitions are sets of nodes with specific hardware configurations.
QoS sets limits on resources (it is a shared system).
Users also set specific parameters so the resources allocated by the scheduler match what is needed.
Job queue wait times are optimized when the Slurm configuration parameters are well matched to the job.
10 Interactive SRUN Jobs
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How do I run an interactive job with srun?
Objectives
Introduction to srun interactive jobs.
Introduction to setting slurm parameters.
Highlight how program resource parameters still need to be set.
Slurm srun Basics
To get an interactive node with srun (no MPI)
- The default srun request is for small tasks only (1 hr time limit, 1 core, 1 GB RAM, compute partition & batch QoS):
srun --pty bash
- General format for srun [do not execute without changing values]:
srun -p <partition> -q <QoS> -N <Number of Nodes> -n <Number of tasks> -c <cores> --time <time> --pty bash
- Single-core srun interactive job on the dev partition and dev QoS with 1 CPU and 8 GB RAM for 4 hrs:
srun -p dev -q dev -N 1 -n 1 -c 1 --mem 8GB --time 4:00:00 --pty bash
- Sing the srun jingle …
srun -p dev -q dev -N 1 -n 1 -c 1 --mem 10GB --pty bash
- Additional examples for srun can be found here:
https://github.com/TheJacksonLaboratory/slurm-templates
Hands-on srun example
- Run srun:
srun --pty bash
- Notice the prompt changed to a specific node and is no longer the login node; this means you're free to run programs (within your requested parameters).
- Two challenges to look around on the node.
Look at the node CPU information
Let's take a look at the CPU information on the node using the lscpu command:
$ lscpu
Result
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:              4
CPU MHz:               1199.871
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              5400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
- Note that you see all 72 CPUs (0-71) even though you only requested 1 CPU (core). This is why it is important that CPU-count parameters are set in scripts (e.g. R, Python) and programs in addition to the srun and sbatch number-of-CPUs parameter (-c). Some programs will look at the total number of CPUs, see all 72, and try to use all 72, even though you are not allocated all the CPUs. Setting the number of CPUs in the srun/sbatch parameters is not sufficient for limiting the number of CPUs programs will try to use. Using more CPUs than allocated will impact other users; any job seen doing this will be terminated.
Look at the Node Available RAM (Memory)
Let's take a look at the available memory on the node using the free -h command:
$ free -h
Result
              total        used        free      shared  buff/cache   available
Mem:           754G         16G        292G        789M        445G        734G
Swap:           63G        319M         63G
- Note that you see all the memory on the node but you only requested 1 GB; this is why it is important that you do not try to use more memory than requested. Setting the memory in the Slurm parameters is not sufficient for limiting the amount of memory a program will try to use. Using more memory than allocated will impact other jobs; any job seen doing this will be terminated.
- Now exit the interactive node with the exit command.
- Your prompt should change back to the login node.
Key Points
Set Slurm run parameters correctly.
Set your script/program's core and memory parameters correctly.
11 Software on Sumhpc
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How do we access software on Sumhpc?
Objectives
Identify where software can be installed
How to access low-level software such as libraries, compilers, Singularity, and programming language binaries with Environment Modules.
Identify how and when to use singularity.
Software Usage on the SumHPC Cluster
On Sumhpc, we have overhauled how researchers are able to install and easily access the software they need. Users are still able to download and install software into their userspace (/projects or /home directories). Additionally, low-level development tools such as gcc, openMPI, openJDK, and basic libraries are still available via the module system. The new and exciting way we are able to provide software on Sumhpc, however, is through software containers. Singularity containers are a cutting-edge way of creating your own custom software modules. Containers not only provide you with the software you need, but also a contained environment which ensures that your software runs the exact same way whether it's on your laptop, on the HPC resources, with a collaborator, or in the cloud.
Userspace Installation
The simplest method of installing software for use with Sumhpc is to install directly into your userspace. This includes any directories you have permission to write to and modify, such as your group's /projects directory or your personal /home directory. You also have write permissions on /fastscratch and /tmp, but these are not recommended due to their ephemeral nature. Most software installations will default to writing under a /usr/ or /lib/ path. Since these locations are shared by everyone, these "global" trees are not writable by the typical HPC user. Instead, many software installations will allow you to perform a "local" install in a directory of your choosing. Please consult the documentation for your software to find out how to change the install path.
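As a sketch only (the package name and install path below are placeholders; your software's own documentation is the authority), a "local" install usually means pointing the installer at a directory you can write to:
# Python package installed into your own user directory instead of the system tree
pip install --user numpy

# classic configure/make software installed under a lab /projects directory
./configure --prefix=/projects/my_lab/software/tool-1.0
make
make install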
Environment Modules
Now, only low-level software such as libraries, compilers, Singularity, and programming language binaries are available through modules. To see what modules are available, run the command module avail
. For more information about how to use the module
command, see the documentation.
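A few everyday module commands as a quick reference (module names vary; check module avail for what is actually installed on Sumhpc):
module avail               # list all available modules
module load singularity    # load a module into your environment
module list                # show the modules currently loaded
module unload singularity  # remove a loaded module
module purge               # remove all loaded modules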
Singularity Containers
The newest and most liberating change to software installations on HPC is the introduction of Singularity. Containerization allows you to install whatever software you want inside of a singular, stand-alone Singularity image file (.sif
). This file contains your software, custom environment, and metadata all in one.
To use Singularity, all you have to do is load the module by running module load singularity in a running job session on Sumhpc. Singularity can download new containers (singularity pull) and run existing ones (singularity run/exec/shell).
Software Hands-On Example
- First grab an interactive job with srun:
srun -p dev -q dev --time 2:00:00 --pty bash
- View the available modules with module avail:
module avail
- Let's load a module:
module load python36
- Let's check what version of Singularity is loaded by default:
singularity version
- Let's make a new directory (called 'newdir') on /fastscratch and navigate to it:
mkdir -p /fastscratch/${USER}/newdir
cd /fastscratch/${USER}/newdir
- Verify you're in the correct directory:
pwd
- View directory files:
ls
- Pull a docker image for samtools with Singularity:
singularity pull staphb_samtools_1.15.sif docker://staphb/samtools:1.15
- View the new sif file:
ls
- Run the samtools program (just print the help statement):
singularity exec staphb_samtools_1.15.sif samtools --help
- Run the samtools view program (just print the help statement):
singularity exec staphb_samtools_1.15.sif samtools view --help
- Link to the Docker Hub samtools source: https://hub.docker.com/r/staphb/samtools/tags
Key Points
Users are able to download and install software into their userspace (/projects or /home directories).
Low-level software such as libraries, compilers, Singularity, and programming language binaries is available through Environment Modules.
Singularity is a new and liberating change to software installations on HPC.
12 Batch Jobs with sbatch
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What are the advantages of sbatch over srun?
How do I run a sbatch job?
Where can I find example headers?
How do I format sbatch headers?
How do I capture the jobid?
Objectives
Know how to format a sbatch script.
Know how to run a sbatch job.
Know how to capture the jobid.
Understand the usefulness of slurm arrays.
sbatch basics
The Slurm sbatch header
#!/bin/bash
#SBATCH --job-name=MY_JOB # Set job name
#SBATCH --partition=dev # Set the partition
#SBATCH --qos=dev # Set the QoS
#SBATCH --nodes=1 # Do not change unless you know what you're doing (sets the number of nodes; do not change for non-MPI jobs)
#SBATCH --ntasks=1 # Do not change unless you know what you're doing (sets the number of tasks; do not change for non-MPI jobs)
#SBATCH --cpus-per-task=4 # Set the number of CPUs for task (change to number of CPU/threads utilized) [-p dev -q dev limited to 30 CPUs per node]
#SBATCH --mem=24GB # Set to a value ~10-20% greater than max amount of memory the job will use (or ~6 GB per core, for dev) (limited to 180 GB per node on dev partition)
#SBATCH --time=8:00:00 # Set the max time limit (dev partition/QoS has 8 hr limit)
###-----load modules if needed-------###
module load singularity
###-----run script below this line-------###
- Note the first line is the location of the bash executable.
- sbatch parameters are then set with the #SBATCH lines.
- Your code is entered in the blank space below 'run script below this line'.
- Common Slurm batch file extensions include .sh and .slurm.
- Slurm documentation for sbatch can be found at https://slurm.schedmd.com/sbatch.html .
- Sumhpc example headers can be found at https://github.com/TheJacksonLaboratory/slurm-templates .
Example
- Let's make a new directory (called 'adir') on fastscratch and navigate to it:
mkdir -p /fastscratch/${USER}/adir
cd /fastscratch/${USER}/adir
- Verify you're in the correct directory:
pwd
- Now grab the batch job templates:
wget https://github.com/TheJacksonLaboratory/slurm-templates/archive/refs/heads/main.zip
- Unzip the templates:
unzip main.zip
- Navigate into the new directory slurm-templates-main:
cd slurm-templates-main
- Use cat to view the template example slurm_template_02_compute_batch.sh:
cat slurm_template_02_compute_batch.sh
- Navigate into the workshop examples:
cd workshop_examples
- Use cat to view the workshop example slurm_00_sleep_dev_dev.sh:
cat slurm_00_sleep_dev_dev.sh
- Now, before we submit the script, let's see if we are running anything else:
squeue -u ${USER}
- Note that after we submit the script we can rerun squeue -u ${USER} to see its running state.
- Now submit the script with the sbatch command:
sbatch slurm_00_sleep_dev_dev.sh
- You will get a job number printed to the screen.
- Verify it is running with squeue -u ${USER}
- Now let's do it again, but this time redirect the job id to a file so we have the jobid for later (an alternative using sbatch --parsable is sketched after this list):
sbatch slurm_00_sleep_dev_dev.sh &> my_job_id.txt
- Notice nothing was printed to the screen.
- Verify it is running with squeue:
squeue -u ${USER}
- View the jobid with cat:
cat my_job_id.txt
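If you prefer to capture only the numeric job ID (without the "Submitted batch job" text), sbatch's --parsable option is one alternative; a brief sketch:
JOBID=$(sbatch --parsable slurm_00_sleep_dev_dev.sh)   # JOBID now holds just the job number
echo ${JOBID} > my_job_id.txt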
Arrays are a useful extension to sbatch jobs
- Arrays are extremely useful for parallelizing tasks that only change by one item, for example aligning multiple DNA fastq files with bwa.
- Arrays are scheduled like one job but turn into multiple jobs as determined by the --array parameter; every element in the array parameter gets its own job.
- Array jobs differ by the array element (a number) passed to the job script in the ${SLURM_ARRAY_TASK_ID} variable.
- The ${SLURM_ARRAY_TASK_ID} variable is used within the array's Slurm script to vary the code within the script's parameters (see the sketch after this list).
- A common array method is to select a line of text from a list file and assign it to a new variable:
NEWVAR=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" /home/${USER}/my_important_list.txt)
- The list should be one item per line and be Unix-formatted (LF line endings).
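Putting those pieces together, a minimal array script might look like the sketch below (the --array range, output text, and list file are placeholders; the workshop templates in the next section are the full, tested versions):
#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --partition=dev
#SBATCH --qos=dev
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB
#SBATCH --time=0:10:00
#SBATCH --array=1-5%2      # 5 array tasks, at most 2 running at once (the %2 is the throttle)

# pick the line from the list file that matches this task's index
SAMPLE=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" /home/${USER}/my_important_list.txt)

echo "Array task ${SLURM_ARRAY_TASK_ID} is processing: ${SAMPLE}"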
Array examples
- View the example list list_lib_names.txt:
cat list_lib_names.txt
- View the example sbatch array script slurm_01_array_dev_dev.sh:
cat slurm_01_array_dev_dev.sh
- View the content of our working directory:
ls
- Run the array job:
sbatch slurm_01_array_dev_dev.sh &> my_array_01_job_id.txt
- Verify it is running with squeue:
squeue -u ${USER}
- View the jobid with cat:
cat my_array_01_job_id.txt
- Note that the array only prints one jobid to the output, but many jobs were shown as running.
- View the array script output files:
ls
- View the output with cat:
cat array_run1_1_out_SRR2062637_file_names.txt
- View the example sbatch array script without a throttle, slurm_02_array_dev_dev.sh:
cat slurm_02_array_dev_dev.sh
- Run the array job:
sbatch slurm_02_array_dev_dev.sh &> my_array_02_job_id.txt
- Verify it is running with squeue:
squeue -u ${USER}
- View the output with cat:
cat array_run2_1_out_SRR2062637_file_names.txt
Key Points
sbatch allows you to schedule jobs that will execute as soon as the requested resources are available.
Example headers can be found at https://github.com/TheJacksonLaboratory/slurm-templates
Common Slurm batch file extensions include .sh and .slurm
Save the job ID into a file with output redirect
13 Job Statistics and Histories
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What types of historical information are available for my jobs?
Objectives
Understand how to use seff and sacct
Use seff to view information about a particular job.
seff usage: seff <jobid>
- seff requires a jobid.
- Get a previous jobid from the output file:
ls
- Run seff with the jobid:
seff <jobid>
Use sacct to view more information for more jobs:
sacct -u ${USER} -S 2022-06-17 -oUser,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,CPUTime,nodelist
Key Points
sacct shows details on memory utilization (MaxRSS), but only for completed jobs.
14 Globus for Data Transfers
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What is Globus?
When do I need to use Globus?
Objectives
Understand what Globus is used for.
Globus
- Globus is used to transfer data to and from the HPC environment.
- Globus is used to move data to and from Tier2.
- Please see the workshop video on how to use Globus (emailed after the workshop).
Key Points
Use Globus to transfer files to and from HPC.
Use Globus to transfer files to and from Tier2.