09 Sumhpc Slurm Configuration

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How is Sumhpc configured with Slurm?

  • How do partitions, QoS, and user-set parameters work together?

Objectives
  • Understand how partitions, QoS, and user-set parameters work together

A Layered Model of Slurm Configuration (built from the bottom up)

| Layer   | Slurm term          | Example(s)                | Notes                                                            |
|---------|---------------------|---------------------------|------------------------------------------------------------------|
| Layer 4 | User-set parameters | -N, -n, -c, --mem, --time | The user defines what they need for a particular job             |
| Layer 3 | GRES                | -                         | Optional; GPU clusters only for now, not needed for intro-to-hpc |
| Layer 2 | QoS                 | batch, long, dev          | Per-user limits set by cluster administration                    |
| Layer 1 | Partition           | compute, dev, high_mem    | The hardware (grouped by hardware properties)                    |
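
A job submission touches every layer at once. Below is a minimal sketch of a single srun command on Sumhpc; the executable name my_program is a placeholder:

    $ srun -p compute -q batch -N 1 -n 1 -c 4 --mem=8G --time=01:00:00 my_program

Here -p selects the partition (Layer 1), -q selects the QoS (Layer 2), and the remaining flags are the user-set parameters of Layer 4.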

Sumhpc’s Partitions

|                        | compute     | high_mem    | dev           |
|------------------------|-------------|-------------|---------------|
| Number of nodes        | 100         | 2           | 20            |
| Usable memory per node | 754 GB      | 3022 GB*    | 180 GB        |
| Usable cores per node  | 70          | 142         | 30            |
| CPU core speed         | 2.70 GHz    | 2.70 GHz    | 3.60 GHz      |
| Total partition memory | 76 TB       | 6 TB        | 3.6 TB        |
| Total partition cores  | 7000        | 284         | 600           |
| MaxTime                | 14-00:00:00 | 3-00:00:00  | 8:00:00       |
| How to use             | -p compute  | -p high_mem | -p dev -q dev |
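
The dev partition is paired with the dev QoS and suits short tests. A hedged example of a 30-minute interactive session (whether Sumhpc permits interactive shells this way is an assumption):

    $ srun -p dev -q dev -c 2 --mem=4G --time=00:30:00 --pty bash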

Sumhpc’s QoS

|                                     | batch             | long       | dev           |
|-------------------------------------|-------------------|------------|---------------|
| Max walltime per job                | 3 days            | 14 days    | 8 hours       |
| Default walltime (if not specified) | 1 hour            | 1 hour     | 1 hour        |
| Max CPUs per user                   | 700 cores         | 140 cores  | 60 cores      |
| Default CPUs (if not specified)     | 1 core            | 1 core     | 1 core        |
| Max memory per user                 | 7.6 TB            | 1 TB       | 360 GB        |
| Default memory (if not specified)   | 1 GB              | 1 GB       | 1 GB          |
| Max running jobs per user           | 700               | 10         | 60            |
| How to use                          | -q batch          | -q long    | -p dev -q dev |
| Partitions allowed on               | compute, high_mem | compute, * | dev           |

* The long QoS is currently allowed on the high_mem partition, but high_mem only allows jobs to run for 3-00:00:00 (72 hours), so just use the batch QoS there. Also, the long QoS is limited to 1 TB of RAM.
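
Following the note above, a large-memory job would pair the high_mem partition with the batch QoS. A sketch, where bigmem_job.sh is a hypothetical submission script:

    $ sbatch -p high_mem -q batch -c 16 --mem=1500G --time=2-00:00:00 bigmem_job.sh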

Important srun and sbatch parameters (these are set per job)

| Common name             | Flag | Long flag       | Definition/notes                                   |
|-------------------------|------|-----------------|----------------------------------------------------|
| CPUs per task           | -c   | --cpus-per-task | Usually set to the number of CPUs/threads required |
| Memory                  |      | --mem           | Sets the memory (RAM) requirement for the job      |
| Request specific nodes  | -w   | --nodelist      |                                                    |
| Partition               | -p   | --partition     |                                                    |
| QoS                     | -q   | --qos           |                                                    |
| Number of nodes per job | -N   | --nodes         | Usually set to 1 (may vary with advanced usage)    |
| Number of tasks         | -n   | --ntasks        | Usually set to 1 (may vary with advanced usage)    |
| Time limit              | -t   | --time          | Sets a limit on the job's run time                 |
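
For batch jobs, the same parameters are usually recorded as #SBATCH directives in a submission script rather than typed on the command line. A minimal sketch; the final command is a placeholder:

    #!/bin/bash
    #SBATCH -p compute      # Layer 1: partition
    #SBATCH -q batch        # Layer 2: QoS
    #SBATCH -N 1            # one node
    #SBATCH -n 1            # one task
    #SBATCH -c 8            # eight CPUs for that task
    #SBATCH --mem=16G       # 16 GB of RAM
    #SBATCH -t 04:00:00     # four-hour time limit

    my_program --input data.txt   # placeholder: replace with your actual command

Submit it with sbatch; Slurm reads the #SBATCH lines, so the flags do not need to be repeated on the command line.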

Key Points

  • Partitions are sets of nodes with specific hardware configurations.

  • QoS define per-user resource limits (it is a shared system).

  • Users also set specific parameters so the resources allocated by the scheduler match what is needed.

  • Queue wait times are shortest when a job's Slurm parameters closely match what the job actually needs.
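
The partition and QoS values in the tables above can be checked on the cluster itself with standard Slurm query commands; the exact output fields vary with the Slurm version:

    $ scontrol show partition compute                    # node counts, MaxTime, ...
    $ sacctmgr show qos format=Name,MaxWall,MaxTRESPU    # per-user QoS limits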