The Unix Shell

Queues on HPC

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • How do I know what is in the queue?

  • What are the types of queues available on JAX clusters?

  • How do I find my jobs in the queue?

Objectives
  • This lesson quickly introduces the various queues on the JAX HPC resources and shows users how to identify queues on their own.

The high-performance computing (HPC) resources of The Jackson Laboratory are shared by all JAX researchers. To keep them available to everyone in a consistent and fair manner, a number of walltime-based queues have been implemented on these resources. These queues let the Information Technology department better plan maintenance and schedule upgrade windows, while providing a more consistent and stable operating environment for JAX HPC users.

Additionally, inefficient use of these shared HPC resources has been observed on several occasions, such as interactive jobs (up to and including entire nodes used interactively) held open for long periods of time (~1,000 hours). The walltime-based queues will help alleviate these issues.

Identifying the Queues

qstat -q

$ qstat -q
server: helix-master

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
shortcut           --      --       --      --    0   0 --   E R
long               --      --    360:00:0   --    0   0 --   E R
high_mem           --      --       --      --    4  25 --   E R
htps               --      --       --      --    0   0 --   E R
special            --      --       --      --    0   0 --   E R
training           --      --       --      --    0   0 --   E R
CLIA               --      --       --      --    0   0 --   D S
test               --      --       --      --    0   0 --   E R
short              --      --    04:00:00   --    0   0 --   E R
batch              --      --       --      --   80 1982 --   E R
                                               ----- -----
                                                  84  2007
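The listing above can also be filtered from the command line. The following is a minimal sketch, not part of the original lesson: it assumes the column layout shown above (walltime in the fourth whitespace-separated field), and the function name queues_with_walltime is made up for illustration.

```shell
# Print "queue walltime" for every queue that has a walltime limit set.
# Reads `qstat -q`-style text on standard input.
queues_with_walltime() {
    # Field 4 is the Walltime column; queues without a limit show "--" there.
    awk '$4 ~ /^[0-9]+:/ { print $1, $4 }'
}

# Try it on three rows copied from the listing above.
# On the cluster you would run:  qstat -q | queues_with_walltime
queues_with_walltime <<'EOF'
long               --      --    360:00:0   --    0   0 --   E R
short              --      --    04:00:00   --    0   0 --   E R
batch              --      --       --      --   80 1982 --   E R
EOF
# prints:
# long 360:00:0
# short 04:00:00
```

Header, separator, and total lines are skipped because their fourth field does not start with digits followed by a colon.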

qstat -Q

$ qstat -Q
Queue              Max    Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T   Cpt
----------------   ---   ----    --    --   ---   ---   ---   ---   ---   --- -   ---
shortcut             0      0   yes   yes     0     0     0     0     0     0 E     0
long                 0      0   yes   yes     0     0     0     0     0     0 E     0
high_mem             0     29   yes   yes    25     4     0     0     0     0 E     0
htps                 0      0   yes   yes     0     0     0     0     0     0 E     0
special              0      0   yes   yes     0     0     0     0     0     0 E     0
training             0      0   yes   yes     0     0     0     0     0     0 E     0
CLIA                 0      0    no    no     0     0     0     0     0     0 E     0
test                 0      0   yes   yes     0     0     0     0     0     0 E     0
short                0      0   yes   yes     0     0     0     0     0     0 E     0
batch                0   2066   yes   yes   322    80  1660     0     0     0 E     4
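
In this listing, Tot is the sum of the per-state counts (for batch, 322 + 80 + 1660 + 4 = 2066). To see only the queues that currently hold jobs, the output can be filtered. This is a sketch under the assumption that the columns appear in the order shown above; the function name busy_queues is hypothetical.

```shell
# Print a one-line summary for every queue whose Tot column is greater than 0.
# Reads `qstat -Q`-style text on standard input.
busy_queues() {
    # Fields: 1=Queue 2=Max 3=Tot 4=Ena 5=Str 6=Que 7=Run 8=Hld
    awk '$3 ~ /^[0-9]+$/ && $3 > 0 {
        printf "%s queued=%s running=%s held=%s\n", $1, $6, $7, $8
    }'
}

# Try it on three rows copied from the listing above.
# On the cluster you would run:  qstat -Q | busy_queues
busy_queues <<'EOF'
long                 0      0   yes   yes     0     0     0     0     0     0 E     0
high_mem             0     29   yes   yes    25     4     0     0     0     0 E     0
batch                0   2066   yes   yes   322    80  1660     0     0     0 E     4
EOF
# prints:
# high_mem queued=25 running=4 held=0
# batch queued=322 running=80 held=1660
```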

  1. Short Queue (short)

    Interactivity allowed. No job arrays allowed.

    100 processors, 4 hours. This queue encompasses ~95% of all jobs completed on both Cadillac & Helix and allows users to submit a number of short-runtime, high-resource jobs to the JAX HPC clusters. This queue allows limited interactivity for script development and troubleshooting, while the minimal walltime limits the amount of computational resources a single user can monopolize.

  2. Medium Queue (medium or batch)

    No interactivity allowed. Job arrays allowed.

    300 processors, 48 hours. This queue encompasses ~99.75% of all jobs completed on both Cadillac & Helix and allows users to submit a large number of medium-runtime, high-resource jobs to the JAX HPC clusters. This queue does not allow interactivity. Users should instead use the interactivity available in the short queue, or test their submissions on the appropriate cluster development nodes (helix-dev.jax.org or cadillac-dev.jax.org).

    As part of the initial transition to these queues, this queue may be configured as the new batch queue, or the current batch queue may be configured to route jobs that specify the batch queue to this medium-sized queue. Jobs that do not specify a queue will run in the medium (batch) queue by default.

  3. Long Queue (long)

    No interactivity allowed. Job arrays allowed.

    20 processors, 168 hours. This queue is where long-runtime, low-resource jobs should be run. Together with the short and medium queues, it encompasses ~99.95% of all jobs completed on both Cadillac & Helix.

  4. Special / Reserved Queue (special)

    The special queue is reserved for special requests made by researchers in conjunction with the IT department. Users should consult with IT if their jobs require more than 1 week to complete or have other needs that the above queues do not address. You may be asked to provide output showing that your job exceeded the walltime in the long queue, or other evidence that the job cannot function within the existing queue configurations.

The short, medium, and long queues (that is, all of the above except special) allow a single user to use up to 300 processors on both Cadillac (~31.3% of the cluster) and Helix (~17.4%) exclusively for their jobs at any one time. Additionally, each user is still allowed up to 1,000 jobs in total (running and queued combined) across these queues at any given time: 1,000 total, not 1,000 per queue.
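
The walltime limits described above (4, 48, and 168 hours) suggest a simple rule for choosing a queue. The sketch below is illustrative only: the function name pick_queue is hypothetical, it assumes the medium queue is reachable under the name batch as in the qstat listings, and it uses the limits stated in the text rather than the Walltime column of the qstat -q output. It also ignores the other restrictions, such as job arrays not being allowed in short.

```shell
# Suggest a queue name for a job that needs the given number of whole hours.
pick_queue() {
    hours=$1
    if [ "$hours" -le 4 ]; then
        echo short      # up to 4 hours
    elif [ "$hours" -le 48 ]; then
        echo batch      # medium queue, up to 48 hours (name assumed)
    elif [ "$hours" -le 168 ]; then
        echo long       # up to 168 hours (1 week)
    else
        echo special    # more than 1 week: consult IT before submitting
    fi
}

pick_queue 2     # prints: short
pick_queue 12    # prints: batch
pick_queue 100   # prints: long

# Possible use with qsub (myjob.sh is a placeholder script name):
#   qsub -q "$(pick_queue 12)" -l walltime=12:00:00 myjob.sh
```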

qstat

The qstat (or Queue Status) command shows the current state of the jobs on the system, whether running, queued, or held.

Jobs are represented by JobIDs in the following format:

XXXXXXXX.helix-master.jax.org

where XXXXXXXX is a numeric identifier.
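
When scripting around job IDs it is sometimes handy to separate the numeric identifier from the server name. A minimal sketch using standard shell parameter expansion follows; the function name job_number and the ID 12345678 are made up for illustration.

```shell
# Print the numeric part of a job ID by removing everything from the first dot.
job_number() {
    printf '%s\n' "${1%%.*}"
}

job_number 12345678.helix-master.jax.org   # prints: 12345678
```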

With the -u userid option, qstat lists only the jobs submitted to the queue by that user. With the -f option, it displays the full information about those jobs.

Exercise: qstat

On the cluster, use qstat to see the list of all jobs being managed by the queue system. Then pick a userid and look at all the jobs in the queue that belong to that user.

Examples:

qstat
qstat -u ssander
qstat -f -u ssander
qstat XXXXXXXX.helix-master.jax.org

Key Points