Overview
Teaching: 5 min
Exercises: 0 min
Questions
How do I know what is in the queue?
What are the types of queues available on JAX clusters?
How do I find my jobs in the queue?
Objectives
This lesson quickly introduces the various queues on the JAX HPC resources and shows users how to identify queues on their own.
The high performance computing (HPC) resources of The Jackson Laboratory are shared by all JAX researchers. To keep these resources available to everyone in a consistent and fair manner, a number of walltime-based queues have been implemented on them. These queues let the Information Technology (IT) department better plan maintenance and schedule upgrade windows on these systems, while providing a more consistent and stable operating environment for JAX HPC users.
In addition, inefficient use of these shared HPC resources has been observed several times, such as interactive jobs (up to and including jobs that use entire nodes interactively) held for long periods of time (~1,000 hours). The walltime-based queues will help alleviate these issues.
qstat -q
$ qstat -q
server: helix-master
Queue            Memory CPU Time Walltime Node   Run   Que Lm State
---------------- ------ -------- -------- ----   ---   --- -- -----
shortcut             --       --       --   --     0     0 --   E R
long                 --       -- 360:00:0   --     0     0 --   E R
high_mem             --       --       --   --     4    25 --   E R
htps                 --       --       --   --     0     0 --   E R
special              --       --       --   --     0     0 --   E R
training             --       --       --   --     0     0 --   E R
CLIA                 --       --       --   --     0     0 --   D S
test                 --       --       --   --     0     0 --   E R
short                --       -- 04:00:00   --     0     0 --   E R
batch                --       --       --   --    80  1982 --   E R
                                               ----- -----
                                                  84  2007
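To pull out just the queues with a walltime limit, or the queues that currently hold jobs, the listing can be parsed with a few lines of Python. This is a minimal sketch, not an official JAX tool: the function name `parse_qstat_q` and the dictionary keys are made up for illustration, and the text is pasted from the listing above rather than read from a live `qstat -q` call.

```python
# Rows copied from the `qstat -q` listing above; only whitespace differs.
SAMPLE = """\
server: helix-master
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
shortcut -- -- -- -- 0 0 -- E R
long -- -- 360:00:0 -- 0 0 -- E R
high_mem -- -- -- -- 4 25 -- E R
htps -- -- -- -- 0 0 -- E R
special -- -- -- -- 0 0 -- E R
training -- -- -- -- 0 0 -- E R
CLIA -- -- -- -- 0 0 -- D S
test -- -- -- -- 0 0 -- E R
short -- -- 04:00:00 -- 0 0 -- E R
batch -- -- -- -- 80 1982 -- E R
----- -----
84 2007
"""


def parse_qstat_q(text):
    """Return one dict per queue row found in `qstat -q` style text."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Queue rows have 10 fields with integer Run and Que counts;
        # this skips the header, the dashed rules, and the totals.
        if len(fields) != 10 or not (fields[5].isdigit() and fields[6].isdigit()):
            continue
        queues.append({
            "name": fields[0],
            "walltime": None if fields[3] == "--" else fields[3],
            "running": int(fields[5]),
            "queued": int(fields[6]),
            "state": " ".join(fields[8:10]),
        })
    return queues


queues = parse_qstat_q(SAMPLE)
# Queues that advertise a walltime limit in this listing.
limited = {q["name"]: q["walltime"] for q in queues if q["walltime"]}
# Queues with at least one running or queued job.
busy = [q["name"] for q in queues if q["running"] or q["queued"]]
print(limited)
print(busy)
```

Splitting on whitespace is enough here because no queue name in the listing contains a space. A different scheduler version may print different columns, so the field positions should be checked against the header before reuse.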
qstat -Q
$ qstat -Q
Queue            Max  Tot Ena Str Que Run  Hld Wat Trn Ext T Cpt
---------------- --- ----  --  -- --- ---  --- --- --- --- - ---
shortcut           0    0 yes yes   0   0    0   0   0   0 E   0
long               0    0 yes yes   0   0    0   0   0   0 E   0
high_mem           0   29 yes yes  25   4    0   0   0   0 E   0
htps               0    0 yes yes   0   0    0   0   0   0 E   0
special            0    0 yes yes   0   0    0   0   0   0 E   0
training           0    0 yes yes   0   0    0   0   0   0 E   0
CLIA               0    0  no  no   0   0    0   0   0   0 E   0
test               0    0 yes yes   0   0    0   0   0   0 E   0
short              0    0 yes yes   0   0    0   0   0   0 E   0
batch              0 2066 yes yes 322  80 1660   0   0   0 E   4
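The `qstat -Q` listing can be read the same way. The sketch below, with the hypothetical helper `parse_qstat_Q`, turns each row into a dictionary keyed by the column headers and then reports which queues are not both enabled (`Ena`) and started (`Str`). The rows are pasted from the listing above. The check that `Tot` equals the sum of the per-state counts is only an observation about this sample, not a documented guarantee.

```python
COLUMNS = ["Queue", "Max", "Tot", "Ena", "Str", "Que", "Run",
           "Hld", "Wat", "Trn", "Ext", "T", "Cpt"]
STATE_COUNTS = ["Que", "Run", "Hld", "Wat", "Trn", "Ext", "Cpt"]

# Rows copied from the `qstat -Q` listing above; only whitespace differs.
SAMPLE_Q = """\
Queue Max Tot Ena Str Que Run Hld Wat Trn Ext T Cpt
---------------- --- ---- -- -- --- --- --- --- --- --- - ---
shortcut 0 0 yes yes 0 0 0 0 0 0 E 0
long 0 0 yes yes 0 0 0 0 0 0 E 0
high_mem 0 29 yes yes 25 4 0 0 0 0 E 0
htps 0 0 yes yes 0 0 0 0 0 0 E 0
special 0 0 yes yes 0 0 0 0 0 0 E 0
training 0 0 yes yes 0 0 0 0 0 0 E 0
CLIA 0 0 no no 0 0 0 0 0 0 E 0
test 0 0 yes yes 0 0 0 0 0 0 E 0
short 0 0 yes yes 0 0 0 0 0 0 E 0
batch 0 2066 yes yes 322 80 1660 0 0 0 E 4
"""


def parse_qstat_Q(text):
    """Return one dict per queue row, keyed by the column headers."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # The header and the dashed rule also have 13 fields, so a row is
        # recognised by the yes/no value in the Ena column.
        if len(fields) != len(COLUMNS) or fields[3] not in ("yes", "no"):
            continue
        row = dict(zip(COLUMNS, fields))
        for key in ["Max", "Tot"] + STATE_COUNTS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


rows = parse_qstat_Q(SAMPLE_Q)
# Queues that are disabled or stopped in this listing.
unavailable = [r["Queue"] for r in rows
               if not (r["Ena"] == "yes" and r["Str"] == "yes")]
# In this sample, Tot matches the sum of the per-state job counts.
consistent = all(r["Tot"] == sum(r[k] for k in STATE_COUNTS) for r in rows)
print(unavailable, consistent)
```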
Short Queue (short)
Interactivity allowed. No job arrays allowed.
100 processors, 4 hours. This queue encompasses ~95% of all jobs completed on both Cadillac & Helix and allows users to submit a number of short-runtime, high-resource jobs to the JAX HPC clusters. This queue allows limited interactivity for script development and troubleshooting, while the minimal walltime limits the amount of computational resources a single user can monopolize.
Medium Queue (medium or batch)
No interactivity allowed. Job arrays allowed.
300 processors, 48 hours. This queue encompasses ~99.75% of all jobs completed on both Cadillac & Helix and allows users to submit a large number of medium-runtime, high-resource jobs to the JAX HPC clusters. This queue does not allow interactivity. Users should use the interactivity available in the short queue or test their submissions on the appropriate cluster development nodes (helix-dev.jax.org or cadillac-dev.jax.org).
As part of the initial transition to these queues, this queue may be configured as the new batch queue, or the current batch queue may be configured to route jobs specifying the batch queue to this medium-sized queue. Jobs that do not specify a queue will by default be run in the medium or batch queue.
Long Queue (long)
No interactivity allowed. Job arrays allowed.
20 processors, 168 hours. This queue is where long-runtime, low-resource jobs should be run. Together with the short and medium queues, this queue encompasses ~99.95% of all jobs completed on both Cadillac & Helix.
Special / Reserved Queue (special)
The special queue is reserved for special requests by researchers in conjunction with the IT department. Users should consult with IT if their jobs require more than 1 week to complete or have other needs that the above queues do not address. IT may ask for output showing that your job exceeded the walltime in the long queue, or for other evidence that the job cannot function within the existing queue configurations.
The three queues above (that is, all queues except special) allow a single user to utilize 300 processors of both Cadillac (~31.3%) and Helix (~17.4%) exclusively for their jobs at any one time. Additionally, each user will still be allowed to have up to 1,000 jobs total (combined running and queued) in these queues (1,000 total, not 1,000 per queue) at any given time.
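The limits described for the short, medium, and long queues can be written down as a small lookup that suggests a queue for a planned job. This is only an illustrative sketch: `QueueLimits` and `suggest_queue` are hypothetical names, the processor and hour figures are taken from the queue descriptions above, and treating them as upper bounds for a single request is an assumption. The scheduler itself remains the authority on what it accepts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueueLimits:
    name: str
    max_procs: int      # processor figure quoted for the queue
    max_hours: int      # walltime figure quoted for the queue
    interactive: bool   # interactive jobs allowed?
    job_arrays: bool    # job arrays allowed?


# Figures as stated in the queue descriptions above.
QUEUES = [
    QueueLimits("short", 100, 4, interactive=True, job_arrays=False),
    QueueLimits("medium", 300, 48, interactive=False, job_arrays=True),
    QueueLimits("long", 20, 168, interactive=False, job_arrays=True),
]


def suggest_queue(procs: int, hours: float,
                  interactive: bool = False,
                  job_array: bool = False) -> Optional[str]:
    """Return the first queue whose stated limits fit the request."""
    for q in QUEUES:
        if procs > q.max_procs or hours > q.max_hours:
            continue
        if interactive and not q.interactive:
            continue
        if job_array and not q.job_arrays:
            continue
        return q.name
    # Nothing fits: per the text above, this is a case to discuss with IT.
    return None


print(suggest_queue(8, 2))
print(suggest_queue(16, 100))
```

Queues are tried from shortest to longest walltime, so a request that fits several queues lands in the one with the smallest limit. A result of `None` corresponds to the situations the text routes to the special queue.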
qstat
The qstat (or Queue Status) command shows the current state of the jobs on the system, both running and queued.
Jobs are represented by JobIDs in the following format:
XXXXXXXX.helix-master.jax.org
where XXXXXXXX is a numeric identifier.
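When JobIDs are collected in a script, it can help to split them into the numeric part and the server name. The sketch below assumes the plain `number.server` form described above. The helper name `split_job_id` and the example number `12345678` are made up for illustration, and other ID shapes (for example job array IDs) are not handled.

```python
import re

# Digits, a dot, then a host name made of letters, digits, dots, and hyphens.
JOB_ID = re.compile(r"^(?P<number>\d+)\.(?P<server>[A-Za-z0-9.-]+)$")


def split_job_id(job_id):
    """Split 'number.server' into (int, str); raise ValueError otherwise."""
    match = JOB_ID.match(job_id)
    if match is None:
        raise ValueError(f"not a JobID of the form number.server: {job_id!r}")
    return int(match.group("number")), match.group("server")


# 12345678 is a made-up identifier used only for this example.
number, server = split_job_id("12345678.helix-master.jax.org")
print(number, server)
```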
When used with the -u userid option, qstat shows only the jobs submitted to the queue by a particular user. When used with the -f option, it prints the full information about each job listed.
Exercise: qstat
On the cluster, use qstat to see the list of all jobs being managed by the queue system, then pick a userid and look at all jobs in the queue from that particular user.
Examples:
qstat
qstat -u ssander
qstat -f -u ssander
qstat XXXXXXXX.helix-master.jax.org
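The same commands can be assembled from a script. The following is a minimal sketch, not part of the lesson's tooling: `qstat_args` and `run_qstat` are hypothetical helper names, and only the `-u` and `-f` options and the JobID argument shown in the examples above are used. `run_qstat` returns `None` on machines where `qstat` is not installed.

```python
import shutil
import subprocess


def qstat_args(user=None, full=False, job_id=None):
    """Build the argument list for the qstat examples above."""
    args = ["qstat"]
    if full:
        args.append("-f")
    if user:
        args += ["-u", user]
    if job_id:
        args.append(job_id)
    return args


def run_qstat(**kwargs):
    """Run qstat and return its text output, or None if qstat is missing."""
    if shutil.which("qstat") is None:
        return None
    result = subprocess.run(qstat_args(**kwargs),
                            capture_output=True, text=True, check=False)
    return result.stdout


# The userid is the one used in the examples above.
print(qstat_args(user="ssander", full=True))
```

Passing the arguments as a list avoids shell quoting issues when a userid or JobID comes from user input.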
Key Points