Cluster Management
Introduction to SLURM for Job Scheduling
SLURM (Simple Linux Utility for Resource Management) is a widely used workload manager for clusters, designed to allocate resources efficiently and manage job queues. As a cluster workload manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. SchedMD provides a full set of documentation on SLURM; check out their quick start guide and download their cheat sheet before you begin. This page is a short introductory guide to submitting and managing jobs with SLURM on the ReD Environment.
Basic SLURM Concepts
Node: A virtual server (an AWS EC2 instance) within the cluster.
Partition: A logical grouping of nodes of the same EC2 instance type, functioning similarly to a queue.
Job: A user-submitted task for resource allocation and execution.
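To see which partitions and nodes exist on a cluster, you can use the standard sinfo command:
sinfo
sinfo lists each partition along with its node counts, node states, and time limits; see the sinfo man page for details.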
Accessing SLURM
Due to security requirements, the typical SSH connection to a login node is not available for the ParallelCluster.
Open a terminal within Open OnDemand from the Cluster tab and select _CGC-AWS Shell Access.
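Submitting Jobs
Once you have a shell, batch jobs are submitted with the standard sbatch command. The script below is a minimal sketch; the job name, partition (urcdtest-med, borrowed from the example output later on this page), resource requests, and workload are placeholders to adjust for your own jobs.
#!/bin/bash
#SBATCH --job-name=dd-test        # name shown by squeue (placeholder)
#SBATCH --partition=urcdtest-med  # substitute a partition available to you
#SBATCH --nodes=1                 # number of nodes to allocate
#SBATCH --ntasks=16               # number of tasks (CPU cores) to allocate
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)

srun hostname                     # replace with your actual workload
Save the script (for example, as dd-test.job) and submit it with:
sbatch dd-test.job
sbatch prints the JOBID assigned to the job, which is what the monitoring and cancellation commands below operate on.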
Monitoring Jobs
To check the status of your active jobs:
squeue
This command lists jobs that are pending, running, or completing; by default it shows jobs from all users (see the filtering examples after this list). Important fields include:
- JOBID: Unique job identifier.
- PARTITION: Partition the job is running in.
- NAME: Name of the job.
- ST: Job state (e.g., CF for configuring, PD for pending, R for running).
- TIME: Runtime duration.
- NODES: Number of nodes the job occupies.
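To narrow the output, squeue accepts standard filter options; the JOBID below is hypothetical:
squeue -u $USER       # show only your own jobs
squeue -j 214         # show a specific job
squeue -t PENDING     # show only pending jobs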
Note: You can customize the output of the squeue command by setting the environment variable SQUEUE_FORMAT.
For example, add the following line to your ~/.bashrc:
export SQUEUE_FORMAT="%.18i %.18P %.8j %.16u %.8T %.10M %.9l %.6C %.6D %R"
This results in the following output:
[syockel_test@ip-10-3-2-176 dd-test]$ squeue
JOBID PARTITION NAME USER STATE TIME TIME_LIMI CPUS NODES NODELIST(REASON)
216 urcdtest-med dd.job syockel_test CONFIGUR 4:02 UNLIMITED 16 1 urcdtest-med-dy-urcdtest-med-cr-0-1
214 urcdtest-med dd.job syockel_test RUNNING 5:01 UNLIMITED 16 1 urcdtest-med-st-urcdtest-med-cr-0-1
See the squeue man page for a full set of options and field definitions.
Viewing Job Details
For more detailed job information, use the scontrol command:
scontrol show job JOBID
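For example, to pull a few fields for job 214 from the earlier output (any JOBID works), you could filter on the standard scontrol field names:
scontrol show job 214 | grep -E 'JobState|RunTime|NodeList'
Note that scontrol only reports on jobs that are still queued, running, or very recently finished; for older jobs, use sacct as described below.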
Canceling Jobs
To cancel a job:
scancel JOBID
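scancel also accepts standard filter options, for example:
scancel -u $USER              # cancel all of your jobs
scancel -u $USER -t PENDING   # cancel only your pending jobs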
Checking Job Utilization
SLURM maintains a rich accounting database of previously run jobs. This is a great way to check how long a job ran, how much memory it used, and so on. Like squeue, sacct can display many different fields. By default, sacct provides only limited information:
[syockel_test@ip-10-3-2-176 ~]$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
214              dd.job urcdtest-+                    16  COMPLETED      0:0
214.batch         batch                               16  COMPLETED      0:0
214.extern       extern                               16  COMPLETED      0:0
215              dd.job urcdtest-+                     0     FAILED      1:0
As with squeue, you can add an environment variable for sacct to your ~/.bashrc to enhance its output. For example, add:
export SACCT_FORMAT="jobid,user,alloccpus,reqmem,maxrss,TotalCPU,start,end,Elapsed"
This produces the following output:
[syockel_test@ip-10-3-2-176 ~]$ sacct
JobID            User  AllocCPUS     ReqMem     MaxRSS   TotalCPU               Start                 End    Elapsed
------------ --------- ---------- ---------- ---------- ---------- ------------------- ------------------- ----------
214          syockel_+         16     31129M             07:08.715 2025-01-15T21:20:10 2025-01-15T21:27:22   00:07:12
214.batch                      16               197124K  07:08.713 2025-01-15T21:20:10 2025-01-15T21:27:22   00:07:12
214.extern                     16                  764K  00:00.001 2025-01-15T21:20:10 2025-01-15T21:27:22   00:07:12
215          syockel_+          0      3891M              00:00:00 2025-01-15T21:20:25 2025-01-15T21:20:25   00:00:00
MaxRSS is the maximum memory actually used. In job 214, roughly 31G was requested, but only about 197M was consumed, which means a much smaller instance type could have been used to conserve resources (and cost). See the sacct man page for a full set of options and field definitions.
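You can also restrict sacct to a specific job or time window using its standard -j, -S, and -E options; the JOBID and dates below come from the example above:
sacct -j 214                          # accounting records for a single job
sacct -S 2025-01-15 -E 2025-01-16     # jobs that ran within a time window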
Conclusion
SLURM is a robust and flexible tool that helps efficiently manage compute resources in cluster environments. By mastering the basic commands and concepts outlined above, you can effectively run your computational tasks. For more information, explore the SLURM documentation for detailed guidance and advanced functionalities.