Cluster Management
Introduction to SLURM for Job Scheduling
SLURM (Simple Linux Utility for Resource Management) is a widely used workload manager for clusters, designed to allocate resources efficiently and manage job queues. As a cluster workload manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. SchedMD provides a full set of documentation on SLURM; check out their quick start guide and download their cheat sheet before you begin. This page is a short introductory guide to submitting and managing jobs with SLURM on the ReD Environment.
Basic SLURM Concepts
Node: A virtual server (an AWS EC2 instance) within the cluster.
Partition: A logical grouping of nodes of the same EC2 instance type, functioning similarly to a queue.
Job: A user-submitted task for resource allocation and execution.
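To see which partitions and nodes exist on a cluster, you can use the standard sinfo command:
sinfo
sinfo lists each partition along with its node counts, node states, and time limits; see the sinfo man page for details.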
Accessing SLURM
Due to security requirements, the typical SSH connection to a login node is not available for the ParallelCluster.
Open a terminal within Open OnDemand from the Cluster tab and select _CGC-AWS Shell Access.
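Submitting Jobs
Once you have a shell, batch jobs are submitted with the standard sbatch command. The script below is a minimal sketch; the job name, partition (urcdtest-med, borrowed from the example output later on this page), resource requests, and workload are placeholders to adjust for your own jobs.
#!/bin/bash
#SBATCH --job-name=dd-test        # name shown by squeue (placeholder)
#SBATCH --partition=urcdtest-med  # substitute a partition available to you
#SBATCH --nodes=1                 # number of nodes to allocate
#SBATCH --ntasks=16               # number of tasks (CPU cores) to allocate
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)

srun hostname                     # replace with your actual workload
Save the script (for example, as dd-test.job) and submit it with:
sbatch dd-test.job
sbatch prints the JOBID assigned to the job, which is what the monitoring and cancellation commands below operate on.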
Monitoring Jobs
To check the status of your active jobs:
squeue
This command lists jobs that are pending, running, or completing; by default it shows jobs from all users (see the filtering examples after this list). Important fields include:
- JOBID: Unique job identifier.
- PARTITION: Partition the job is running in.
- NAME: Name of the job.
- ST: Job state (e.g., CF for configuring, PD for pending, R for running).
- TIME: Runtime duration.
- NODES: Number of nodes the job occupies.
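To narrow the output, squeue accepts standard filter options; the JOBID below is hypothetical:
squeue -u $USER       # show only your own jobs
squeue -j 214         # show a specific job
squeue -t PENDING     # show only pending jobs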
Note: You can customize the output of the squeue command by setting the environment variable SQUEUE_FORMAT.
For example, add the following line to your ~/.bashrc:
export SQUEUE_FORMAT="%.18i %.18P %.8j %.16u %.8T %.10M %.9l %.6C %.6D %R"
This results in the following output:
[syockel_test@ip-10-3-2-176 dd-test]$ squeue
JOBID PARTITION NAME USER STATE TIME TIME_LIMI CPUS NODES NODELIST(REASON)
216 urcdtest-med dd.job syockel_test CONFIGUR 4:02 UNLIMITED 16 1 urcdtest-med-dy-urcdtest-med-cr-0-1
214 urcdtest-med dd.job syockel_test RUNNING 5:01 UNLIMITED 16 1 urcdtest-med-st-urcdtest-med-cr-0-1
See the squeue man page for a full set of options and field definitions.
Viewing Job Details
For more detailed job information, use the scontrol command:
scontrol show job JOBID
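For example, to pull a few fields for job 214 from the earlier output (any JOBID works), you could filter on the standard scontrol field names:
scontrol show job 214 | grep -E 'JobState|RunTime|NodeList'
Note that scontrol only reports on jobs that are still queued, running, or very recently finished; for older jobs, use sacct as described below.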
Canceling Jobs
To cancel a job:
scancel JOBID
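scancel also accepts standard filter options, for example:
scancel -u $USER              # cancel all of your jobs
scancel -u $USER -t PENDING   # cancel only your pending jobs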
Checking Job Utilization
SLURM maintains a rich accounting database of previously run jobs. This is a great way to check how long a job ran, how much memory it used, and so on. Like squeue, sacct can display many different fields. By default, sacct provides only limited information:
[syockel_test@ip-10-3-2-176 ~]$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
214              dd.job urcdtest-+                    16  COMPLETED      0:0
214.batch         batch                               16  COMPLETED      0:0
214.extern       extern                               16  COMPLETED      0:0
215              dd.job urcdtest-+                     0     FAILED      1:0
As with squeue, you can add an environment variable for sacct to your ~/.bashrc to enhance its output. For example, add:
export SACCT_FORMAT="jobid,user,alloccpus,reqmem,maxrss,TotalCPU,start,end,Elapsed"
This produces the following output:
[syockel_test@ip-10-3-2-176 ~]$ sacct
JobID            User  AllocCPUS     ReqMem     MaxRSS   TotalCPU               Start                 End    Elapsed
------------ --------- ---------- ---------- ---------- ---------- ------------------- ------------------- ----------
214          syockel_+         16     31129M             07:08.715 2025-01-15T21:20:10 2025-01-15T21:27:22   00:07:12
214.batch                      16               197124K  07:08.713 2025-01-15T21:20:10 2025-01-15T21:27:22   00:07:12
214.extern                     16                  764K  00:00.001 2025-01-15T21:20:10 2025-01-15T21:27:22   00:07:12
215          syockel_+          0      3891M              00:00:00 2025-01-15T21:20:25 2025-01-15T21:20:25   00:00:00
MaxRSS is the maximum memory actually used. In job 214, roughly 31G was requested, but only about 197M was consumed, which means a much smaller instance type could have been used to conserve resources (and cost). See the sacct man page for a full set of options and field definitions.
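You can also restrict sacct to a specific job or time window using its standard -j, -S, and -E options; the JOBID and dates below come from the example above:
sacct -j 214                          # accounting records for a single job
sacct -S 2025-01-15 -E 2025-01-16     # jobs that ran within a time window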
Conclusion
SLURM is a robust and flexible tool that helps efficiently manage compute resources in cluster environments. By mastering the basic commands and concepts outlined above, you can effectively run your computational tasks. For more information, explore the SLURM documentation for detailed guidance and advanced functionalities.