Submitting a job

For those familiar with GridEngine, the Slurm documentation provides a Rosetta Stone of schedulers to ease the transition.

Slurm commands

Slurm allows requesting resources and submitting jobs in a variety of ways. The main Slurm commands to submit jobs are:

  • srun
    • Requests resources and runs a command on the allocated compute node(s)

    • Blocking: will not return until the command ends

  • sbatch
    • Requests resources and runs a script on the allocated compute node(s)

    • Asynchronous: returns as soon as the job is submitted (see the examples below)
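
For example, running the same work both ways might look like this (a minimal sketch, with illustrative resource values):

$ srun --ntasks=1 --cpus-per-task=1 hostname   # blocks until hostname finishes, prints to your terminal
$ sbatch myfirstjob.sh                         # returns immediately with a job id

Both commands accept most of the same resource options (--ntasks, --cpus-per-task, --mem, etc.).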

Tip

Slurm Basics

  • Job

A Job is an allocation of resources (CPUs, RAM, time, etc.) reserved for the execution of a specific process:

  • The allocation is defined in the submission script as the number of Tasks (--ntasks) multiplied by the number of CPUs per Task (--cpus-per-task) and corresponds to the maximum resources that can be used in parallel,

  • The submission script, via sbatch, creates one or more Job Steps and manages the distribution of Tasks on Compute Nodes.

  • Tasks

A Task is a process to which the resources defined in the script via the --cpus-per-task option are allocated. A Task can use these resources like any other process (creating threads or sub-processes, which may themselves be multi-threaded).

This is the Job’s resource allocation unit. CPUs not used by a Task are lost: they cannot be used by any other Task or Step. If a Task creates more processes/threads than it has allocated CPUs, these threads will share the allocation.

  • Job Steps

A Job Step represents a stage, or section, of the processing performed by the Job. It executes one or more Tasks via the srun command. This division into Job Steps offers great flexibility in organizing the work within the Job and in managing and analyzing the allocated resources (see the sketch after this list):

  • Steps can be executed sequentially or in parallel,

  • one Step can initiate one or more Tasks, executed sequentially or in parallel,

  • Steps are tracked by the sstat/sacct commands, allowing both Step-by-Step progress tracking of a Job during its execution and detailed resource usage statistics for each Step (during and after execution).

For a single task inside a submission script, using srun is not mandatory.

  • Partition

A Partition is a logical grouping of Compute Nodes. This grouping makes it possible to specialize and optimize each partition for a particular type of job.

See Computing resources and Clusters/Partitions overview for more details.
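
To make these notions concrete, here is a minimal sketch of a Job divided into two Job Steps (prog_a and prog_b are hypothetical placeholders; the resource values are only illustrative):

#!/bin/bash
#
#SBATCH --job-name=steps_demo
#SBATCH --ntasks=4            # 4 Tasks...
#SBATCH --cpus-per-task=2     # ...of 2 CPUs each: 8 CPUs allocated to the Job

# Step 1: a single Task
srun --ntasks=1 --cpus-per-task=2 ./prog_a

# Step 2: 4 Tasks in parallel, 2 CPUs each
srun --ntasks=4 --cpus-per-task=2 ./prog_b

Each srun call creates a Job Step (numbered $JOBID.0, $JOBID.1, and so on), which sstat/sacct report separately.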

Job script

To run a job on the system you need to create a submission script (or job script, or batch script). This script is a regular shell script (bash) with some directives specifying the number of CPUs, memory, etc., that will be interpreted by the scheduling system upon submission.

  • A very simple example:

#!/bin/bash
#
#SBATCH --job-name=test

hostname -s
sleep 60s

Writing submission scripts can be tricky; see more in Batch scripts. See also our repository of example scripts.
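
As a slightly more complete sketch, here are some commonly used directives (the values are only examples to adapt to your needs, and my_program is a hypothetical executable):

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G              # memory per node
#SBATCH --time=01:00:00       # walltime limit (hh:mm:ss)

./my_program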

First job

Submit your job script with:

$ sbatch myfirstjob.sh
Submitted batch job 623

Slurm returns a $JOBID if the job is accepted, or an error message otherwise. Without any output options, the job’s output defaults to slurm-$JOBID.out (slurm-623.out in the above example), written in the submission directory.
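
If you prefer your own file names, the --output and --error directives override this default; %j expands to the job id (a minimal sketch):

#SBATCH --output=myjob_%j.out   # standard output
#SBATCH --error=myjob_%j.err    # standard error (goes to the same file as --output if omitted)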

Once submitted, the job enters the queue in the PENDING (PD) state. When resources become available and the job has sufficient priority, an allocation is created for it and it moves to the RUNNING (R) state. If the job completes correctly, it goes to the COMPLETED state, otherwise, its state is set to FAILED.

Tip

You can submit jobs from any login node to any partition. Login nodes are only segregated for build (CPU µarch) and scratch access.
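
For instance, the target partition can be chosen either on the command line or in the script itself (Lake is one of the partitions appearing in the examples on this page; use the partition that fits your job):

$ sbatch --partition=Lake myfirstjob.sh

or, equivalently, with an #SBATCH --partition=Lake line in the script.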

Monitor your jobs

You can monitor your job using either its name (#SBATCH --job-name) or its $JOBID with Slurm’s squeue [1] command:

$ squeue -j 623
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  623        E5     test ltaulell  R       0:04      1 c82gluster2

By default, squeue shows every pending and running job. You can filter to only your own jobs using the -u $USER or --me option:

$ squeue --me
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  623        E5     test ltaulell  R       0:04      1 c82gluster2

If needed, you can modify the output of squeue [1]. Here’s an example (adding CPUs to the default output):

$ squeue --me --format="%.7i %.9P %.8j %.8u %.2t %.10M %.6D %.4C %N"
  JOBID PARTITION     NAME     USER ST       TIME  NODES CPUS NODELIST
  38956      Lake     test ltaulell  R       0:41      1    1 c6420node172

Useful bash aliases:

alias pending='squeue --me --states=PENDING --sort=S,Q --format="%.10i %.12P %.8j %.8u %.6D %.4C %.20R %Q %.19S"'  # my pending jobs
alias running='squeue --me --states=RUNNING --format="%.10i %.12P %.8j %.8u %.2t %.10M %.6D %.4C %R %.19e"'  # my running jobs
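
squeue also honors the SQUEUE_FORMAT environment variable, so a preferred default output format can be set once, e.g. in your ~/.bashrc (a sketch reusing the format string from the example above):

export SQUEUE_FORMAT="%.7i %.9P %.8j %.8u %.2t %.10M %.6D %.4C %N"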

Analyzing currently running jobs

The sstat [3] command allows users to easily pull up status information about their currently running jobs. This includes CPU usage, task and node information, resident set size (RSS), and virtual memory (VM) usage. You can invoke the sstat command as follows:

$ sstat --jobs=$JOB_ID

By default, sstat pulls up significantly more information than is usually needed. To remedy this, you can use the --format flag to choose what you want in the output. See the format flag in man sstat.

Some relevant variables are listed in the table below:

Variable     Description

avecpu       Average CPU time of all tasks in the job.
averss       Average resident set size of all tasks in the job.
avevmsize    Average virtual memory size of all tasks in the job.
jobid        The id of the Job.
maxrss       Maximum resident set size of all tasks in the job.
maxvsize     Maximum virtual memory size of all tasks in the job.
ntasks       Number of tasks in the job.

For example, let’s print out a job’s id, CPU time, max RSS, and number of tasks:

sstat --jobs=$JOB_ID --format=jobid,cputime,maxrss,ntasks
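
To follow a running job over time, such a command can be wrapped in watch (a simple usage sketch; replace $JOB_ID with the actual job id):

$ watch -n 60 sstat --jobs=$JOB_ID --format=jobid,avecpu,maxrss,ntasks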

You can obtain more detailed information about a job using Slurm’s scontrol [2] command. This can be very useful for troubleshooting.

$ scontrol show jobid $JOB_ID

$ scontrol show jobid 38956
JobId=38956 JobName=test
UserId=ltaulell(*****) GroupId=psmn(*****) MCS_label=N/A
Priority=8628 Nice=0 Account=staff QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:08 TimeLimit=8-00:00:00 TimeMin=N/A
SubmitTime=2022-07-08T12:00:20 EligibleTime=2022-07-08T12:00:20
AccrueTime=2022-07-08T12:00:20
StartTime=2022-07-08T12:00:22 EndTime=2022-07-16T12:00:22 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-08T12:00:22
Partition=Lake AllocNode:Sid=x5570comp2:446203
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c6420node172
BatchHost=c6420node172
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=385582M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=385582M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/ltaulell/tests/env.sh
WorkDir=/home/ltaulell/tests
StdErr=/home/ltaulell/tests/slurm-38956.out
StdIn=/dev/null
StdOut=/home/ltaulell/tests/slurm-38956.out
Power=
NtasksPerTRES:0
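
Once the job has finished, sstat no longer applies; the sacct command mentioned above provides the equivalent accounting for each Step of a completed job (a minimal sketch; the format fields shown are just a reasonable starting point):

$ sacct --jobs=38956 --format=JobID,JobName,Elapsed,NCPUS,MaxRSS,State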