Submitting a job
================

For those familiar with GridEngine, the Slurm documentation provides a
`Rosetta Stone for schedulers `_ to ease the transition.

Slurm commands
--------------

:term:`Slurm` allows requesting resources and submitting jobs in a variety of
ways. The main Slurm commands to submit jobs are:

* srun

  * Requests resources and **runs a command** on the allocated compute node(s)
  * **Blocking**: will not return until the command ends

* sbatch

  * Requests resources and **runs a script** on the allocated compute node(s)
  * **Asynchronous**: will return as soon as the job is submitted

.. TIP::

   **Slurm Basics**

   .. _slurm_basics:

   * **Job**

     A Job is an allocation of resources (CPUs, RAM, time, etc.) reserved for
     the execution of a specific process:

     * The allocation is defined in the submission script as the number of
       Tasks (``--ntasks``) multiplied by the number of CPUs per Task
       (``--cpus-per-task``), and corresponds to the maximum resources that
       can be used in parallel,
     * The submission script, via ``sbatch``, creates one or more Job Steps
       and manages the distribution of Tasks on Compute Nodes.

   * **Tasks**

     A Task is a process to which the resources defined in the script via the
     ``--cpus-per-task`` option are allocated. A Task can use these resources
     like any other process (creating threads or sub-processes, possibly
     themselves multi-threaded). This is the Job's resource allocation unit.
     CPUs not used by a Task are **lost**: they cannot be used by any other
     Task or Step. If the Task creates more processes/threads than allocated
     CPUs, these threads will share the allocation.

   * **Job Steps**

     A Job Step represents a stage, or section, of the processing performed
     by the Job. It executes one or more Tasks via the ``srun`` command. This
     division into Job Steps offers great flexibility in the organization of
     the Job's stages and in the management, and analysis, of the allocated
     resources:

     * Steps can be executed sequentially or in parallel,
     * a Step can launch one or more Tasks, executed sequentially or in
       parallel,
     * Steps are tracked by the ``sstat/sacct`` commands, allowing both
       Step-by-Step progress tracking of a Job during its execution, and
       detailed resource usage statistics for each Step (during and after
       execution).

     Using ``srun`` for a single task, inside a submission script, is not
     mandatory.

   * **Partition**

     A Partition is a logical grouping of Compute Nodes. This grouping makes
     it possible to specialize and optimize each partition for a particular
     type of job. See :doc:`computing_resources` and
     :doc:`partitions_overview` for more details.

.. _job_script:

Job script
----------

To run a job on the system, you need to create a *submission script* (also
called a job script, or batch script). This script is a regular shell script
(bash) with some directives specifying the number of CPUs, memory, etc.,
which will be interpreted by the scheduling system upon submission.

A very simple example:

.. code-block:: bash

   #!/bin/bash
   #
   #SBATCH --job-name=test

   hostname -s
   sleep 60s

Writing submission scripts can be tricky; see :doc:`batch_scripts` for more
details. See also our `repository of examples scripts `_.

First job
---------

Submit your job script with:

.. code-block:: bash

   $ sbatch myfirstjob.sh
   Submitted batch job 623

If the job is accepted, :term:`Slurm` returns its ``$JOBID`` (623 in the
example above); otherwise it prints an error message. Without any option
about output, standard output and error default to ``slurm-$JOBID.out``
(``slurm-623.out`` in the example above), written in the submission
directory.
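If you prefer a different output file, the ``--output`` directive accepts
filename patterns such as ``%x`` (job name) and ``%j`` (job ID). Below is a
minimal sketch of a slightly fuller script; the resource values are purely
illustrative, not site recommendations:

.. code-block:: bash

   #!/bin/bash
   #
   #SBATCH --job-name=test
   #SBATCH --output=%x-%j.out    # %x = job name, %j = job ID (e.g. test-623.out)
   #SBATCH --ntasks=1            # one Task...
   #SBATCH --cpus-per-task=1     # ...with one CPU
   #SBATCH --time=00:10:00       # walltime limit (HH:MM:SS)

   hostname -s
   sleep 60s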
Once submitted, the job enters the queue in the *PENDING* (PD) state. When
resources become available and the job has sufficient priority, an allocation
is created for it and it moves to the *RUNNING* (R) state. If the job
completes correctly, it goes to the *COMPLETED* state; otherwise, its state
is set to *FAILED*.

.. TIP::

   **You can submit jobs from any login node to any partition. Login nodes
   are only segregated for build (CPU µarch) and scratch access.**

Monitor your jobs
-----------------

You can monitor your job using either its name (``#SBATCH --job-name``) or
its ``$JOBID`` with Slurm's ``squeue`` [#squeue]_ command:

.. code-block:: bash

   $ squeue -j 623
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
     623        E5     test ltaulell  R  0:04     1 c82gluster2

By default, ``squeue`` shows every pending and running job. You can filter to
your own jobs, using the ``-u $USER`` or ``--me`` option:

.. code-block:: bash

   $ squeue --me
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
     623        E5     test ltaulell  R  0:04     1 c82gluster2

If needed, you can modify the output of ``squeue`` [#squeue]_. Here's an
example (adding CPUs to the default output):

.. code-block:: bash

   $ squeue --me --format="%.7i %.9P %.8j %.8u %.2t %.10M %.6D %.4C %N"
   JOBID PARTITION     NAME     USER ST  TIME NODES CPUS NODELIST
   38956      Lake     test ltaulell  R  0:41     1    1 c6420node172

Useful bash aliases:

.. code-block:: bash

   alias pending='squeue --me --states=PENDING --sort=S,Q --format="%.10i %.12P %.8j %.8u %.6D %.4C %.20R %Q %.19S"'  # my pending jobs
   alias running='squeue --me --states=RUNNING --format="%.10i %.12P %.8j %.8u %.2t %.10M %.6D %.4C %R %.19e"'  # my running jobs

Analyzing currently running jobs
--------------------------------

The ``sstat`` [#sstat]_ command allows users to easily pull up status
information about their currently running jobs. This includes information
about **CPU usage**, **task information**, **node information**, **resident
set size (RSS)**, and **virtual memory (VM)**. You can invoke the ``sstat``
command as such:

.. code-block:: bash

   $ sstat --jobs=$JOB_ID

By default, ``sstat`` reports far more information than is usually needed. To
narrow the output, use the ``--format`` flag to choose the fields you want
(see the format section in ``man sstat``). Some relevant fields are listed in
the table below:

+-----------+----------------------------------------------------------+
| Field     | Description                                              |
+===========+==========================================================+
| avecpu    | Average CPU time of all tasks in the job.                |
+-----------+----------------------------------------------------------+
| averss    | Average resident set size of all tasks in the job.       |
+-----------+----------------------------------------------------------+
| avevmsize | Average virtual memory size of all tasks in the job.     |
+-----------+----------------------------------------------------------+
| jobid     | The ID of the job.                                       |
+-----------+----------------------------------------------------------+
| maxrss    | Maximum resident set size of all tasks in the job.       |
+-----------+----------------------------------------------------------+
| maxvsize  | Maximum virtual memory size of all tasks in the job.     |
+-----------+----------------------------------------------------------+
| ntasks    | Number of tasks in the job.                              |
+-----------+----------------------------------------------------------+

For example, let's print out a job's ID, average CPU time, maximum RSS, and
number of tasks:

.. code-block:: bash

   $ sstat --jobs=$JOB_ID --format=jobid,avecpu,maxrss,ntasks
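To follow these metrics while the job runs, ``sstat`` can be combined with
the standard ``watch`` utility. A minimal sketch, assuming ``$JOB_ID`` holds
the ID of one of your currently running jobs:

.. code-block:: bash

   # re-run the sstat query every 30 seconds (Ctrl-C to stop)
   $ watch -n 30 "sstat --jobs=$JOB_ID --format=jobid,avecpu,maxrss,ntasks"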
You can obtain more detailed information about a job using Slurm's
``scontrol`` [#scontrol]_ command. This can be very useful for
troubleshooting.

.. code-block:: bash

   $ scontrol show jobid $JOB_ID

   $ scontrol show jobid 38956
   JobId=38956 JobName=test
      UserId=ltaulell(*****) GroupId=psmn(*****) MCS_label=N/A
      Priority=8628 Nice=0 Account=staff QOS=normal
      JobState=RUNNING Reason=None Dependency=(null)
      Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
      RunTime=00:00:08 TimeLimit=8-00:00:00 TimeMin=N/A
      SubmitTime=2022-07-08T12:00:20 EligibleTime=2022-07-08T12:00:20
      AccrueTime=2022-07-08T12:00:20
      StartTime=2022-07-08T12:00:22 EndTime=2022-07-16T12:00:22 Deadline=N/A
      SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-08T12:00:22
      Partition=Lake AllocNode:Sid=x5570comp2:446203
      ReqNodeList=(null) ExcNodeList=(null)
      NodeList=c6420node172
      BatchHost=c6420node172
      NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
      TRES=cpu=1,mem=385582M,node=1,billing=1
      Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
      MinCPUsNode=1 MinMemoryNode=385582M MinTmpDiskNode=0
      Features=(null) DelayBoot=00:00:00
      OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
      Command=/home/ltaulell/tests/env.sh
      WorkDir=/home/ltaulell/tests
      StdErr=/home/ltaulell/tests/slurm-38956.out
      StdIn=/dev/null
      StdOut=/home/ltaulell/tests/slurm-38956.out
      Power=
      NtasksPerTRES:0

.. [#squeue] You can get the complete list of parameters by referring to the
   ``squeue`` manual page (``man squeue``).

.. [#scontrol] You can get the complete list of parameters by referring to the
   ``scontrol`` manual page (``man scontrol``).

.. [#sstat] You can get the complete list of parameters by referring to the
   ``sstat`` manual page (``man sstat``).
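.. TIP::

   **After the job ends**

   ``sstat`` only covers running jobs, and ``scontrol`` only knows about jobs
   that are pending, running, or recently finished. For completed jobs,
   ``sacct`` plays a similar role from the accounting database. A minimal
   sketch, not a full reference (see ``man sacct`` for all fields):

   .. code-block:: bash

      # per-step state, exit code, peak memory and elapsed time of a past job
      $ sacct --jobs=$JOB_ID --format=jobid,jobname,state,exitcode,maxrss,elapsed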