In general, there are two methods for executing jobs within the cluster:
In the non-interactive approach, you create a script using bash, Python, or similar languages, define the job parameters, and submit it using the sbatch command.
In this scenario, Slurm reserves the necessary resources for your job and executes it on the compute nodes when the required resources become available.
With the interactive method, Slurm reserves computing resources for you and grants you shell access, allowing you to directly execute commands.
Before submitting any job for execution, it's advisable to check the overall status of the available resources and Slurm itself.
The command sinfo provides status information about the nodes in the various partitions (including the associated time limit for each partition).
The default partition is marked with an "*".
This information can be useful in deciding where to submit your job. Status codes (abbreviated form) are explained below:
Status code | Meaning |
---|---|
alloc | The node has been allocated to one or more jobs. |
mix | The node has some of its CPUs ALLOCATED while others are IDLE. |
idle | The node is not allocated to any jobs and is available for use. |
down | The node is down and unavailable for use. |
drain | The node is unavailable for use per system administrator request (for maintenance, etc.). |
drng | The node is being drained but is still running a user job; it will be marked as drained right after the user job finishes. Do not worry if you have a job running on a node in this state. |
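For a quick overview before submitting, a few common sinfo invocations can be combined with the state codes above (the partition name is simply the one used in the examples on this page and will differ on other clusters):
sinfo # list all partitions, their time limits and node states
sinfo -p gpu-a30 # restrict the view to a single partition
sinfo -s # print a condensed per-partition summary of node counts by state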
The main way to run jobs is by submitting a script with the sbatch command. The command to submit a job is as simple as:
sbatch myscript.sh
The commands specified in the myscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won’t stop if you disconnect from the cluster.
An sbatch script is just a script with additional #SBATCH directives to tell the scheduler about your job's requirements. A typical submission script, in this case myscript.sh running a program called myprog with arguments, will look like this:
#!/bin/bash
#SBATCH --partition=gpu-a30 # Request a specific partition
#SBATCH --ntasks=1 # Number of tasks (see below)
#SBATCH --cpus-per-task=2 # Number of CPU cores per task
#SBATCH --nodes=1 # Ensure that all cores are on one machine
#SBATCH --time=0-00:05 # Runtime in D-HH:MM
#SBATCH --gres=gpu:2 # Optionally, the type and number of GPUs
#SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hostname_%j.out # File to which STDOUT will be written
#SBATCH --error=hostname_%j.err # File to which STDERR will be written
#SBATCH --mail-type=END # Type of email notification - BEGIN,END,FAIL,ALL
#SBATCH --mail-user=user@abc.com # Email to which notifications will be sent
# Print info about the current job - makes debugging easier
scontrol show job $SLURM_JOB_ID
# Insert your commands here
./myprog $arg1 $arg2
Now the above script can be submitted as:
arg1=val1 arg2=val2 sbatch myscript.sh
# or by modifying options (e.g. the partition) on the command line instead:
arg1=val1 arg2=val2 sbatch -p gpu-a30 myscript.sh
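Alternatively (this is standard sbatch behaviour rather than anything specific to this cluster), anything you put after the script name is passed to the script as positional parameters, which the script reads as $1, $2, and so on:
sbatch myscript.sh val1 val2
# inside myscript.sh, the program would then be started as: ./myprog "$1" "$2"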
In general, the script is composed of three parts: the #!/bin/bash line, which allows the script to be run as a bash script (it could also be Python or another interpreter); the #SBATCH lines, which are technically bash comments but are interpreted as parameters by the SLURM scheduler; and the actual commands to run. The individual #SBATCH directives are explained below.
#SBATCH --partition=gpu-a30
Specifies the SLURM partition (a.k.a. queue) under which the script will be run.
#SBATCH --ntasks=1
Sets the number of tasks that you’re requesting. Make sure that your code can use multiple cores before requesting more than one. When running MPI code, --ntasks should be set to the number of parallel MPI programs you want to start (i.e., the same number you give as the -n parameter to mpirun). When running OpenMP code (i.e., without MPI), you do not need to set this option (if you do set it, set it to --ntasks=1). If this parameter is omitted, SLURM assumes --ntasks=1.
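As a sketch of the MPI case (the program name my_mpi_prog and the rank count are made up for illustration), the relevant lines of a four-rank job could look like:
#SBATCH --ntasks=4
#SBATCH --nodes=1
mpirun -n 4 ./my_mpi_prog # -n matches --ntasks
# or, equivalently, let Slurm launch the ranks itself:
# srun ./my_mpi_prog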
#SBATCH --cpus-per-task=2
Sets the number of CPU cores per task that you’re requesting. Make sure that your code can use multiple cores before requesting more than one. When running OpenMP code (with or without MPI), --cpus-per-task should be set to the value of the environment variable OMP_NUM_THREADS, i.e., the number of threads that you want to use. If this parameter is omitted, SLURM assumes --cpus-per-task=1.
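A minimal sketch of the matching OpenMP setup (my_openmp_prog is a placeholder name): Slurm exports the requested core count in SLURM_CPUS_PER_TASK, so it can simply be forwarded to OMP_NUM_THREADS:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_prog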
#SBATCH --nodes=1
Requests that the cores are all on one node. Only change this to more than 1 if you know your code uses a protocol like MPI. SLURM makes no assumptions about this parameter: if you request more than one task (--ntasks > 1) and you forget this parameter, your job may be scheduled across several nodes; and unless your job is MPI (multi-node) aware, it will run slowly, as it is oversubscribed on the master node and wasting resources on the other(s).
#SBATCH --time=5
Specifies the running time for the job in minutes. You can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be cancelled.
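For example, both of the following request the same five-minute limit; the first uses plain minutes, the second the D-HH:MM form used in the script above:
#SBATCH --time=5
#SBATCH --time=0-00:05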
#SBATCH --gres=gpu:a30:2
Requests the type and number of GPUs. Omitting the type, as in the script above (--gres=gpu:2), requests the given number of GPUs of any available type.
#SBATCH --mem=50G
Specifies the amount of memory that you will be using for your job. Default units are megabytes; different units can be specified using the suffix [K|M|G|T]. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool for one or more cores and is the recommended option to use. If you must do work across multiple compute nodes (e.g., MPI code), then you must use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you have requested, whether they are on one node or on multiple nodes. If this parameter is omitted, the smallest amount is allocated, usually 100 MB, and chances are good that your job will be killed as it will likely go over this amount.
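As a sketch (the sizes are arbitrary examples, not recommendations for this cluster; use one option or the other, not both):
#SBATCH --mem=50G # single-node job: 50 GB shared by all cores of the job
#SBATCH --mem-per-cpu=4G # multi-node MPI job: 4 GB for every allocated core, on whichever nodes they land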
#SBATCH --output=hostname_%j.out
Specifies the file to which standard output will be appended. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the job ID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory. The directory/folder where the file should be created must already exist; if it is missing, your job will fail within a second and will not show up in your job list. Please make sure this is not writing to the Home directory.
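For example, if you direct the log files into a subfolder (the name logs/ is purely illustrative), create it before submitting:
mkdir -p logs
sbatch myscript.sh # with e.g. #SBATCH --output=logs/hostname_%j.out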
#SBATCH --error=hostname_%j.err
Specifies the file to which standard error will be appended. SLURM submission and processing errors will also appear in this file. The %j in the filename will be substituted by the job ID at runtime. If this parameter is omitted, any error output will be directed to the file specified by the previous argument, --output. Please make sure this is not writing to the Home directory.
#SBATCH --mail-type=END
Because jobs are processed in the background and can take some time to run, it is useful to send an email message when the job has finished (--mail-type=END). Email can also be sent for other processing stages (BEGIN, FAIL) or for all of them (ALL).
#SBATCH --mail-user=user@abc.com
The email address to which the --mail-type messages will be sent.
If the command was executed successfully, the Job ID will be returned as follows:
Submitted batch job 65648.
It is a good idea to note down your Job IDs. It is also helpful to put
scontrol show job $SLURM_JOB_ID
at the beginning of your job scripts so that information about the job is printed to the out file. This will be useful for debugging in case the job fails.
There are several ways to interactively use compute nodes; please pay attention to the last option, as it can handle connection interruptions and resume shell sessions (the preferred way).
Once it starts, an interactive session blocks all of the requested resources until you exit it, even if they sit unused. To avoid unnecessary charges to your account, don't forget to exit an interactive session once you are finished, or specify a time limit for your interactive job so that it automatically ends once the limit is reached.
Before continuing, note that to access the compute nodes through the login node, you must either forward your SSH agent connection or use the login node as a jump host.
Here's how to do it:
[ubuntu@your_laptop ~]$ ssh -A username@l1.m3c.uni-tuebingen.de
# Then, SSH from the login node to the allocated compute node. For example, for c002:
[username@login1 ~]$ ssh c002.m3s
# Don't forget to include .m3s at the end!
# Alternatively, use the login node as a jump host; jumping lands you directly on the compute node:
[username@your_laptop ~]$ ssh -J username@l1.m3c.uni-tuebingen.de username@c002.m3s
For working interactively with Jupyter notebooks, please read the Slurm Tips section.
srun --pty bash
srun uses most of the options available to sbatch. When the interactive allocation starts, a new bash session will start up on one of the granted nodes. Example:
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$ hostname
login1
[username@login1 ~]$ srun --job-name "InteractiveJob" --ntasks=1 --nodes=1 --time 1:00:00 --pty bash
[username@c001 ~]$ hostname
c001
[username@c001 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
87637 cpu1 Interact username R 0:16 1 c001
[username@c001 ~]$ exit
exit
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$
squeue lists no jobs running in the beginning. Then the interactive job is launched with the srun --pty bash command. Note that the hostname in the prompt has changed, and you can now run your commands on the allocated node. squeue now lists job 87637 as running. Finally, to exit the interactive job, type exit. After exiting, squeue shows no jobs running.
salloc
salloc uses most of the options available to sbatch. salloc functions similarly to srun --pty bash in that it adds your resource request to the queue. However, once the allocation starts, a new bash session will start up on the login node. To run commands on the allocated nodes, you need to use srun. Example:
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$ salloc --partition=cpu1 --nodes=2 --job-name="InteractiveJob" --time=0:00:30
salloc: Granted job allocation 87638
salloc: Waiting for resource configuration
salloc: Nodes c00[1-2] are ready for job
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
87638 cpu1 Interact username R 0:09 2 c00[1-2]
[username@login1 ~]$ hostname
login1
[username@login1 ~]$ srun hostname
c001
c002
[username@login1 ~]$ exit
exit
salloc: Relinquishing job allocation 87638
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$
squeue lists no jobs running in the beginning. Then the interactive job is launched with salloc. squeue now shows job 87638 as running. Running hostname simply outputs the hostname of the login node. However, prefixing hostname with srun runs it on the allocated nodes instead (in parallel) and outputs their hostnames. Finally, to exit the interactive job, type exit. squeue now shows no jobs running.
salloc --no-shell (optionally with screen)
This method allows you to resume shell/job sessions after connection losses, for example if you start at work and want to continue at home, even if the login server has to be rebooted.
To achieve this, start with:
[username@login1 ~]$ salloc --partition=gpu-a30 --gres=gpu:1 --job-name="InteractiveJob" --no-shell
salloc: Granted job allocation 87639
salloc: Waiting for resource configuration
salloc: Nodes g025 are ready for job
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
87639 gpu-a30 Interact username R 0:09 1 g025
which does not start a new shell on the login node. To actually use the resources, simply SSH to the node; this (including some magic) will place you in the sandbox of the allocation, where you can execute what you want. Among other things, you can use screen or tmux there to start resumable shell sessions.
[username@login1 ~]$ ssh g025
Last login: Tue Mar 29 17:22:45 2024 from 192.168.43.211
[username@g025 ~]$ echo $SLURM_JOB_ID
87639
[username@g025 ~]$ nvidia-smi # check that the allocation has 1 A30 gpu, as requested above
Wed Apr 10 11:07:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:21:00.0 Off | 0 |
| N/A 40C P0 31W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[username@g025 ~]$ screen
# Now, you are inside a screen session...
# Detach from it with 'CTRL-a d'
[detached from 72849.pts-0.g025]
[username@g025 ~]$ exit
[username@login1 ~]$ ssh g025
Last login: Wed Apr 10 11:07:59 2024 from 192.168.43.211
[username@g025 ~]$ screen -r # resume the previous session
...
To actually stop the allocation, you have to cancel the job with scancel 87639.
Once submitted, the job will be queued for some time, depending on how many jobs are currently in the queue. Eventually, once enough previously submitted jobs have completed, the job will be started on one or more of the nodes determined by its resource requirements. The status of the job can be queried with the squeue command.
Option | Description |
---|---|
-j <jobid> | Display information for the specified job ID. |
-j <jobid> -o %all | Display all information fields (with a vertical bar separating each field) for the specified job ID. |
--start | Shows the approximate time at which your pending job will start. |
-l | Display information in long format. |
-n <job_name> | Display information for the specified job name. |
-t <state_list> | Display jobs that have the specified state(s). Valid job states include PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, TIMEOUT, NODE_FAIL, PREEMPTED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, COMPLETING, CONFIGURING, RESIZING, REVOKED, and SPECIAL_EXIT. |
For example, to see pending jobs, run:
squeue -t PENDING
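Another standard invocation that is often handy restricts the listing to your own jobs:
squeue -u $USER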
You can also use the sstat command to get info on a running job as follows:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
You can use sacct to get details of a previously run job. Examples:
sacct -j 15370
sacct -j 15370 --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,NodeList
If you're interested in learning how to requeue or hold jobs, keep reading!
You can use scontrol to requeue a running, suspended or finished Slurm batch job into the pending state as follows:
scontrol requeue comma_separated_list_of_job_IDs
When a job is requeued, the batch script is initiated from its beginning.
You can also include the --requeue option in your batch script as follows:
#SBATCH --requeue
This specifies that the batch job is eligible to be requeued.
The job may be requeued explicitly by us, after node failure, or upon preemption by a higher priority job.
You can prevent a pending job from starting by holding it as follows:
scontrol hold comma_separated_list_of_job_IDs
and then use the release command to permit scheduling it:
scontrol release comma_separated_list_of_job_IDs
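For example, using the job ID returned earlier on this page (purely illustrative):
scontrol hold 65648
scontrol release 65648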