In general, there are two methods for executing jobs within the cluster:
In the non-interactive approach, you create a script using bash, Python, or similar languages, define the job parameters, and submit it using the sbatch command.
In this scenario, Slurm reserves the necessary resources for your job and executes it on the compute nodes when the required resources become available.
With the interactive method, Slurm reserves computing resources for you and grants you shell access, allowing you to directly execute commands.
Before submitting any job for execution, it's advisable to check the overall status of the available resources and Slurm itself.
The command sinfo provides status information about the nodes in the various partitions, along with the time limit associated with each partition.
The default partition is marked with an "*".
This information can be useful in deciding where to submit your job. The status codes (in abbreviated form) are explained below, followed by an example sinfo call:
| Status code | Meaning |
|---|---|
| alloc | The node has been allocated to one or more jobs. |
| mix | The node has some of its CPUs ALLOCATED while others are IDLE. |
| idle | The node is not allocated to any jobs and is available for use. |
| down | The node is down and unavailable for use. |
| drain | The node is unavailable for use per system administrator request (for maintenance etc.). |
| drng | The node is being drained but is still running a user job; it will be marked as drained right after the user job finishes. Do not worry if you have a job running on a node in this state. |
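For example, running sinfo on a login node produces output along the following lines (the partition names, node counts and time limits shown here are purely illustrative, not the actual cluster configuration):

[username@login1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu1*        up 3-00:00:00      3   idle c[002-004]
cpu1*        up 3-00:00:00      1    mix c001
gpu-a30      up 1-00:00:00      1   idle g025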
The main way to run jobs is by submitting a script with the sbatch command. The command to submit a job is as simple as:
sbatch myscript.sh
The commands specified in the myscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won’t stop if you disconnect from the cluster.
An sbatch script is just a script with additional #SBATCH directives to tell the scheduler about your job's requirements. A typical submission script, in this case myscript.sh running a program called myprog with arguments, will look like this:
#!/bin/bash
#SBATCH --partition=gpu-a30 # Request a specific partition
#SBATCH --ntasks=1 # Number of tasks (see below)
#SBATCH --cpus-per-task=2 # Number of CPU cores per task
#SBATCH --nodes=1 # Ensure that all cores are on one machine
#SBATCH --time=0-00:05 # Runtime in D-HH:MM
#SBATCH --gres=gpu:2 # Optionally type and number of gpus
#SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hostname_%j.out # File to which STDOUT will be written
#SBATCH --error=hostname_%j.err # File to which STDERR will be written
#SBATCH --mail-type=END # Type of email notification - BEGIN,END,FAIL,ALL
#SBATCH --mail-user=user@abc.com # Email to which notifications will be sent
# Print info about current job - makes debug easy
scontrol show job $SLURM_JOB_ID
# Insert your commands here
./myprog $arg1 $arg2
Now the above script can be submitted as:
arg1=val1 arg2=val2 sbatch myscript.sh
# or by overriding options (e.g. the partition) on the command line instead:
arg1=val1 arg2=val2 sbatch -p gpu-a30 myscript.sh
In general, the script is composed of three parts:
- The #!/bin/bash line allows the script to be run as a bash script (it could also be Python or another interpreter).
- The #SBATCH lines are technically bash comments, but are interpreted as parameters for the SLURM scheduler.
- The remaining lines contain the commands to run.

The individual #SBATCH directives are explained below.

#SBATCH --partition=gpu-a30
Specifies the SLURM partition (a.k.a. queue) under which the script will be run.
#SBATCH --ntasks=1
Sets the number of tasks that you're requesting. Make sure that your code can use multiple cores before requesting more than one. When running MPI code, --ntasks should be set to the number of parallel MPI programs you want to start (i.e., the same number you give as the -n parameter to mpirun). When running OpenMP code (i.e., without MPI), you do not need to set this option (if you do set it, set it to --ntasks=1). If this parameter is omitted, SLURM assumes --ntasks=1.
#SBATCH --cpus-per-task=2
Sets the number of CPU cores per task that you're requesting. Make sure that your code can use multiple cores before requesting more than one. When running OpenMP code (with or without MPI), --cpus-per-task should be set to the value of the environment variable OMP_NUM_THREADS, i.e., the number of threads that you want to use (see the combined sketch after this list). If this parameter is omitted, SLURM assumes --cpus-per-task=1.
#SBATCH --nodes=1
Requests that the cores are all on one node. Only change this to >1 if you know your code uses a protocol like MPI. SLURM makes no assumptions on this parameter: if you request more than one task (--ntasks > 1) and you forget this parameter, your job may be scheduled across nodes; and unless your job is MPI (multinode) aware, it will run slowly, as it is oversubscribed on the master node and wasting resources on the other(s).
#SBATCH --time=5
Specifies the running time for the job in minutes. You can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be cancelled.
#SBATCH --gres=gpu:a30:2
Requests the type and number of GPUs.
#SBATCH --mem=50G
Specifies the amount of memory that you will be using for your job. Default units are megabytes; different units can be specified using the suffix [K|M|G|T]. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool for one or more cores, and is the recommended option to use. If you must do work across multiple compute nodes (e.g., MPI code), then you must use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you have requested, whether they are on one node or multiple nodes. If this parameter is omitted, the smallest amount is allocated, usually 100 MB, and chances are good that your job will be killed, as it will likely go over this amount.
#SBATCH --output=hostname_%j.out
Specifies the file to which standard output will be appended (to truncate the file at the start of the job, use #SBATCH --open-mode=truncate). If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the job ID at runtime; there are other placeholders (see the official documentation). If --output is omitted, any output will be directed to a file named slurm-<JobID>.out in the directory the job is submitted from. Any directory in the specified path must already exist, otherwise the job will silently (without output) fail when it starts: the error is not detected at submission time, so the job is accepted and may sit PENDING as usual before failing.
#SBATCH --error=hostname_%j.err
Specifies the file to which standard error will be appended. SLURM submission and processing errors will also appear in this file. If this parameter is omitted, any error output will be directed to the file specified by --output. Has the same characteristics as --output.
#SBATCH --mail-type=END
Because jobs are processed in the "background" and can take some time to run, it is useful to send an email message when the job has finished (--mail-type=END). Email can also be sent for other processing stages (BEGIN, FAIL) or for all of them (ALL). Note that if you set this, you also need to specify a valid email address with --mail-user, otherwise SLURM does not know the mail destination.
#SBATCH --mail-user=user@abc.com
The email address to which the --mail-type messages will be sent. The address is validated before it is used.
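To tie the CPU- and memory-related directives together, here is a minimal sketch of a single-node OpenMP-style job; the partition name and the program name my_openmp_prog are placeholders, and the values are only examples to adapt to your own code:

#!/bin/bash
#SBATCH --partition=cpu1           # placeholder partition name
#SBATCH --ntasks=1                 # one (non-MPI) program
#SBATCH --cpus-per-task=8          # eight OpenMP threads
#SBATCH --nodes=1                  # keep all cores on one node
#SBATCH --mem=16G                  # total memory for the job
#SBATCH --time=0-01:00             # 1 hour in D-HH:MM
#SBATCH --output=openmp_%j.out     # STDOUT file, %j = job ID

# Match the number of OpenMP threads to the allocated cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_prog                   # placeholder program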
If the sbatch command was executed successfully, the Job ID will be returned as follows:
Submitted batch job 65648.
It is a good idea to note down your Job IDs. It also helps to put scontrol show job $SLURM_JOB_ID at the beginning of your job scripts (as in the example above) so that information about the job is printed to the out file; this is useful for debugging in case the job fails.

There are several ways to interactively use compute nodes, each having its own requirements, restrictions and benefits. We recommend using a screen or tmux session for each interactive job. If you want a very long-running interactive job, pay attention to the last option, as it can also handle login node reboots.
An interactive session will, once it starts, block the entire requested resources until it is terminated, even if they are unused. To avoid unnecessary charges to your account and blocking resources for others, don't forget to exit an interactive session once your task has finished, or specify a time limit for your interactive job so that it exits automatically once the limit is reached.
Before continuing, note that to access the compute nodes through login nodes, you may either need to forward your SSH agent connection or use the login node as a jump host.
Here's how to do it:
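# Option 1: forward your SSH agent connection to the login node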
[ubuntu@your_laptop ~]$ ssh -A username@l1.m3c.uni-tuebingen.de
# Then, SSH from the login node to the allocated compute node. For example, for c002:
[username@login1 ~]$ ssh c002.m3s # Don't forget to include .m3s at the end!
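# Option 2: use the login node as a jump host (single command from your laptop)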
[ubuntu@your_laptop ~]$ ssh -J username@l1.m3c.uni-tuebingen.de username@c002.m3s
# Jumping lands you directly on the compute node.
For working interactively with Jupyter notebooks please read the Software section.
screen

screen, or alternatively tmux (see man tmux), is useful for all of the following interactive job methods: it can be used to start sessions on login and/or compute nodes, which can be resumed if the connection to the screen host fails.
A small example:
[username@login1 ~]$ screen # should clear your shell window
# Now, inside the screen session...
[username@login1 ~]$ screen -ls
There are screens on:
9322.pts-1.login1 (Attached)
1 Sockets in /run/screen/S-root.
[username@login1 ~]$ longprogram2
OUTPUT...
# Detach from screen with 'CTRL-a d' (the program will continue to run)
[detached from 9322.pts-1.login1]
# Now, outside the screen session you should see the previous shell content again
[username@login1 ~]$ screen -m -d -S nu-session # start a 2nd _named_ session in _detached_ mode
[username@login1 ~]$ screen -ls
There are screens on:
9322.pts-1.login1 (Detached)
12629.nu-session (Detached)
2 Sockets in /run/screen/S-root.
[username@login1 ~]$ screen -r 9322.pts-1.login1 # reattach to the 1st session
[username@login1 ~]$ exit # quit it
[screen is terminating]
[username@login1 ~]$ screen -r # reattach to the only remaining session
...
One can also log in to the screen host (in this case login1) from a different client and resume or take over running screen sessions with the new connection:
[privateuser@pc ~]$ ssh m3l1
[username@login1 ~]$ screen -ls
There are screens on:
12629.nu-session (Attached)
1 Sockets in /run/screen/S-root.
[username@login1 ~]$ screen -r -d # Reattach the session and if necessary detach it first
# Now, inside the screen session...
[username@login1 ~]$
See man screen for the full functionality.
srun --pty bash

srun uses most of the options available to sbatch. When the interactive job starts, a new bash session will start up on one of the allocated nodes. The job ends once you exit this shell (or once the requested time limit is reached).
Usage example:
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$ hostname
login1
[username@login1 ~]$ screen
# Now, inside the screen session...
[username@login1 ~]$ srun --job-name "InteractiveJob" --ntasks=1 --nodes=1 --time 1:00:00 --pty bash
[username@c001 ~]$ hostname
c001
[username@c001 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
87637 cpu1 Interact username R 0:16 1 c001
[username@c001 ~]$ exit # the job
exit
[username@login1 ~]$ exit # the screen session
[screen is terminating]
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$
In the beginning, squeue lists no running jobs. A screen session is started before the interactive job is launched with the srun --pty bash command. Note that the host name in the prompt changes, and you can now run your commands on the allocated node; squeue now lists job 87637 as running. Finally, to quit the interactive job and the screen session, use exit twice. Afterwards, squeue again shows no running jobs.
salloc

salloc uses most of the options available to sbatch. salloc functions similar to srun --pty bash in that it will add your resource request to the queue. However, once the allocation starts, a new bash session will start up on the login node. The allocation survives (even without screen) if the login to compute node connection is lost. To run commands on the allocated node you need to use srun. Example:
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$ screen
# Now, inside the screen session...
[username@login1 ~]$ salloc --partition=cpu1 --nodes=2 --job-name="InteractiveJob" --time=0:00:30
salloc: Granted job allocation 87638
salloc: Waiting for resource configuration
salloc: Nodes c00[1-2] are ready for job
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
87638 cpu1 Interact username R 0:09 2 c00[1-2]
[username@login1 ~]$ hostname
login1
[username@login1 ~]$ srun hostname
c001
c002
[username@login1 ~]$ exit # the allocation
exit
salloc: Relinquishing job allocation 87638
[username@login1 ~]$ exit # the screen session
[screen is terminating]
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[username@login1 ~]$
In the beginning, squeue lists no running jobs. Then the interactive job is launched with salloc, and squeue shows job 87638 as running. Running hostname simply outputs the hostname of the login node; prefixing hostname with srun, however, runs it on the allocated nodes instead (in parallel) and outputs their hostnames. To finally quit the allocation, use exit. Afterwards, squeue shows no running jobs.
salloc --no-shell

This method lets you work on the compute node itself and keep very long-running, resumable sessions there, even across login node reboots (the screen/tmux session lives on the compute node, instead of running srun within the screen session). To achieve this, start with:
[username@login1 ~]$ salloc --partition=gpu-a30 --gres=gpu:1 --job-name="InteractiveJob" --no-shell
salloc: Granted job allocation 87639
salloc: Waiting for resource configuration
salloc: Nodes g025 are ready for job
[username@login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
87639 gpu-a30 Interact username R 0:09 1 g025
which does not start a new shell on the login node. To actually use the resources, simply ssh to the listed compute node; this - including some magic - will place you in the sandbox of the allocation, where you can execute what you want. Amongst other things, you can use screen or tmux there to start resumable shell sessions.
[username@login1 ~]$ ssh g025
Last login: Tue Mar 29 17:22:45 2024 from 192.168.43.211
[username@g025 ~]$ echo $SLURM_JOB_ID
87639
[username@g025 ~]$ nvidia-smi # check that the allocation has 1 A30 gpu, as requested above
Wed Apr 10 11:07:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:21:00.0 Off | 0 |
| N/A 40C P0 31W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[username@g025 ~]$ screen
# Now, you are inside a screen session...
# Detach from it with 'CTRL-a d'
[detached from 72849.pts-0.g025]
[username@g025 ~]$ exit
[username@login1 ~]$ ssh g025
Last login: Wed Apr 10 11:07:59 2024 from 192.168.43.211
[username@g025 ~]$ screen -r # resume the previous session
...
To actually stop the allocation you have to cancel the job with scancel 87639.
Once submitted, the job will be queued for some time, depending on how many jobs are presently submitted. Eventually, typically after previously submitted jobs have completed, the job will be started on one or more nodes, as determined by its resource requirements. The status of the job can be queried with the squeue command.
| Option | Description |
|---|---|
| -j <jobid> | Display information for the specified job ID. |
| -j <jobid> -o %all | Display all information fields (with a vertical bar separating each field) for the specified job ID. |
| --start | Shows the approximate time your job will start. |
| -l | Display information in long format. |
| -n <job_name> | Display information for the specified job name. |
| -t <state_list> | Display jobs that have the specified state(s). Valid job states include PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, TIMEOUT, NODE_FAIL, PREEMPTED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, COMPLETING, CONFIGURING, RESIZING, REVOKED, and SPECIAL_EXIT. |
For example, to see pending jobs run:
squeue -t PENDING
You can also use the sstat command to get information on a running job as follows:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
You can use sacct to get details of a previously run job. Examples:
sacct -j <jobid>
sacct -j <jobid> --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,NodeList
The seff command displays efficiency statistics for CPU and memory usage in completed jobs:
seff <jobid>
If you're interested in learning how to requeue or hold jobs, keep reading!
You can use scontrol to requeue a running, suspended or finished Slurm batch job into pending state as follows:
scontrol requeue comma_separated_list_of_job_IDs
When a job is requeued, the batch script is initiated from its beginning.
You can also include the "--requeue" option in your batch script as follows:
#SBATCH --requeue
This specifies that the batch job is eligible for being requeued.
The job may be requeued explicitly by us, after node failure, or upon preemption by a higher priority job.
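As a minimal illustrative sketch of a requeue-able script (the program name is a placeholder; it assumes the standard SLURM_RESTART_COUNT environment variable, which Slurm sets once a job has been requeued):

#!/bin/bash
#SBATCH --requeue                  # mark the job as eligible for requeueing
#SBATCH --time=0-01:00
#SBATCH --output=requeue_%j.out

# SLURM_RESTART_COUNT is only set after a requeue; treat it as 0 on the first run
echo "Restart count: ${SLURM_RESTART_COUNT:-0}"
./myprog $arg1 $arg2               # placeholder program, as in the example above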
You can prevent a pending job from starting by holding it as follows:
scontrol hold comma_separated_list_of_job_IDs
and then use the release command to permit scheduling it:
scontrol release comma_separated_list_of_job_IDs