Slurm is a cluster management and job scheduling system. In simple terms, users reserve the computing power they need (CPU, RAM, GPU) for a specific duration and start their calculations (called jobs). Once the job is completed or the allotted time expires, the resources are returned to the pool so they can be assigned to other users' jobs.
All jobs are placed in a queue, where they are arranged based on job priority. The calculation of priority will be discussed later in this document.
The login nodes are your entry point to the cluster. You can access them via SSH and use them to submit jobs to the compute nodes.
Be aware that login nodes are resource-constrained: each user is limited to 40 GB of RAM and 4 CPU cores. Run your calculations on the compute nodes, not on the login nodes.
If you want to learn more about Slurm, you can check out the quick start guide at https://slurm.schedmd.com/quickstart.html.
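The typical workflow looks roughly like the following sketch. The hostname and script name are placeholders, not the actual M3 values:

```shell
# Connect to a login node via SSH (hostname is a placeholder; use the M3 login node address)
ssh your_username@login.example.org

# Submit a batch job script to the scheduler
sbatch my_job.sh

# Check the state of your queued and running jobs
squeue -u $USER
```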
Compute nodes can be categorized by hardware specifications and are part of partitions.
Partitions consist of set of nodes and have additional properties, such as maximum and default time limits, priorities, etc.
You can list all M3 Cluster partitions with `sinfo -s` and obtain more detailed information with `scontrol show partition`.
Depending on your job's requirements, you can submit it to one or more suitable partitions.
You can view the queue of a specific partition with `squeue -p the_partition_name`.
As of April 2024, there are four partitions in M3:
Note that the Slurm `cpu` resource is actually a CPU thread, not a CPU core. For resource requests, Slurm will always round `cpu` up to a multiple of 2. This guarantees that a job does not share CPU cores with other jobs, and it also means that the effective minimum per job is `cpu=2`.
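The round-up rule can be illustrated with a small Python sketch (the function name is ours for illustration, not part of Slurm):

```python
def effective_cpu(requested: int) -> int:
    """Return the cpu count Slurm actually allocates:
    the request rounded up to the next multiple of 2."""
    return requested + (requested % 2)

# A request of cpu=1 is rounded up to the effective minimum of 2;
# odd requests like cpu=5 become 6, even requests are unchanged.
print(effective_cpu(1), effective_cpu(5), effective_cpu(8))  # → 2 6 8
```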
Other limitations & properties that apply to the partitions:
| | cpu1 | cpu2-hm | cpu3-long | gpu-a30 |
|---|---|---|---|---|
| Maximum time limit for jobs | 1 day | 1 day | 14 days | 1 day |
| Maximum number of jobs a user can run | - | - | - | - |
| Maximum resources a user can occupy | - | - | cpu=200, mem=2.25T | - |
| Maximum available resources of a node 1 | cpu=64, mem=455G | cpu=128, mem=1.83T | ? 2 | cpu=96, mem=455G, gpu=2 |
| Maximum resources available to the partition | cpu=1280, mem=8.9T | cpu=256, mem=3.5T | cpu=768, mem=5.3T | cpu=576, mem=2.6T, gpu=12 |
| Price for a single-core job | 1 | 1.3 | 2.6 | 1.7 (if GPU job, then 42 3) |
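A batch script for one of these partitions might look like the following sketch. All values are illustrative; in particular, `--gres=gpu:1` assumes the A30 cards are exposed as a `gpu` generic resource, which you should verify with `scontrol show partition gpu-a30`:

```shell
#!/bin/bash
#SBATCH --job-name=example        # arbitrary example name
#SBATCH --partition=gpu-a30       # one of the four partitions listed above
#SBATCH --cpus-per-task=4         # already a multiple of 2, so no round-up
#SBATCH --mem=32G
#SBATCH --gres=gpu:1              # assumes GPUs are exposed as a "gpu" GRES
#SBATCH --time=0-12:00:00        # must stay within the 1-day partition limit

srun ./my_gpu_program             # placeholder for your actual program
```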
If you're interested in the specifications of the compute nodes:
| | c001 - c020 | hm023 - hm024 | g025 - g030 |
|---|---|---|---|
| CPU | Model: AMD EPYC 7343 16-Core Processor<br>Threads per core: 2<br>Cores per socket: 16<br>Sockets: 2<br>Max MHz: 3940.6250<br>NUMA node0 CPUs: 0-15,32-47<br>NUMA node1 CPUs: 16-31,48-63 | Model: AMD EPYC 7542 32-Core Processor<br>Threads per core: 2<br>Cores per socket: 32<br>Sockets: 2<br>Max MHz: 2900.0000<br>NUMA node0 CPUs: 0-31,64-95<br>NUMA node1 CPUs: 32-63,96-127 | Model: AMD EPYC 74F3 24-Core Processor<br>Threads per core: 2<br>Cores per socket: 24<br>Sockets: 2<br>Max MHz: 4037.5000<br>NUMA node0 CPUs: 0-23,48-71<br>NUMA node1 CPUs: 24-47,72-95 |
| RAM | Type: DIMM DDR4 Synchronous Registered<br>Max MHz: 3200<br>Module size: 32 GiB<br>Total: 512 GiB | Type: DIMM DDR4 Synchronous Registered<br>Max MHz: 3200<br>Module size: 64 GiB<br>Total: 2 TiB | Type: DIMM DDR4 Synchronous Registered<br>Max MHz: 3200<br>Module size: 32 GiB<br>Total: 512 GiB |
| GPU | 0 | 0 | Product name: NVIDIA A30<br>Max clocks MHz: Graphics=1440, SM=1440, Memory=1215, Video=1305<br>FB memory total: 24576 MiB<br>BAR1 memory total: 32768 MiB<br>Per node: 2 GPUs |