Running jobs with Slurm
- Objectives
- Slurm basics
- Partitions
- Slurm scripts
- Submitting a job
- Checking completed jobs with sacct
Objectives
You will learn:
- what partitions are and which partitions are available on Mahuika
- how to write Slurm scripts
- how to submit jobs
- how to check the status of running and completed jobs
Slurm basics
All NeSI systems use the Slurm batch scheduler for the submission, control and management of user jobs.
Slurm provides a rich set of features for organising your workload and an extensive array of tools for managing your resource usage. In most cases you need to know the commands:
- sbatch - submit a batch script
- squeue - check the status of jobs on the system
- scancel - delete one of your jobs from the queue
- srun - launch a process across multiple CPUs
- sinfo - view information about Slurm nodes and partitions
- sacct - display accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database
Reference information about Slurm can be found at:
- man slurm
- https://slurm.schedmd.com/
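As a sketch of how these commands fit together in day-to-day use (the script name and job ID below are hypothetical):
sbatch my_job.sl        # submit the batch script; Slurm replies with a job ID, e.g. 1234567
squeue -u $USER         # check the status of your queued and running jobs
scancel 1234567         # cancel the job if it is no longer needed
sacct -j 1234567        # after it finishes, review the job's accounting data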
Partitions
Jobs in the Slurm queue have a priority which depends on several factors including size, age, owner, and the “partition” to which they belong. Each partition can be considered as an independent queue, with the slight complications that a job can be submitted to multiple partitions (though it will only run in one of them) and a compute node may belong to multiple partitions.
Partitions on Mahuika
Important Note: This page is subject to change, particularly numerical values.
Partition Name | Time limit | CPU cores | Max cores per user | RAM / core | Purpose
---|---|---|---|---|---
large | 3 days | 8424 | 1024 | 3 GB | Default partition, allows large core count jobs.
long | 3 weeks | 1872 | 720 | 3 GB | Corrals long-duration jobs into a subset of the compute nodes.
prepost | 3 hours | 36 | 4 | 15 GB | Short jobs only. More memory per CPU.
bigmem | 3 days | 108 | 72 | 15 GB | Standard partition for all other “large memory” jobs.
hugemem | 3 days | 64 | 64 | 62 GB | The 4 TB node – when it is available for batch processing.
gpu | 3 days | 12 | 4 + 2 GPUs | 3 GB | 2 GPGPUs per node.
Other limits
No individual job can request more than 20,000 CPU hours.
By default no user can put more than 1000 jobs in the queue at a time. This limit will be relaxed for those who need to submit large numbers of jobs, provided that they undertake to do it with job arrays.
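As a hedged sketch of what a job array looks like, the script below runs many similar tasks under a single submitted job; the program name and input file pattern are illustrative assumptions, not a NeSI requirement:
#!/bin/bash
#SBATCH --job-name=array_example   # job name (shows up in the queue)
#SBATCH --account=nesi99999        # Project Account
#SBATCH --time=01:00:00            # Walltime per array task (HH:MM:SS)
#SBATCH --array=1-100              # run 100 array tasks, with indices 1 to 100
srun ./myprog input_${SLURM_ARRAY_TASK_ID}.dat   # each task reads its own (hypothetical) input file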
Quality of Service
In addition to the partitions, each job has a “QoS”, with the default QoS for a job being determined by the allocation class of its project. Specifying --qos=debug
will override that and give the job very high priority, but is subject to strict limits: 15 minutes per job, and only 1 job at a time per user.
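For example, a short test run could be given the debug QoS either as a directive in the script or at submission time; the script name and account below are placeholders:
#SBATCH --qos=debug                              # high priority; max 15 minutes, 1 job at a time
sbatch --qos=debug --account=nesi99999 test.sl   # equivalent, specified on the command line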
Slurm scripts
Slurm scripts are text files you will need to create in order to submit a job to the scheduler. They start with #!/bin/bash and contain a set of directives (lines beginning with #SBATCH), followed by the commands to run (e.g. srun):
#!/bin/bash
#SBATCH --job-name=JobName # job name (shows up in the queue)
#SBATCH --account=nesi99999 # Project Account
#SBATCH --time=08:00:00 # Walltime (HH:MM:SS)
#SBATCH --mem-per-cpu=1500 # memory/cpu (in MB)
#SBATCH --ntasks=2 # number of tasks (e.g. MPI)
#SBATCH --cpus-per-task=4 # number of cores per task (e.g. OpenMP)
#SBATCH --partition=long # specify a partition
#SBATCH --hint=nomultithread # don't use hyperthreading
srun [options] <executable> [options]
Not all directives need to be specified, just the ones the job requires.
Launching job steps with srun
The srun command runs the executable along with its options, within the resources allocated to the job.
For MPI jobs, srun sets up the MPI runtime environment needed to run the parallel program, launching it on multiple CPUs, which can be on different nodes. srun should be used in place of any other MPI launcher, such as aprun or mpirun.
Commonly used Slurm environment variables
These can be useful within Slurm scripts:
- $SLURM_JOB_ID (job id)
- $SLURM_NNODES (number of nodes)
- $SLURM_NTASKS (number of MPI tasks)
- $SLURM_CPUS_PER_TASK (CPUs per MPI task)
- $SLURM_SUBMIT_DIR (directory job was submitted from)
- $SLURM_ARRAY_JOB_ID (job id for the array)
- $SLURM_ARRAY_TASK_ID (job array index value)
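A small sketch of using these variables inside a Slurm script; the echoed text is purely illustrative:
echo "Job ${SLURM_JOB_ID} started in ${SLURM_SUBMIT_DIR}"    # identify the job and its submit directory
echo "Using ${SLURM_NNODES} node(s), ${SLURM_NTASKS} task(s), ${SLURM_CPUS_PER_TASK} CPU(s) per task"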
MPI jobs
For MPI jobs you need to set --ntasks to a value larger than 1, or if you want more control of task layout, set --ntasks-per-node and --nodes instead.
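For example, the directives below place 4 MPI tasks on each of 2 nodes (8 tasks in total); the numbers and program name are illustrative only:
#SBATCH --nodes=2                  # spread the job over 2 nodes
#SBATCH --ntasks-per-node=4        # 4 MPI tasks on each node, 8 in total
srun ./my_mpi_program              # srun acts as the MPI launcher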
OpenMP jobs
For OpenMP jobs you need to set --cpus-per-task
to a value larger than 1. Our Slurm prolog will then set OMP_NUM_THREADS equal to that number.
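A minimal OpenMP sketch, assuming a hypothetical multithreaded executable; OMP_NUM_THREADS is set by the prolog as described above:
#SBATCH --ntasks=1                 # one task (a single process)
#SBATCH --cpus-per-task=8          # 8 OpenMP threads for that task
#SBATCH --hint=nomultithread       # one thread per physical core (see Hyperthreading)
srun ./my_openmp_program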
Hyperthreading
Hyperthreading is enabled on NeSI’s platforms.
By default, Slurm schedules multithreaded jobs using hyperthreads (logical cores, or “CPUs” in Slurm nomenclature), of which there are two for each physical core, so 72 and 80 per node on Mahuika and Māui, respectively.
To turn hyperthreading off you can use the srun option --hint=nomultithread, or to ensure that it is on, --hint=multithread. We recommend that any job with --cpus-per-task greater than 1 set one or the other of those options. Like other srun options, --hint can also be given to sbatch as a directive or command line option, and it will then be inherited (via the environment) by any occurrences of srun within the job.
#SBATCH --hint=nomultithread
Even though hyperthreading is enabled, resources will by default be allocated to jobs and their tasks at the level of a physical core, so two different jobs or job tasks will not share a physical core. For example, a job requesting resources for three threads will be allocated two full physical cores.
Important: Hyperthreading can be beneficial for some codes, but it can also degrade performance in other cases. We therefore recommend running a small test job with and without hyperthreading to determine the best choice.
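One way to run such a test, assuming a script test_job.sl that does not itself set --hint, is to pass the option at submission time so it is inherited by srun within the job:
sbatch --hint=multithread test_job.sl      # threads may share physical cores
sbatch --hint=nomultithread test_job.sl    # one thread per physical core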
Requesting memory
Due to hyperthreading being enabled, a non-hyperthreaded job will use twice as many logical CPUs as it has threads, and so the value given to --mem-per-cpu
should be half of what it would have been on Pan. On an ordinary Mahuika compute node there is 3 GB available per physical core, so 1.5 GB per logical core.
Accounting for CPUs, Memory, and GPUs
Because Slurm is configured to allow the use of logical cores, you will notice that even a serial job reports using 2 CPUs. These numbers will be halved before being used in our project accounting, which is still based on physical core hours.
Jobs which use more memory per CPU than is indicated in the table above will be counted as having occupied the equivalent number of CPUs. Also “bigmem” CPUs count as 2 ordinary CPUs and “hugemem” CPUs as 4. GPUs count as 56 CPUs.
Mahuika Infiniband Islands
Mahuika’s network consists of a number of Infiniband Islands, each containing 26 nodes or 936 physical cores. Parallel jobs that run entirely within an InfiniBand Island will achieve better application scaling performance than those that cross InfiniBand Island boundaries.
Users can request that the job run within an InfiniBand Island by adding the sbatch
flag #SBATCH --switches=1
to their batch script. We advise that you manually set a maximum waiting time for the selected number of switches, e.g. #SBATCH --switches=1@01:00:00
will make the scheduler wait for a maximum of one hour before ignoring the switches request.
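For instance, a parallel job small enough to fit in one island (936 physical cores or fewer) might combine the switches request with its usual resource directives; the task count here is an arbitrary illustration:
#SBATCH --ntasks=256               # fits comfortably within a single island
#SBATCH --switches=1@01:00:00      # prefer one InfiniBand island, but wait at most 1 hour
srun ./my_mpi_program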
Submitting a job
Use sbatch <script> to submit the job. All Slurm directives can alternatively be specified at the command line, e.g. sbatch --account=nesi12345 <script>.
Try submitting a simple job
Submit the job helloworld.sl:
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:02:00
srun echo "Hello, World!"
with sbatch --account=nesi12345 helloworld.sl
where nesi12345 is your NeSI project’s code. If you only have one project then you don’t need to specify it.
Checking completed jobs with sacct
Another useful Slurm command is sacct
which retrieves information about completed jobs. For example:
sacct -j 14309
where the argument passed to -j
is the job ID, will show us something like:
JobID JobName Elapsed TotalCPU AllocCPUS MaxRSS State
------------ ---------- ---------- ---------- ---------- ---------- ----------
14309 problem.sh 00:12:42 00:00.012 80 COMPLETED
14309.batch batch 00:12:42 00:00.012 80 1488K COMPLETED
14309.0 yourapp 00:12:41 16:00:03 80 478356K COMPLETED
By default sacct
will list all of your jobs which were (or are) running on the current day. Each job will show as more than one line (unless -X
is specified): an initial line for the job as a whole, and then an additional line for each job step, i.e.: the batch process which is your executing script, and then each of the srun
commands it executes.
By changing the displayed columns you can gain different information about the job, for example
sacct -j 14309 --format=jobid,jobname,partition,alloctres,exitcode
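A couple of further hedged examples combining the options mentioned above (the job ID and date are placeholders):
sacct -X -j 14309                                                        # one summary line per job, no job steps
sacct -X --starttime=2024-01-01 --format=jobid,jobname,elapsed,state    # your jobs since a given date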