Using resources effectively
Overview
Teaching: 30 min
Exercises: 15 minQuestions
How can I review past jobs?
How can I use this knowledge to create a more accurate submission script?
Objectives
Understand how to look up job statistics and profile code.
Understand job size implications.
Understand problems and limitations involved in using multiple CPUs.
What Resources?
Last time we submitted a job, we did not specify a number of CPUs, and therefore
we were provided the default of 2
(1 core).
As a reminder, our slurm script example-job.sl
currently looks like this.
#!/bin/bash -e
#SBATCH --job-name my_job
#SBATCH --account nesi99991
#SBATCH --mem 300M
#SBATCH --time 00:15:00
module purge
module load R/4.3.1-gimkl-2022a
Rscript array_sum.r
echo "Done!"
We will now submit the same job again with more CPUs.
We ask for more CPUs using by adding #SBATCH --cpus-per-task 4
to our script.
Your script should now look like this:
#!/bin/bash -e
#SBATCH --job-name my_job
#SBATCH --account nesi99991
#SBATCH --mem 300M
#SBATCH --time 00:15:00
#SBATCH --cpus-per-task 4
module purge
module load R/4.3.1-gimkl-2022a
Rscript array_sum.r
echo "Done!"
And then submit using sbatch
as we did before.
[yourUsername@mahuika ~]$ sbatch example-job.sl
Submitted batch job 23137702
Watch
We can prepend any command with
watch
in order to periodically (default 2 seconds) run a command. e.g.watch squeue --me
will give us up to date information on our running jobs. Care should be used when usingwatch
as repeatedly running a command can have adverse effects. Exitwatch
with ctrl + c.
Note in squeue, the number under cpus, should be ‘4’.
Checking on our job with sacct
.
Oh no!
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
27323464 my_job large nesi99991 4 OUT_OF_ME+ 0:125
27323464.ba+ batch nesi99991 4 OUT_OF_ME+ 0:125
27323464.ex+ extern nesi99991 4 COMPLETED 0:0
To understand why our job failed, we need to talk about the resources involved.
Understanding the resources you have available and how to use them most efficiently is a vital skill in high performance computing.
Below is a table of common resources and issues you may face if you do not request the correct amount.
Not enough | Too Much | |
---|---|---|
CPU | The job will run more slowly than expected, and so may run out of time and get killed for exceeding its time limit. | The job will wait in the queue for longer. You will be charged for CPUs regardless of whether they are used or not. Your fair share score will fall more. |
Memory | Your job will fail, probably with an 'OUT OF MEMORY' error, segmentation fault or bus error (may not happen immediately). | The job will wait in the queue for longer. You will be charged for memory regardless of whether it is used or not. Your fair share score will fall more. |
Walltime | The job will run out of time and be terminated by the scheduler. | The job will wait in the queue for longer. |
Measuring Resource Usage of a Finished Job
Since we have already run a job (successful or otherwise), this is the best source of info we currently have.
If we check the status of our finished job using the sacct
command we learned earlier.
[yourUsername@mahuika ~]$ sacct
JobID JobName Alloc Elapsed TotalCPU ReqMem MaxRSS State
--------------- ---------------- ----- ----------- ------------ ------- -------- ----------
31060451 example-job.sl 2 00:00:48 00:33.548 1G CANCELLED
31060451.batch batch 2 00:00:48 00:33.547 102048K CANCELLED
31060451.extern extern 2 00:00:48 00:00:00 0 CANCELLED
With this information, we may determine a couple of things.
Memory efficiency can be determined by comparing ReqMem (requested memory) with MaxRSS (maximum used memory), MaxRSS is given in KB, so a unit conversion is usually required.
So for the above example we see that 0.1GB (102048K) of our requested 1GB meaning the memory efficincy was about 10%.
CPU efficiency can be determined by comparing TotalCPU(CPU time), with the maximum possible CPU time. The maximum possible CPU time equal to Alloc (number of allocated CPUs) multiplied by Elapsed (Walltime, actual time passed).
For the above example 33 seconds of computation was done,
where the maximum possible computation time was 96 seconds (2 CPUs multiplied by 48 seconds), meaning the CPU efficiency was about 35%.
Time Efficiency is simply the Elapsed Time divided by Time Requested.
48 seconcds out of 15 minutes requested give a time efficiency of about 5%
Efficiency Exercise
Calculate for the job shown below,
JobID JobName Alloc Elapsed TotalCPU ReqMem MaxRSS State --------------- ---------------- ----- ----------- ------------ ------- -------- ---------- 37171050 Example-job 8 00:06:03 00:23:04 32G FAILED 37171050.batch batch 8 00:06:03 23:03.999 14082672k FAILED 37171050.extern extern 8 00:06:03 00:00.001 0 COMPLETED
a. CPU efficiency.
b. Memory efficiency.
Solution
a. CPU efficiency is
( 23 / ( 8 * 6 ) ) x 100
or around 48%.b. Memory efficiency is
( 14 / 32 ) x 100
or around 43%.
For convenience, NeSI has provided the command nn_seff <jobid>
to calculate Slurm Efficiency (all NeSI commands start with nn_
, for NeSI NIWA).
[yourUsername@mahuika ~]$ nn_seff <jobid>
Job ID: 27323570
Cluster: mahuika
User/Group: username/username
State: COMPLETED (exit code 0)
Cores: 1
Tasks: 1
Nodes: 1
Job Wall-time: 5.11% 00:00:46 of 00:15:00 time limit
CPU Efficiency: 141.30% 00:01:05 of 00:00:46 core-walltime
Mem Efficiency: 93.31% 233.29 MB of 250.00 MB
Knowing what we do now about job efficiency, lets submit the previous job again but with more appropriate resources.
#!/bin/bash -e
#SBATCH --job-name my_job
#SBATCH --account nesi99991
#SBATCH --mem 300M
#SBATCH --time 00:15:00
#SBATCH --cpus-per-task 4
module purge
module load R/4.3.1-gimkl-2022a
Rscript array_sum.r
echo "Done!"
[yourUsername@mahuika ~]$ sbatch example-job.sl
Hopefully we will have better luck with this one!
A quick description of Simultaneous Multithreading - SMT (aka Hyperthreading)
Modern CPU cores have 2 threads of operation that can execute independently of one another. SMT is the technology that allows the 2 threads within one physical core to present as multiple logical cores, sometimes referred to as virtual CPUS (vCPUS).
Note: Hyperthreading is Intel’s marketing name for SMT. Both Intel and AMD CPUs have SMT technology.
Some types of processes can take advantage of multiple threads, and can gain a performance boost. Some software is specifically written as multi-threaded. You will need to check or test if your code can take advantage of threads (we can help with this).
However, because each thread shares resources on the physical core,
there can be conflicts for resources such as onboard cache.
This is why not all processes get a performance boost from SMT and in fact can
run slower. These types of jobs should be run without multithreading. There
is a Slurm parameter for this: --hint=nomultithread
SMT is why you are provided 2 CPUs instead of 1 as we do not allow 2 different jobs to share a core. This also explains why you will sometimes see CPU efficiency above 100%, since CPU efficiency is based on core and not thread.
For more details please see our documentation on Hyperthreading
Measuring the System Load From Currently Running Tasks
On Mahuika, we allow users to connect directly to compute nodes from the login node. This is useful to check on a running job and see how it’s doing, however, we only allow you to connect to nodes on which you have running jobs.
The most reliable way to check current system stats is with htop
.
htop
is an interactive process viewer that can be launched from command line.
Finding job node
Before we can check on our job, we need to find out where it is running.
We can do this with the command squeue --me
, and looking under the ‘NODELIST’ column.
[yourUsername@mahuika ~]$ squeue --me
JOBID USER ACCOUNT NAME CPUS MIN_MEM PARTITI START_TIME TIME_LEFT STATE NODELIST(REASON)
26763045 cwal219 nesi99991 test 2 512M large May 11 11:35 14:46 RUNNING wbn144
Now that we know the location of the job (wbn189) we can use ssh
to run htop
on that node.
[yourUsername@mahuika ~]$ ssh wbn189 -t htop -u $USER
You may get a message:
ECDSA key fingerprint is SHA256:############################################
ECDSA key fingerprint is MD5:9d:############################################
Are you sure you want to continue connecting (yes/no)?
If so, type yes
and Enter
You may also need to enter your cluster password.
If you cannot connect, it may be that the job has finished and you have lost permission to ssh
to that node.
Reading Htop
You may see something like this,
top - 21:00:19 up 3:07, 1 user, load average: 1.06, 1.05, 0.96
Tasks: 311 total, 1 running, 222 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.2 us, 3.2 sy, 0.0 ni, 89.0 id, 0.0 wa, 0.2 hi, 0.2 si, 0.0 st
KiB Mem : 16303428 total, 8454704 free, 3194668 used, 4654056 buff/cache
KiB Swap: 8220668 total, 8220668 free, 0 used. 11628168 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1693 jeff 20 0 4270580 346944 171372 S 29.8 2.1 9:31.89 gnome-shell
3140 jeff 20 0 3142044 928972 389716 S 27.5 5.7 13:30.29 Web Content
3057 jeff 20 0 3115900 521368 231288 S 18.9 3.2 10:27.71 firefox
6007 jeff 20 0 813992 112336 75592 S 4.3 0.7 0:28.25 tilix
1742 jeff 20 0 975080 164508 130624 S 2.0 1.0 3:29.83 Xwayland
1 root 20 0 230484 11924 7544 S 0.3 0.1 0:06.08 systemd
68 root 20 0 0 0 0 I 0.3 0.0 0:01.25 kworker/4:1
2913 jeff 20 0 965620 47892 37432 S 0.3 0.3 0:11.76 code
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
Overview of the most important fields:
PID
: What is the numerical id of each process?USER
: Who started the process?RES
: What is the amount of memory currently being used by a process (in bytes)?%CPU
: How much of a CPU is each process using? Values higher than 100 percent indicate that a process is running in parallel.%MEM
: What percent of system memory is a process using?TIME+
: How much CPU time has a process used so far? Processes using 2 CPUs accumulate time at twice the normal rate.COMMAND
: What command was used to launch a process?
To exit press q.
Running this command as is will show us information on tasks running on the login node (where we should not be running resource intensive jobs anyway).
Running Test Jobs
As you may have to run several iterations before you get it right, you should choose your test job carefully. A test job should not run for more than 15 mins. This could involve using a smaller input, coarser parameters or using a subset of the calculations. As well as being quick to run, you want your test job to be quick to start (e.g. get through queue quickly), the best way to ensure this is keep the resources requested (memory, CPUs, time) small. Similar as possible to actual jobs e.g. same functions etc. Use same workflow. (most issues are caused by small issues, typos, missing files etc, your test job is a jood chance to sort out these issues.). Make sure outputs are going somewhere you can see them.
Serial Test
Often a good first test to run, is to execute your job serially e.g. using only 1 CPU. This not only saves you time by being fast to start, but serial jobs can often be easier to debug. If you confirm your job works in its most simple state you can identify problems caused by paralellistaion much more easily.
You generally should ask for 20% to 30% more time and memory than you think the job will use. Testing allows you to become more more precise with your resource requests. We will cover a bit more on running tests in the last lesson.
Efficient way to run tests jobs using debug QOS (Quality of Service)
Before submitting a large job, first submit one as a test to make sure everything works as expected. Often, users discover typos in their submit scripts, incorrect module names or possibly an incorrect pathname after their job has queued for many hours. Be aware that your job is not fully scanned for correctness when you submit the job. While you may get an immediate error if your SBATCH directives are malformed, it is not until the job starts to run that the interpreter starts to process the batch script.
NeSI has an easy way for you to test your job submission. One can employ the debug QOS to get a short, high priority test job. Debug jobs have to run within 15 minutes and cannot use more that 2 nodes. To use debug QOS, add or change the following in your batch submit script
#SBATCH --qos=debug #SBATCH --time=15:00
Adding these SBATCH directives will provide your job with the highest priority possible, meaning it should start to run within a few minutes, provided your resource request is not too large.
Initial Resource Requirements
As we have just discussed, the best and most reliable method of determining resource requirements is from testing, but before we run our first test there are a couple of things you can do to start yourself off in the right area.
Read the Documentation
NeSI maintains documentation that does have some guidance on using resources for some software However, as you noticed in the Modules lessons, we have a lot of software. So it is also advised to search the web for others that may have written up guidance for getting the most out of your specific software.
Ask Other Users
If you know someone who has used the software before, they may be able to give you a ballpark figure.
Next Steps
You can use this knowledge to set up the next job with a closer estimate of its load on the system. A good general rule is to ask the scheduler for 30% more time and memory than you expect the job to need.
Key Points
As your task gets larger, so does the potential for inefficiencies.
The smaller your job (time, CPUs, memory, etc), the faster it will schedule.