What is Parallel Computing
Overview
Teaching: 30 min
Exercises: 10 minQuestions
How do we execute a task in parallel?
What benefits arise from parallel execution?
What are the limits of gains from execution in parallel?
What is the difference between implicit and explicit parallelisation.
Objectives
Prepare a job submission script for the parallel executable.
Methods of Parallel Computing
To understand the different types of Parallel Computing we first need to clarify some terms.
CPU: Unit that does the computations.
Task: One or more CPUs that share memory.
Node: The physical hardware. The upper limit on how many CPUs can be in a task.
Shared Memory: When multiple CPUs are used within a single task.
Distributed Memory: When multiple tasks are used.
Which methods are available to you is largely dependent on the nature of the problem and software being used.
Shared-Memory (SMP)
Shared-memory multiproccessing divides work among CPUs or threads, all of these threads require access to the same memory.
Often called Multithreading.
This means that all CPUs must be on the same node, most Mahuika nodes have 72 CPUs.
Shared memory parallelism is what is used in our example script array_sum.r
.
Number of threads to use is specified by the Slurm option --cpus-per-task
.
Shared Memory Example
Create a new script called
smp-job.sl
#!/bin/bash -e #SBATCH --job-name smp-job #SBATCH --account nesi99991 #SBATCH --output %x.out #SBATCH --mem-per-cpu 500 #SBATCH --cpus-per-task 8 echo "I am task #${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs"
then submit with
[yourUsername@mahuika ~]$ sbatch smp-job.sl
Solution
Checking the output should reveal
[yourUsername@mahuika ~]$ cat smp-job.out
I am task #0 running on node 'wbn224' with 8 CPUs
Distributed-Memory (MPI)
Distributed-memory multiproccessing divides work among tasks, a task may contain multiple CPUs (provided they all share memory, as discussed previously).
Message Passing Interface (MPI) is a communication standard for distributed-memory multiproccessing. While there are other standards, often ‘MPI’ is used synonymously with Distributed parallelism.
Each task has it’s own exclusive memory, tasks can be spread across multiple nodes, communicating via and interconnect. This allows MPI jobs to be much larger than shared memory jobs. It also means that memory requirements are more likely to increase proportionally with CPUs.
Distributed-Memory multiproccessing predates shared-memory multiproccessing, and is more common with classical high performance applications (older computers had one CPU per node).
Number of tasks to use is specified by the Slurm option --ntasks
, because the number of tasks ending up on one node is variable you should use --mem-per-cpu
rather than --mem
to ensure each task has enough.
Tasks cannot share cores, this means in most circumstances leaving --cpus-per-task
unspecified will get you 2
.
Distributed Memory Example
Create a new script called
mpi-job.sl
#!/bin/bash -e #SBATCH --job-name mpi-job #SBATCH --account nesi99991 #SBATCH --output %x.out #SBATCH --mem-per-cpu 500 #SBATCH --ntasks 4 srun bash -c 'echo I am task \#${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs'
then submit with
[yourUsername@mahuika ~]$ sbatch mpi-job.sl
Solution
[yourUsername@mahuika ~]$ cat mpi-job.out
I am task #1 running on node 'wbn012' with 2 CPUs I am task #3 running on node 'wbn010' with 2 CPUs I am task #0 running on node 'wbn009' with 2 CPUs I am task #2 running on node 'wbn063' with 2 CPUs
Using a combination of Shared and Distributed memory is called Hybrid Parallel.
Hybrid Example
Create a new script called
hybrid-job.sl
#!/bin/bash -e #SBATCH --job-name hybrid-job #SBATCH --account nesi99991 #SBATCH --output %x.out #SBATCH --mem-per-cpu 500 #SBATCH --ntasks 2 #SBATCH --cpus-per-task 4 srun bash -c 'echo I am task \#${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs'
[yourUsername@mahuika ~]$ sbatch hybrid-job.sl
Solution
[yourUsername@mahuika ~]$ cat hybrid-job.out
I am task #0 running on node 'wbn016' with 4 CPUs I am task #1 running on node 'wbn022' with 4 CPUs
GPGPU’s
GPUs compute large number of simple operation in parallel, making them well suited for Graphics Processing (hence the name), or any other large matrix operations.
On NeSI, GPU’s are specialised pieces of hardware that you request in addition to your CPUs and memory.
You can find an up-to-date(ish) list of GPUs available on NeSI in our Support Documentation
GPUs can be requested using --gpus-per-node=<gpu_type>:<gpu_number>
Depending on the GPU type, we may also need to specify a partition using --partition
.
GPU Job Example
Create a new script called
gpu-job.sl
#!/bin/bash -e #SBATCH --job-name gpu-job #SBATCH --account nesi99991 #SBATCH --output %x.out #SBATCH --mem-per-cpu 2G #SBATCH --gpu-per-node P100:1 module load CUDA nvidia-smi
then submit with
[yourUsername@mahuika ~]$ sbatch gpu-job.sl
Solution
[yourUsername@mahuika ~]$ cat gpu-job.out
Tue Mar 12 19:40:51 2024 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla P100-PCIE... On | 00000000:05:00.0 Off | 0 | | N/A 28C P0 24W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+
Job Array
Job arrays are not “multiproccessing” in the same way as the previous two methods. Ideal for embarrassingly parallel problems, where there are little to no dependencies between the different jobs.
Can be thought of less as running a single job in parallel and more about running multiple serial-jobs simultaneously. Often this will involve running the same process is run on multiple inputs.
Embarrassingly parallel jobs should be able scale without any loss of efficiency. If this type of parallelisation is an option, it will almost certainly be the best choice.
A job array can be specified using --array
If you are writing your own code, then this is something you will probably have to specify yourself.
Job Array Example
Create a new script called
array-job.sl
#!/bin/bash -e #SBATCH --job-name array-job #SBATCH --account nesi99991 #SBATCH --output %x_%a.out #SBATCH --mem-per-cpu 500 #SBATCH --array 0-3 echo "I am task #${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs"
then submit with
[yourUsername@mahuika ~]$ sbatch array-job.sl
Solution
ls
array-job_0.out array-job_1.out array-job_2.out array-job_3.out
Each of which should contain,
[yourUsername@mahuika ~]$ cat array-job*.out
I am task #0 running on node 'wbn*' with 2 CPUs
How to Utilise Multiple CPUs
Requesting extra resources through Slurm only means that more resources will be available, it does not guarantee your program will be able to make use of them.
Generally speaking, Parallelism is either implicit where the software figures out everything behind the scenes, or explicit where the software requires extra direction from the user.
Scientific Software
The first step when looking to run particular software should always be to read the documentation. On one end of the scale, some software may claim to make use of multiple cores implicitly, but this should be verified as the methods used to determine available resources are not guaranteed to work.
Some software will require you to specify number of cores (e.g. -n 8
or -np 16
), or even type of paralellisation (e.g. -dis
or -mpi=intelmpi
).
Occasionally your input files may require rewriting/regenerating for every new CPU combintation (e.g. domain based parallelism without automatic partitioning).
Writing Code
Occasionally requesting more CPUs in your Slurm job is all that is required and whatever program you are running will automagically take advantage of the additional resources. However, it’s more likely to require some amount of effort on your behalf.
It is important to determine this before you start requesting more resources through Slurm
If you are writing your own code, some programming languages will have functions that can make use of multiple CPUs without requiring you to changes your code. However, unless that function is where the majority of time is spent, this is unlikely to give you the performance you are looking for.
Python: Multiproccessing (not to be confused with threading
which is not really parallel.)
MATLAB: Parpool
Key Points
Parallel programming allows applications to take advantage of parallel hardware; serial code will not ‘just work.’
There are multiple ways you can run