What is Parallel Computing
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How do we execute a task in parallel?
What benefits arise from parallel execution?
What are the limits of gains from execution in parallel?
What is the difference between implicit and explicit parallelisation?
Objectives
Prepare a job submission script for the parallel executable.
Methods of Parallel Computing
To understand the different types of Parallel Computing we first need to clarify some terms.
CPU: Unit that does the computations.
Task: Like a thread, but multiple tasks do not need to share memory.
Node: A single computer of the cluster. Nodes are made up of CPUs and RAM.
Shared Memory: When multiple CPUs are used within a single task.
Distributed Memory: When multiple tasks are used.
Which methods are available to you is largely dependent on the nature of the problem and software being used.
Shared-Memory (SMP)
Shared-memory multiprocessing divides work among CPUs (or threads), all of which require access to the same memory.
Often called multithreading.
This means that all CPUs must be on the same node; most Mahuika nodes have 72 CPUs.
Shared-memory parallelism is used in our example script sum_matrix.r.
The number of threads to use is specified by the Slurm option --cpus-per-task.
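As a concrete illustration, here is a minimal sketch of a shared-memory submission script. It assumes the sum_matrix.r example script, the nesi99991 project code used elsewhere in this lesson, and that an R module is available; the module name and memory figure are assumptions, not prescriptions.

#!/bin/bash -e

#SBATCH --job-name      smp-job
#SBATCH --account       nesi99991
#SBATCH --output        %x.out
#SBATCH --mem-per-cpu   512MB
#SBATCH --cpus-per-task 4           # four threads, all on the same node

# Module name/version is an assumption; check what is available on your cluster.
module load R
Rscript sum_matrix.r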
Distributed-Memory (MPI)
Distributed-memory multiprocessing divides work among tasks; a task may contain multiple CPUs (provided they all share memory, as discussed above).
Message Passing Interface (MPI) is a communication standard for distributed-memory multiprocessing. While there are other standards, 'MPI' is often used synonymously with distributed-memory parallelism.
Each task has its own exclusive memory. Tasks can be spread across multiple nodes, communicating via an interconnect. This allows MPI jobs to be much larger than shared-memory jobs. It also means that memory requirements are more likely to increase in proportion to the number of CPUs.
Distributed-memory multiprocessing predates shared-memory multiprocessing and is more common in classical high-performance applications (older computers had one CPU per node).
The number of tasks to use is specified by the Slurm option --ntasks. Because the number of tasks that end up on any one node is variable, you should use --mem-per-cpu rather than --mem to ensure each task has enough memory.
Tasks cannot share cores; this means that in most circumstances leaving --cpus-per-task unspecified will get you 2 CPUs per task.
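For example, a minimal distributed-memory submission script might look like the sketch below; my_mpi_program is a placeholder for an MPI-enabled executable, and the account and memory values are assumptions.

#!/bin/bash -e

#SBATCH --job-name    mpi-job
#SBATCH --account     nesi99991
#SBATCH --output      %x.out
#SBATCH --ntasks      4             # four MPI tasks, possibly on different nodes
#SBATCH --mem-per-cpu 512MB         # per-CPU memory, since task placement varies

# srun launches one copy of the executable per task.
srun my_mpi_program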
Using a combination of shared and distributed memory is called hybrid parallelism.
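A hybrid job requests both tasks and CPUs per task. The sketch below assumes a hypothetical MPI+OpenMP executable called my_hybrid_program and the same placeholder account and memory values as above.

#!/bin/bash -e

#SBATCH --job-name      hybrid-job
#SBATCH --account       nesi99991
#SBATCH --output        %x.out
#SBATCH --ntasks        2           # two MPI tasks...
#SBATCH --cpus-per-task 4           # ...each with four threads
#SBATCH --mem-per-cpu   512MB

# Many OpenMP-based programs read this variable to decide how many threads to start.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun my_hybrid_program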
GPGPUs
GPUs compute a large number of simple operations in parallel, making them well suited to graphics processing (hence the name) and other large matrix operations.
On NeSI, GPUs are specialised pieces of hardware that you request in addition to your CPUs and memory.
You can find an up-to-date(ish) list of GPUs available on NeSI in our Support Documentation.
GPUs can be requested using --gpus-per-node=<gpu_type>:<gpu_number>.
Depending on the GPU type, we may also need to specify a partition using --partition.
GPU Job Example
Create a new script called gpu-job.sl
#!/bin/bash -e

#SBATCH --job-name      gpu-job
#SBATCH --account       nesi99991
#SBATCH --output        %x.out
#SBATCH --mem-per-cpu   2G
#SBATCH --gpus-per-node P100:1

module load CUDA
nvidia-smi
then submit with
[yourUsername@mahuika ~]$ sbatch gpu-job.sl
Solution
[yourUsername@mahuika ~]$ cat gpu-job.out
Tue Mar 12 19:40:51 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Job Array
Job arrays are not "multiprocessing" in the same way as the previous two methods. They are ideal for embarrassingly parallel problems, where there are few to no dependencies between the different jobs.
A job array can be thought of less as running a single job in parallel and more as running multiple serial jobs simultaneously. Often this will involve running the same process on multiple inputs.
Embarrassingly parallel jobs should be able to scale without any loss of efficiency. If this type of parallelisation is an option, it will almost certainly be the best choice.
A job array can be specified using --array.
If you are writing your own code, then this is something you will probably have to handle yourself (see the sketch below).
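As a sketch (the account, memory, and input file names are assumptions), each element of the array below receives its own value of SLURM_ARRAY_TASK_ID, which can be used to pick a different input.

#!/bin/bash -e

#SBATCH --job-name    array-job
#SBATCH --account     nesi99991
#SBATCH --output      %x_%a.out     # %a is the array index
#SBATCH --mem-per-cpu 512MB
#SBATCH --array       0-3           # four independent jobs, indices 0 to 3

# Use the array index to select an input file for this element.
echo "Processing input_${SLURM_ARRAY_TASK_ID}.dat"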
How to Utilise Multiple CPUs
Requesting extra resources through Slurm only means that more resources will be available, it does not guarantee your program will be able to make use of them.
Generally speaking, parallelism is either implicit, where the software figures out everything behind the scenes, or explicit, where the software requires extra direction from the user.
Scientific Software
The first step when looking to run particular software should always be to read the documentation. On one end of the scale, some software may claim to make use of multiple cores implicitly, but this should be verified as the methods used to determine available resources are not guaranteed to work.
Some software will require you to specify the number of cores (e.g. -n 8 or -np 16), or even the type of parallelisation (e.g. -dis or -mpi=intelmpi).
Occasionally your input files may require rewriting/regenerating for every new CPU combination (e.g. domain-based parallelism without automatic partitioning).
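One common pattern is to pass the number of CPUs Slurm actually allocated through to the program's own core-count flag. In the sketch below, my_program and its -n option are placeholders for whatever your software's documentation specifies; the account and memory values are also assumptions.

#!/bin/bash -e

#SBATCH --job-name      software-job
#SBATCH --account       nesi99991
#SBATCH --output        %x.out
#SBATCH --cpus-per-task 8
#SBATCH --mem-per-cpu   512MB

# SLURM_CPUS_PER_TASK holds the number of CPUs allocated to this job,
# so the Slurm request and the program's flag cannot get out of sync.
my_program -n ${SLURM_CPUS_PER_TASK} input.dat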
Writing Code
Occasionally, requesting more CPUs in your Slurm job is all that is required, and whatever program you are running will automagically take advantage of the additional resources. However, it is more likely to require some amount of effort on your part.
It is important to determine this before you start requesting more resources through Slurm.
If you are writing your own code, some programming languages will have functions that can make use of multiple CPUs without requiring you to change your code. However, unless that function is where the majority of time is spent, this is unlikely to give you the performance you are looking for.
Python: multiprocessing (not to be confused with threading, which is not really parallel).
MATLAB: parpool
Summary
Name | Other Names | Slurm Option | Pros/cons |
---|---|---|---|
Shared-Memory Parallelism | Multithreading, Multiprocessing | --cpus-per-task | does not have to use the interconnect / can only use at most one node |
Distributed-Memory Parallelism | MPI, OpenMPI | --ntasks and add srun before command | can use any number of nodes / does have to use the interconnect |
Hybrid | | --ntasks and --cpus-per-task and add srun before command | best of both! / can be tricky to set up and get right |
Job Array | | --array | most efficient form of parallelism / not as flexible and requires more manual set-up in the Slurm script |
General Purpose GPU | | --gpus-per-node | GPUs can accelerate your code A LOT / sometimes GPUs are all busy |
Running a Parallel Job
Pick one of the methods of parallelism mentioned above and modify your example.sl script to use this method.
Solution
What does the printout at the start of your job say about the number and location of the nodes?
Key Points
Parallel programming allows applications to take advantage of parallel hardware; serial code will not ‘just work.’
There are multiple ways you can run in parallel; which one is appropriate depends on the nature of your problem and the software you are using.