What is Parallel Computing?

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How do we execute a task in parallel?

  • What benefits arise from parallel execution?

  • What are the limits of gains from execution in parallel?

  • What is the difference between implicit and explicit parallelisation?

Objectives
  • Prepare a job submission script for the parallel executable.

Methods of Parallel Computing

To understand the different types of parallel computing, we first need to clarify some terms.

Figure: Node anatomy

CPU: Unit that does the computations.

Task: One or more CPUs that share memory.

Node: The physical hardware, and the upper limit on how many CPUs a single task can have.

Shared Memory: When multiple CPUs are used within a single task.

Distributed Memory: When multiple tasks are used.

Which methods are available to you largely depends on the nature of your problem and the software being used.

Shared-Memory (SMP)

Shared-memory multiprocessing divides work among threads running on separate CPUs; all of these threads require access to the same memory.

Often called multithreading.

This means that all CPUs must be on the same node; most Mahuika nodes have 72 CPUs.

Shared-memory parallelism is what our example script array_sum.r uses.

The number of threads to use is specified by the Slurm option --cpus-per-task.
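Requesting the CPUs is only half of the story: the program must also be told how many threads to start. For OpenMP-based software this is usually done through the OMP_NUM_THREADS environment variable. A minimal sketch of the relevant lines in a job script (my_threaded_program is a hypothetical executable; real software may take a thread count as a flag instead):

# Start one thread per CPU allocated by Slurm.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_threaded_program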

Shared Memory Example

Create a new script called example_smp.sl

#!/bin/bash -e

#SBATCH --job-name        smp_job
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     500
#SBATCH --cpus-per-task   8

echo "I am task #${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs"

then submit with

[yourUsername@mahuika ~]$ sbatch example_smp.sl

Solution

Checking the output should reveal

[yourUsername@mahuika ~]$ cat smp_job.out
I am task #0 running on node 'wbn224' with 8 CPUs

Distributed-Memory (MPI)

Distributed-memory multiprocessing divides work among tasks; a task may contain multiple CPUs (provided they all share memory, as discussed previously).

Message Passing Interface (MPI) is a communication standard for distributed-memory multiprocessing. While there are other standards, ‘MPI’ is often used synonymously with distributed-memory parallelism.

Each task has its own exclusive memory. Tasks can be spread across multiple nodes, communicating via an interconnect. This allows MPI jobs to be much larger than shared-memory jobs. It also means that memory requirements are more likely to increase proportionally with the number of CPUs.

Distributed-memory multiprocessing predates shared-memory multiprocessing and is more common in classical high-performance applications (older computers had one CPU per node).

The number of tasks to use is specified by the Slurm option --ntasks. Because the number of tasks that end up on any one node is variable, you should use --mem-per-cpu rather than --mem to ensure each task has enough memory.

Tasks cannot share cores, so in most circumstances leaving --cpus-per-task unspecified will get you 2 CPUs (the two hyperthreads of one physical core).
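In a real MPI job the executable is launched once per task with srun; a minimal sketch of the relevant line (my_mpi_program is a hypothetical MPI executable):

# srun starts --ntasks copies of the program, one per task;
# the MPI library handles communication between them.
srun ./my_mpi_program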

Distributed Memory Example

Create a new script called example_mpi.sl

#!/bin/bash -e

#SBATCH --job-name        mpi_job
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     500
#SBATCH --ntasks          4

# Escape the variables and subshells so that each task expands them,
# rather than the batch shell expanding them before srun runs.
srun bash -c "echo \"I am task #\${SLURM_PROCID} running on node '\$(hostname)' with \$(nproc) CPUs\""

then submit with

[yourUsername@mahuika ~]$ sbatch example_mpi.sl

Solution

[yourUsername@mahuika ~]$ cat mpi_job.out
I am task #1 running on node 'wbn012' with 2 CPUs
I am task #3 running on node 'wbn010' with 2 CPUs
I am task #0 running on node 'wbn009' with 2 CPUs
I am task #2 running on node 'wbn063' with 2 CPUs

Using a combination of shared and distributed memory is called hybrid parallelism.

Hybrid Example

Create a new script called example_hybrid.sl

#!/bin/bash -e

#SBATCH --job-name        hybrid_job
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     500
#SBATCH --ntasks          2
#SBATCH --cpus-per-task   4

# Escape the variables and subshells so that each task expands them,
# rather than the batch shell expanding them before srun runs.
srun bash -c "echo \"I am task #\${SLURM_PROCID} running on node '\$(hostname)' with \$(nproc) CPUs\""

then submit with

[yourUsername@mahuika ~]$ sbatch example_hybrid.sl

Solution

[yourUsername@mahuika ~]$ cat hybrid_job.out

I am task #0 running on node 'wbn016' with 4 CPUs
I am task #1 running on node 'wbn022' with 4 CPUs

GPGPUs

GPUs compute a large number of simple operations in parallel, making them well suited to graphics processing (hence the name) and to other workloads dominated by large matrix operations.

On NeSI, GPUs are specialised pieces of hardware that you request in addition to your CPUs and memory.

You can find an up-to-date(ish) list of GPUs available on NeSI in our Support Documentation.

GPUs can be requested using --gpus-per-node=<gpu_type>:<gpu_number>

Depending on the GPU type, we may also need to specify a partition using --partition.
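For example, requesting a different type of GPU might look like the following (the GPU and partition names here are illustrative; check the Support Documentation for current values):

#SBATCH --gpus-per-node   A100:1
#SBATCH --partition       hgx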

GPU Job Example

Create a new script called example_gpu.sl

#!/bin/bash -e

#SBATCH --job-name        gpu_job
#SBATCH --account         nesi99991
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     2G
#SBATCH --gpus-per-node   P100:1

module load CUDA
nvidia-smi  

then submit with

[yourUsername@mahuika ~]$ sbatch example_gpu.sl

Solution

[yourUsername@mahuika ~]$ cat gpu_job.out

Tue Mar 12 19:40:51 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Job Array

Job arrays are not “multiprocessing” in the same way as the previous methods; they are ideal for embarrassingly parallel problems, where there are few or no dependencies between the individual jobs.

They can be thought of less as running a single job in parallel and more as running multiple serial jobs simultaneously. Often this will involve running the same process on multiple inputs.

Embarrassingly parallel jobs should be able to scale without any loss of efficiency. If this type of parallelisation is an option, it will almost certainly be the best choice.

A job array can be specified using the Slurm option --array.

If you are writing your own code, then making each array task work on a different input is probably something you will have to handle yourself, as in the sketch below.
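A common pattern is to use the SLURM_ARRAY_TASK_ID environment variable, which Slurm sets to a different value for each task in the array. A minimal sketch (the inputs/ directory and my_analysis program are hypothetical):

# Each array task (0-3 in the example below) selects a different input file.
INPUT_FILE="inputs/input_${SLURM_ARRAY_TASK_ID}.dat"
./my_analysis "${INPUT_FILE}"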

Job Array Example

Create a new script called example_jobarray.sl

#!/bin/bash -e

#SBATCH --job-name        job_array
#SBATCH --output          %x_%a.out
#SBATCH --mem-per-cpu     500
#SBATCH --array           0-3

echo "I am task #${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs"

then submit with

[yourUsername@mahuika ~]$ sbatch example_jobarray.sl

Solution

[yourUsername@mahuika ~]$ ls
job_array_0.out job_array_1.out job_array_2.out job_array_3.out

Each of which should contain,

[yourUsername@mahuika ~]$ cat job_array_*.out
I am task #0 running on node 'wbn*' with 2 CPUs

How to Utilise Multiple CPUs

Requesting extra resources through Slurm only means that more resources will be available; it does not guarantee your program will be able to make use of them.
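One way to check this after the fact is through Slurm's accounting: if the CPU time actually used (TotalCPU) is much less than the elapsed time multiplied by the number of allocated CPUs, the extra CPUs sat idle. For example (replace <jobid> with the ID of your job):

[yourUsername@mahuika ~]$ sacct -j <jobid> --format=JobID,Elapsed,AllocCPUS,TotalCPU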

Generally speaking, parallelism is either implicit, where the software figures out everything behind the scenes, or explicit, where the software requires extra direction from the user.

Scientific Software

The first step when looking to run particular software should always be to read the documentation. On one end of the scale, some software may claim to make use of multiple cores implicitly, but this should be verified, as the methods used to determine available resources are not guaranteed to work.

Some software will require you to specify the number of cores (e.g. -n 8 or -np 16), or even the type of parallelisation (e.g. -dis or -mpi=intelmpi).
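Where the software takes a core count as a flag, it can help to pass Slurm's allocation straight through rather than hard-coding the number in two places. A minimal sketch (my_software and its -n flag are hypothetical):

# Keep the CPU count the program uses in sync with the Slurm request.
./my_software -n "${SLURM_CPUS_PER_TASK}"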

Occasionally your input files may require rewriting or regenerating for every new CPU combination (e.g. domain-based parallelism without automatic partitioning).

Writing Code

Occasionally, requesting more CPUs in your Slurm job is all that is required, and whatever program you are running will automagically take advantage of the additional resources. However, it is more likely to require some amount of effort on your part.

It is important to determine this before you start requesting more resources through Slurm.

If you are writing your own code, some programming languages will have functions that can make use of multiple CPUs without requiring you to change your code. However, unless the majority of time is spent in those functions, this is unlikely to give you the performance you are looking for.

Python: multiprocessing (not to be confused with threading, which is not truly parallel because of the Global Interpreter Lock).

MATLAB: parpool

Key Points

  • Parallel programming allows applications to take advantage of parallel hardware; serial code will not ‘just work.’

  • There are multiple ways you can run in parallel (shared memory, distributed memory, hybrid, GPGPU, and job arrays); which is appropriate depends on your problem and your software.