Introduction to High-Performance Computing

Working on a remote HPC system

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What is an HPC system?

  • How does an HPC system work?

  • How do I log in to a remote HPC system?

Objectives
  • Connect to a remote HPC system.

  • Understand the general HPC system architecture.

What Is an HPC System?

The words “cloud”, “cluster”, and the phrase “high-performance computing” or “HPC” are used a lot in different contexts and with various related meanings. So what do they mean? And more importantly, how do we use them in our work?

A remote computer is one you cannot access physically and must connect to over a network (as opposed to a local computer).

Cloud refers to remote computing resources that are provisioned to users on demand or as needed.

HPC, High Performance Computer, High Performance Computing or Supercomputer are all general terms for a large or powerful computing resource.

Cluster is a more specific term describing a type of supercomputer comprised of multiple smaller computers (nodes) working together. Almost all supercomputers are clusters.

Access

You will connect to a cluster over the internet, either with a web client (Jupyter) or with SSH (Secure Shell). Your main interface with the cluster will be the command line.

Nodes

Individual computers that compose a cluster are typically called nodes. On a cluster, there are different types of nodes for different types of tasks. The node where you are now will be different depending on how you accessed the cluster.

Most of you (using JupyterHub) will be on an interactive compute node. This is because Jupyter sessions are launched as a job. If you are using SSH to connect to the cluster, you will be on a login node. Both JupyterHub and SSH login nodes serve as an access point to the cluster.

The real work on a cluster gets done by the compute nodes. Compute nodes come in many shapes and sizes, but generally are dedicated to long or hard tasks that require a lot of computational resources.

What’s in a Node?

A node is similar in makeup to a regular desktop or laptop, composed of CPUs (sometimes also called processors or cores), memory (or RAM), and disk space. However, where your laptop might have 8 CPUs and 16GB of memory, a compute node may have hundreds of cores and hundreds of GB of memory.
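To see how the node you are on compares, the sketch below prints its core count, total memory, and free disk space. It assumes a Linux system (it reads /proc/meminfo), and the numbers describe what your session can see, which may not match what a job has been allocated.

```shell
# Number of CPU cores available to this session
cores=$(nproc)
echo "CPU cores: ${cores}"

# Total memory in kB, read from /proc/meminfo (Linux only)
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "Memory: approx. $((mem_kb / 1024 / 1024)) GB"

# Free space on the filesystem that holds the current directory
df -h .
```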

Diagram of node anatomy

Differences Between Nodes

Many HPC clusters have a variety of nodes optimized for particular workloads. Some nodes may have larger amounts of memory, or specialized resources such as Graphics Processing Units (GPUs).

Dedicated Transfer Nodes

If you want to transfer larger amounts of data to or from the cluster, NeSI offers dedicated transfer nodes using the Globus service. More information on using Globus for large data transfer to and from the cluster can be found here: Globus Transfer Service

Key Points

  • An HPC system is a set of networked machines.

  • HPC systems typically provide login nodes and a set of compute nodes.

  • The resources found on independent (compute) nodes can vary in volume and type (amount of RAM, processor architecture, availability of network mounted filesystems, etc.).

  • Files saved on shared storage are available on all nodes.

  • The login node is a shared machine: be considerate of other users.


NeSI Filesystem

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • Where is the best place to store my data?

  • How do I recover deleted files?

  • How do I find out how much disk space I have?

Objectives
  • Learn about the NeSI filesystems, and when to use each one.

The NeSI filesystem looks something like this:

The file system is made up of a root directory that contains sub-directories
titled home, nesi, and system files

The directories that are relevant to us are:

Filesystem  Location                      Default Storage  Default Files  Backup  Access Speed
Home        /home/<username>              20GB             1,000,000      Daily   Normal
Project     /nesi/project/<projectcode>   100GB            100,000        Daily   Normal
Nobackup    /nesi/nobackup/<projectcode>  10TB             1,000,000      None    Fast

Home is for user-specific files such as configuration files, environment setup, source code, etc.
Project is for persistent project-related data, project-related software, etc.
Nobackup is a 'scratch space', for data you don't need to keep long term. Old data is periodically deleted from nobackup.

Managing your data and storage (backups and quotas)

NeSI performs backups of the /home and /nesi/project (persistent) filesystems. However, backups are only captured once per day. So, if you edit or change code or data and then immediately delete it, it likely cannot be recovered. Note that, as the name suggests, NeSI does not back up the /nesi/nobackup filesystem.

Protecting critical data from corruption or deletion is primarily your responsibility. Ensure you have a data management plan and stick to the plan to reduce the chance of data loss. Also important is managing your storage quota. To check your quotas, use the nn_storage_quota command, e.g.:

$ nn_storage_quota
Quota Location                    Available         Used      Use%     State       Inodes        IUsed     IUse%    IState
home_johndoe                            20G       14.51G    72.57%        OK      1000000       112084    11.21%        OK
project_nesi99991                      100G         101G    101.00%       LOCKED  100000           194     0.19%        OK
nobackup_nesi99991                      10T            0     0.00%        OK      1000000           14     0.00%        OK

As well as disk space, ‘inodes’ are tracked; this is the number of files.

Notice that the project space for this user is over quota and has been locked, meaning no more data can be added. When your space is locked you will need to move or remove data. Also note that none of the nobackup space is being used, so data from project could likely be moved to nobackup. nn_storage_quota uses cached data, and so will not immediately show changes to storage use.
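Because nn_storage_quota reports cached figures, it can help to measure a single directory directly. The sketch below uses the standard du and find tools rather than any NeSI-specific command. The helper name count_entries is made up for this example, and the count is only an approximation of inode use.

```shell
# Count every file and directory under a path; each one uses an inode.
# (count_entries is a hypothetical helper, not a NeSI command.)
count_entries() {
    find "$1" | wc -l | tr -d ' '
}

# Disk space used by the current directory and everything below it
du -sh .

echo "Entries (approximate inode use): $(count_entries .)"
```

Running this from inside a project or nobackup directory shows which of your own directories contribute most to the quota, without waiting for the cached report to update.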

For more details on our persistent and nobackup storage systems, including data retention and the nobackup autodelete schedule, please see our Filesystem and Quota documentation.

Working Directory

We will be working from the directory resbaz24.

[yourUsername@mahuika ~]$ cd /nesi/project/nesi99991/resbaz24

Creating directories

As previously mentioned, it is generally useful to organise your work in a hierarchical file structure to make managing and finding files easier. It is especially important when working within a shared directory with colleagues, such as a project, to minimise the chance of accidentally affecting your colleagues’ work. So for this workshop you will each use the mkdir command to make a directory within the workshop directory for you to personally work from.

[yourUsername@mahuika ~]$ mkdir <username>

You should then be able to see your new directory is there using ls.

[yourUsername@mahuika ~]$ ls /nesi/project/nesi99991/resbaz24
 array_sum.r   usr123  usr345

Create a text file

Now let’s create a file. To do this we will use a text editor called Nano to create a file called draft.txt:

We will want to do this from inside the directory we just created.

[yourUsername@mahuika ~]$ cd <username>
[yourUsername@mahuika ~]$ nano draft.txt

Which Editor?

When we say, ‘nano is a text editor’ we really do mean ‘text’: it can only work with plain character data, not tables, images, or any other human-friendly media. We use it in examples because it is one of the least complex text editors. However, because of this trait, it may not be powerful enough or flexible enough for the work you need to do after this workshop. On Unix systems (such as Linux and macOS), many programmers use Emacs or Vim (both of which require more time to learn), or a graphical editor such as Gedit. On Windows, you may wish to use Notepad++. Windows also has a built-in editor called notepad that can be run from the command line in the same way as nano for the purposes of this lesson.

No matter what editor you use, you will need to know where it searches for and saves files. If you start it from the shell, it will (probably) use your current working directory as its default location. If you use your computer’s start menu, it may want to save files in your desktop or documents directory instead. You can change this by navigating to another directory the first time you ‘Save As…’

Let’s type in a few lines of text. Once we’re happy with our text, we can press Ctrl+O (press the Ctrl or Control key and, while holding it down, press the O key) to write our data to disk (we’ll be asked what file we want to save this to: press Return to accept the suggested default of draft.txt).

screenshot of nano text editor in action

Once our file is saved, we can use Ctrl+X to quit the editor and return to the shell.

Control, Ctrl, or ^ Key

The Control key is also called the ‘Ctrl’ key. There are various ways in which using the Control key may be described. For example, you may see an instruction to press the Control key and, while holding it down, press the X key, described as any of:

  • Control-X
  • Control+X
  • Ctrl-X
  • Ctrl+X
  • ^X
  • C-x

In nano, along the bottom of the screen you’ll see ^G Get Help ^O WriteOut. This means that you can use Control-G to get help and Control-O to save your file.

nano doesn’t leave any output on the screen after it exits, but ls now shows that we have created a file called draft.txt:

[yourUsername@mahuika ~]$ ls
draft.txt

Copying files and directories

In a future lesson, we will be running the R script /nesi/project/nesi99991/resbaz24/array_sum.r, but as we can’t all work on the same file at once, you will need to take your own copy. This can be done with the copy command cp. At least two arguments are needed: the file (or directory) you want to copy, and the directory (or file) where you want the copy to be created. We will be copying the file into the directory we made previously. As this should be your current directory, the second argument can be a simple . (a single dot, meaning the current directory).

[yourUsername@mahuika ~]$ cp /nesi/project/nesi99991/resbaz24/array_sum.r  .

We can check that it did the right thing using ls

[yourUsername@mahuika ~]$ ls
draft.txt   array_sum.r 
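The different forms of cp described above can be tried safely in a throwaway directory. This is a minimal sketch: the directory names source_dir and my_dir and the stand-in file contents are invented for the example, and nothing here touches the workshop project directory.

```shell
# Work in a temporary directory so nothing real is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-ins for the shared directory and your own directory
mkdir source_dir my_dir
echo "# placeholder script" > source_dir/array_sum.r

cd my_dir

# Copy a file into the current directory, keeping its name
cp ../source_dir/array_sum.r .

# Copy a file and give the copy a new name
cp ../source_dir/array_sum.r array_sum_backup.r

# Copy a whole directory; -r is needed for directories
cp -r ../source_dir source_copy

ls
```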

Key Points


Accessing software via Modules

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How do we load and unload software packages?

Objectives
  • Load and use a software package.

  • Explain how the shell environment changes when the module mechanism loads or unloads packages.

On a high-performance computing system, it is seldom the case that the software we want to use is available when we log in. It is installed, but we will need to “load” it before it can run.

Before we start using individual software packages, however, we should understand the reasoning behind this approach. The three biggest factors are:

Software incompatibility is a major headache for programmers. Sometimes the presence (or absence) of a software package will break others that depend on it. Two of the most famous examples are Python 2 and 3 and C compiler versions. Python 3 famously provides a python command that conflicts with that provided by Python 2. Software compiled against a newer version of the C libraries and then used when they are not present will result in a nasty 'GLIBCXX_3.4.20' not found error, for instance.

Software versioning is another common issue. A team might depend on a certain package version for their research project. If the software version were to change (for instance, if a package was updated), it might affect their results. Having access to multiple software versions allows a set of researchers to prevent software versioning issues from affecting their results.

Dependencies are where a particular software package (or even a particular version) depends on having access to another software package (or even a particular version of another software package). For example, the VASP materials science software may depend on having a particular version of the FFTW (Fastest Fourier Transform in the West) software library available for it to work.

Environment

Before understanding environment modules we first need to understand what is meant by environment.

The environment is defined by its environment variables.

Environment variables are writable named variables.

We can assign a variable named “FOO” with the value “bar” using the syntax:

[yourUsername@mahuika ~]$ FOO="bar"

Convention is to name fixed variables in all caps.

Our new variable can be referenced using $FOO. You could also use ${FOO}; enclosing a variable name in curly brackets is good practice as it avoids possible ambiguity.

[yourUsername@mahuika ~]$ $FOO
-bash: bar: command not found

We got an error here because the variable is evaluated by the shell and the result is then executed as a command. If we just want to print the variable we can use the command:

[yourUsername@mahuika ~]$ echo $FOO
bar
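The ambiguity that curly brackets avoid shows up when a variable is followed directly by other text. In this small sketch the suffix _file is made up for illustration.

```shell
FOO="bar"

# Without brackets the shell looks for a variable called FOO_file,
# which is not set, so this prints an empty line.
echo "$FOO_file"

# With brackets the variable name ends at the closing bracket.
echo "${FOO}_file"    # prints: bar_file
```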

We can get a full list of environment variables using the command:

[yourUsername@mahuika ~]$ env
[removed some lines for clarity]
EBROOTTCL=/opt/nesi/CS400_centos7_bdw/Tcl/8.6.10-GCCcore-9.2.0
CPUARCH_STRING=bdw
TERM=xterm-256color
SHELL=/bin/bash
EBROOTGCCCORE=/opt/nesi/CS400_centos7_bdw/GCCcore/9.2.0
EBDEVELFREETYPE=/opt/nesi/CS400_centos7_bdw/freetype/2.10.1-GCCcore-9.2.0/easybuild/freetype-2.10.1-GCCcore-9.2.0-easybuild-devel
HISTSIZE=10000
XALT_EXECUTABLE_TRACKING=yes
MODULEPATH_ROOT=/usr/share/modulefiles
LMOD_SYSTEM_DEFAULT_MODULES=NeSI
SSH_CLIENT=192.168.94.65 45946 22
EBDEVELMETIS=/opt/nesi/CS400_centos7_bdw/METIS/5.1.0-GCCcore-9.2.0/easybuild/METIS-5.1.0-GCCcore-9.2.0-easybuild-devel
XALT_DIR=/opt/nesi/CS400_centos7_bdw/XALT/current
LMOD_PACKAGE_PATH=/opt/nesi/share/etc/lmod

These variables control many aspects of how your terminal, and any software launched from your terminal, works.
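One detail worth knowing: a variable assigned with FOO="bar" is only visible to the current shell until it is exported. Only exported variables appear in env and are passed on to programs you launch. The sketch below shows the difference by starting a child bash process.

```shell
# Start from a clean slate in case FOO is already set
unset FOO

# A plain assignment creates a shell variable
FOO="bar"
bash -c 'echo "child sees: ${FOO:-<unset>}"'    # prints: child sees: <unset>

# export marks it as an environment variable, inherited by child processes
export FOO
bash -c 'echo "child sees: ${FOO:-<unset>}"'    # prints: child sees: bar

# It now also appears in the output of env
env | grep '^FOO='                              # prints: FOO=bar
```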

Environment Modules

Environment modules are the solution to these problems. A module is a self-contained description of a software package – it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.

There are a number of different environment module implementations commonly used on HPC systems: the two most common are TCL modules and Lmod. Both of these use similar syntax and the concepts are the same so learning to use one will allow you to use whichever is installed on the system you are using. In both implementations the module command is used to interact with environment modules. An additional subcommand is usually added to the command to specify what you want to do. For a list of subcommands you can use module -h or module help. As for all commands, you can access the full help on the man pages with man module.

Purging Modules

Depending on how you are accessing the HPC, the modules you have loaded by default will be different. So before we start listing our modules we will first use the module purge command to clear all but the minimum default modules, so that we are all starting with the same modules.

[yourUsername@mahuika ~]$ module purge

The following modules were not unloaded:
   (Use "module --force purge" to unload all):

  1) XALT/minimal   2) slurm   3) NeSI

Note that module purge is informative. It lets us know that all but a minimal default set of packages have been unloaded (and how to actually unload these if we truly so desired).

We are able to unload individual modules. Unfortunately, within the NeSI system this does not always unload their dependencies. We therefore recommend module purge to bring you back to a state where only those modules needed to perform your normal work on the cluster are loaded.

module purge is a useful tool for ensuring repeatable research by guaranteeing that the environment that you build your software stack from is always the same. This is important since some modules have the potential to silently affect your results if they are loaded (or not loaded).

Listing Available Modules

To see available software modules, use module avail:

[yourUsername@mahuika ~]$ module avail
-----------------/opt/nesi/CS400_centos7_bdw/modules/all ------------------
  Flye/2.9-gimkl-2020a-Python-3.8.2      (D)    PyQt/5.10.1-gimkl-2018b-Python-3.7.3
  fmlrc/1.0.0-GCC-9.2.0                         PyQt/5.12.1-gimkl-2018b-Python-2.7.16
  fmt/7.1.3-GCCcore-9.2.0                       PyQt/5.12.1-gimkl-2020a-Python-3.8.2   (D) 
  fmt/8.0.1                              (D)    pyspoa/0.0.8-gimkl-2018b-Python-3.8.1
  fontconfig/2.12.1-gimkl-2017a                 Python-Geo/2.7.14-gimkl-2017a
  fontconfig/2.13.1-GCCcore-7.4.0               Python-Geo/2.7.16-gimkl-2018b
  fontconfig/2.13.1-GCCcore-9.2.0        (D)    Python-Geo/3.6.3-gimkl-2017a
  forge/19.0                                    Python-Geo/3.7.3-gimkl-2018b
  forge/20.0.2                           (D)    Python-Geo/3.8.2-gimkl-2020a
  FoX/4.1.2-intel-2018b                         Python-Geo/3.9.5-gimkl-2020a           (D)
  FragGeneScan/1.31-gimkl-2018b                 Python-GPU/3.6.3-gimkl-2017a
  FreeBayes/1.1.0-gimkl-2017a                   Python/2.7.14-gimkl-2017a
  FreeBayes/1.3.1-GCC-7.4.0                     Python/2.7.16-gimkl-2018b
  FreeBayes/1.3.2-GCC-9.2.0              (D)    Python/2.7.16-intel-2018b
  freetype/2.7.1-gimkl-2017a                    Python/2.7.18-gimkl-2020a
  freetype/2.9.1-GCCcore-7.4.0                  Python/3.6.3-gimkl-2017a
  freetype/2.10.1-GCCcore-9.2.0          (D)    Python/3.7.3-gimkl-2018b
  FreeXL/1.0.2-gimkl-2017a                      Python/3.8.1-gimkl-2018b
  FreeXL/1.0.5-GCCcore-7.4.0             (D)    Python/3.8.2-gimkl-2020a                 (D) 
  FreeXL/1.0.5-GCCcore-9.2.0                    Python/3.9.5-gimkl-2020a
  FriBidi/1.0.10-GCCcore-9.2.0                  qcat/1.1.0-gimkl-2020a-Python-3.8.2

[removed most of the output here for clarity]

----------------------------------- /cm/local/modulefiles -----------------------------------
   cluster-tools/8.0    freeipmi/1.5.5     module-git     openmpi/mlnx/gcc/64/2.1.2a1
   cmd                  gcc/6.3.0          module-info    shared
   cuda-dcgm/1.3.3.1    ipmitool/1.8.18    null
   dot                  lua/5.3.4          openldap

  Where:
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the
"keys".

Listing Currently Loaded Modules

You can use the module list command to see which modules you currently have loaded in your environment. On mahuika you will have a few default modules loaded when you log in.

[yourUsername@mahuika ~]$ module list

Currently Loaded Modules:
  1) XALT/minimal   2) slurm   3) NeSI (S)

Loading and Unloading Software

You can load software using the module load command. In this example we will be using the programming language R.

Initially, R is not loaded. We can test this by using the which command. which looks for programs the same way that Bash does, so we can use it to tell us where a particular piece of software is stored.

[yourUsername@mahuika ~]$ which R
/usr/bin/which: no R in (/opt/nesi/CS400_centos7_bdw/XALT/current/bin:/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/bin:/opt/nesi/CS400_centos7_bdw/OpenSSL/1.1.1k-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/Tk/8.6.10-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/Tcl/8.6.10-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/SQLite/3.36.0-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/netCDF/4.8.1-gimpi-2022a/bin:/opt/nesi/CS400_centos7_bdw/cURL/7.83.1-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/libxslt/1.1.34-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/libxml2/2.9.10-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/ncurses/6.2-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/libjpeg-turbo/2.1.3-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/HDF5/1.12.2-gimpi-2022a/bin:/opt/nesi/CS400_centos7_bdw/freetype/2.11.1-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/libpng/1.6.37-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/XZ/5.2.5-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/bzip2/1.0.8-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/impi/2021.5.1-GCC-11.3.0/mpi/2021.5.1/libfabric/bin:/opt/nesi/CS400_centos7_bdw/impi/2021.5.1-GCC-11.3.0/mpi/2021.5.1/bin:/opt/nesi/CS400_centos7_bdw/UCX/1.12.1-GCC-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/numactl/2.0.14-GCC-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/binutils/2.38-GCCcore-11.3.0/bin:/opt/nesi/CS400_centos7_bdw/GCCcore/11.3.0/bin:/opt/slurm/sbin:/opt/slurm/bin:/opt/nesi/share/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)

The important bit here being:

/usr/bin/which: no R in (...)

Now let’s try loading the R environment module, and try again.

[yourUsername@mahuika ~]$ module load R
[yourUsername@mahuika ~]$ which R
/opt/nesi/CS400_centos7_bdw/R/4.2.1-gimkl-2022a/bin/R

Tab Completion

The module command also supports tab completion. You may find this the easiest way to find the right software.

So, what just happened?

To understand the output, first we need to understand the nature of the $PATH environment variable. $PATH is a special environment variable that controls where a UNIX system looks for software. Specifically $PATH is a list of directories (separated by :) that the OS searches through for a command before giving up and telling us it can’t find it. As with all environment variables we can print it out using echo.

[yourUsername@mahuika ~]$ echo $PATH 
/opt/nesi/CS400_centos7_bdw/XALT/current/bin:/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/bin:/opt/nesi/CS400_centos7_bdw/Tk/8.6.10-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/Tcl/8.6.10-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/SuiteSparse/5.6.0-gimkl-2020a-METIS-5.1.0/bin:/opt/nesi/CS400_centos7_bdw/METIS/5.1.0-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/SQLite/3.31.1-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/netCDF/4.7.3-gimpi-2020a/bin:/opt/nesi/CS400_centos7_bdw/PCRE/8.43-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/cURL/7.64.0-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/libxslt/1.1.34-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/libxml2/2.9.10-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/ncurses/6.1-GCCcore-9.2.0/bin:/opt/nesi/mahuika/libjpeg-turbo/2.0.2-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/HDF5/1.10.5-gimpi-2020a/bin:/opt/nesi/CS400_centos7_bdw/freetype/2.10.1-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/libpng/1.6.37-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/XZ/5.2.4-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/bzip2/1.0.8-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/impi/2019.6.166-GCC-9.2.0/intel64/libfabric/bin:/opt/nesi/CS400_centos7_bdw/impi/2019.6.166-GCC-9.2.0/intel64/bin:/opt/nesi/CS400_centos7_bdw/binutils/2.32-GCCcore-9.2.0/bin:/opt/nesi/CS400_centos7_bdw/GCCcore/9.2.0/bin:/home/harrellw/bin:/home/harrellw/.local/bin:/home/harrellw/apps/bin:/usr/lpp/mmfs/bin:/opt/slurm/sbin:/opt/slurm/bin:/opt/nesi/share/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/nesi/bin:/opt/ibutils/bin

We can improve the readability of this output slightly by replacing the colon delimiters (:) with newline (\n) characters.

[yourUsername@mahuika ~]$ echo $PATH | tr ":" "\n"
/opt/nesi/CS400_centos7_bdw/XALT/current/bin
/opt/nesi/CS400_centos7_bdw/R/4.2.1-gimkl-2022a/bin
/opt/nesi/CS400_centos7_bdw/nodejs/16.15.1-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/Java/17
/opt/nesi/CS400_centos7_bdw/Java/17/bin
/opt/nesi/CS400_centos7_bdw/PCRE2/10.40-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/bin
/opt/nesi/CS400_centos7_bdw/OpenSSL/1.1.1k-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/Tk/8.6.10-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/Tcl/8.6.10-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/SQLite/3.36.0-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/netCDF/4.8.1-gimpi-2022a/bin
/opt/nesi/CS400_centos7_bdw/cURL/7.83.1-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/libxslt/1.1.34-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/libxml2/2.9.10-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/ncurses/6.2-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/libjpeg-turbo/2.1.3-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/HDF5/1.12.2-gimpi-2022a/bin
/opt/nesi/CS400_centos7_bdw/freetype/2.11.1-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/libpng/1.6.37-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/XZ/5.2.5-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/bzip2/1.0.8-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/impi/2021.5.1-GCC-11.3.0/mpi/2021.5.1/libfabric/bin
/opt/nesi/CS400_centos7_bdw/impi/2021.5.1-GCC-11.3.0/mpi/2021.5.1/bin
/opt/nesi/CS400_centos7_bdw/UCX/1.12.1-GCC-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/numactl/2.0.14-GCC-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/binutils/2.38-GCCcore-11.3.0/bin
/opt/nesi/CS400_centos7_bdw/GCCcore/11.3.0/bin
/opt/slurm/sbin
/opt/slurm/bin
/opt/nesi/share/bin
/usr/local/bin
/usr/bin
/usr/local/sbin
/usr/sbin
/opt/ibutils/bin
/opt/nesi/vdt
/opt/nesi/bin

You’ll notice a similarity to the output of the which command. However, in this case, there are a lot more directories at the beginning. When we ran the module load command, it added many directories to the beginning of our $PATH.
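The effect of putting a directory at the front of $PATH can be reproduced without the module system. This is only an illustration of the mechanism, not how the module command is implemented, and demo_tool is a made-up program created just for the example.

```shell
# Create a tiny executable in a temporary directory
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from demo tool"\n' > "$demo_dir/demo_tool"
chmod +x "$demo_dir/demo_tool"

# Before the directory is on PATH, the shell cannot find the program
command -v demo_tool || echo "demo_tool not found yet"

# Prepend the directory, much as loading a module prepends its bin directory
PATH="$demo_dir:$PATH"

# Now the shell finds it, and it can be run by name
command -v demo_tool
demo_tool
```

Because directories are searched from left to right, whatever is prepended last takes priority over anything with the same name later in the list.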

The path to the NeSI XALT utility will normally show up first. This helps us track software usage, but the more important directory is the second one: /opt/nesi/CS400_centos7_bdw/R/4.2.1-gimkl-2022a/bin. Let’s examine what’s there:

[yourUsername@mahuika ~]$ ls /opt/nesi/CS400_centos7_bdw/R/4.2.1-gimkl-2022a/bin
R  Rscript  

module load “loads” not only the specified software but also its software dependencies, that is, the other software that the application you load requires in order to run.

To demonstrate, let’s use module list.

[yourUsername@mahuika ~]$ module list
Currently Loaded Modules:
  1) XALT/minimal                       8) GCC/11.3.0                 15) gimpi/2022a                     22) libreadline/8.1-GCCcore         29) Java/17
  2) slurm                              9) libpmi/2-slurm             16) imkl-FFTW/2022.0.2-gimpi-2022a  23) libpng/1.6.37-GCCcore           30) nodejs/16.15.1-GCCcore-11.3.0
  3) NeSI                         (S)  10) numactl/2.0.14-GCC-11.3.0  17) gimkl/2022a                     24) libxml2/2.9.10-GCCcore-11.3.0   31) OpenSSL/1.1.1k-GCCcore-11.3.0
  4) LegacySystemLibs/7                11) UCX/1.12.1-GCC-11.3.0      18) bzip2/1.0.8-GCCcore-11.3.0      25) SQLite/3.36.0-GCCcore-11.3.0    32) R/4.2.1-gimkl-2022a
  5) GCCcore/11.3.0                    12) impi/2021.5.1-GCC-11.3.0   19) XZ/5.2.5-GCCcore-11.3.0         26) cURL/7.83.1-GCCcore-11.3.0
  6) zlib/1.2.11-GCCcore-11.3.0        13) AlwaysIntelMKL/1.0         20) PCRE2/10.40-GCCcore-11.3.0      27) NLopt/2.7.0-GCC-11.3.0
  7) binutils/2.38-GCCcore-11.3.0      14) imkl/2022.0.2              21) ncurses/6.2-GCCcore-11.3.0      28) GMP/6.2.1-GCCcore-11.3.0

Notice that our initial list of modules has increased by 29, from 3 to 32. When we loaded R, it also loaded all of its dependencies along with all the dependencies of those modules.

Before moving onto the next session let’s use module purge again to return to the minimal environment.

[yourUsername@mahuika ~]$ module purge
The following modules were not unloaded:
   (Use "module --force purge" to unload all):

  1) XALT/minimal   2) slurm   3) NeSI

Software Versioning

So far, we’ve learned how to load and unload software packages. However, we have not yet addressed the issue of software versioning. At some point or other, you will run into issues where only one particular version of some software will be suitable. Perhaps a key bugfix only happened in a certain version, or version X broke compatibility with a file format you use. In either of these example cases, it helps to be very specific about what software is loaded.

Let’s examine the output of module avail more closely.

[yourUsername@mahuika ~]$ module avail
-----------------/opt/nesi/CS400_centos7_bdw/modules/all ------------------
  Flye/2.9-gimkl-2020a-Python-3.8.2      (D)    PyQt/5.10.1-gimkl-2018b-Python-3.7.3
  fmlrc/1.0.0-GCC-9.2.0                         PyQt/5.12.1-gimkl-2018b-Python-2.7.16
  fmt/7.1.3-GCCcore-9.2.0                       PyQt/5.12.1-gimkl-2020a-Python-3.8.2   (D) 
  fmt/8.0.1                              (D)    pyspoa/0.0.8-gimkl-2018b-Python-3.8.1
  fontconfig/2.12.1-gimkl-2017a                 Python-Geo/2.7.14-gimkl-2017a
  fontconfig/2.13.1-GCCcore-7.4.0               Python-Geo/2.7.16-gimkl-2018b
  fontconfig/2.13.1-GCCcore-9.2.0        (D)    Python-Geo/3.6.3-gimkl-2017a
  forge/19.0                                    Python-Geo/3.7.3-gimkl-2018b
  forge/20.0.2                           (D)    Python-Geo/3.8.2-gimkl-2020a
  FoX/4.1.2-intel-2018b                         Python-Geo/3.9.5-gimkl-2020a           (D)
  FragGeneScan/1.31-gimkl-2018b                 Python-GPU/3.6.3-gimkl-2017a
  FreeBayes/1.1.0-gimkl-2017a                   Python/2.7.14-gimkl-2017a
  FreeBayes/1.3.1-GCC-7.4.0                     Python/2.7.16-gimkl-2018b
  FreeBayes/1.3.2-GCC-9.2.0              (D)    Python/2.7.16-intel-2018b
  freetype/2.7.1-gimkl-2017a                    Python/2.7.18-gimkl-2020a
  freetype/2.9.1-GCCcore-7.4.0                  Python/3.6.3-gimkl-2017a
  freetype/2.10.1-GCCcore-9.2.0          (D)    Python/3.7.3-gimkl-2018b
  FreeXL/1.0.2-gimkl-2017a                      Python/3.8.1-gimkl-2018b
  FreeXL/1.0.5-GCCcore-7.4.0             (D)    Python/3.8.2-gimkl-2020a                 (D) 
  FreeXL/1.0.5-GCCcore-9.2.0                    Python/3.9.5-gimkl-2020a
  FriBidi/1.0.10-GCCcore-9.2.0                  qcat/1.1.0-gimkl-2020a-Python-3.8.2

[removed most of the output here for clarity]

----------------------------------- /cm/local/modulefiles -----------------------------------
   cluster-tools/8.0    freeipmi/1.5.5     module-git     openmpi/mlnx/gcc/64/2.1.2a1
   cmd                  gcc/6.3.0          module-info    shared
   cuda-dcgm/1.3.3.1    ipmitool/1.8.18    null
   dot                  lua/5.3.4          openldap

  Where:
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the
"keys".

Let’s take a closer look at the Python modules. There are many applications that are run using Python and may fail to run if the wrong version is loaded. In this case, there are many different versions: Python/3.6.3-gimkl-2017a, Python/3.7.3-gimkl-2018b through to the newest versions.

How do we load each copy and which copy is the default?

In this case, Python/3.8.2-gimkl-2020a has a (D) next to it. This indicates that it is the default: if we type module load Python, this is the copy that will be loaded.

[yourUsername@mahuika ~]$ module load Python
[yourUsername@mahuika ~]$ python3 --version
Python 3.8.2

So how do we load a non-default copy of a software package? The only change we need to make to our module load command is to be more specific about the module we are loading, by adding the version after the /.

[yourUsername@mahuika ~]$ module load Python/3.9.5-gimkl-2020a
The following have been reloaded with a version change:
  1) Python/3.8.2-gimkl-2020a => Python/3.9.5-gimkl-2020a

Notice how the module command has swapped out versions of the Python module. And now we test which version we are using:

[yourUsername@mahuika ~]$ python3 --version
Python 3.9.5

We are now left with only those modules required to do our work for this project.

[yourUsername@mahuika ~]$ module list
  1) XALT/minimal                       20) HDF5/1.10.5-gimpi-2020a
  2) slurm                              21) libjpeg-turbo/2.0.2-GCCcore-9.2.0
  3) NeSI                          (S)  22) ncurses/6.1-GCCcore-9.2.0
  4) craype-broadwell                   23) libreadline/8.0-GCCcore-9.2.0
  5) craype-network-infiniband          24) libxml2/2.9.10-GCCcore-9.2.0
  6) GCCcore/9.2.0                      25) libxslt/1.1.34-GCCcore-9.2.0
  7) zlib/1.2.11-GCCcore-9.2.0          26) cURL/7.64.0-GCCcore-9.2.0
  8) binutils/2.32-GCCcore-9.2.0        27) PCRE/8.43-GCCcore-9.2.0
  9) GCC/9.2.0                          28) netCDF/4.7.3-gimpi-2020a
 10) libpmi                             29) SQLite/3.31.1-GCCcore-9.2.0
 11) impi/2019.6.166-GCC-9.2.0          30) METIS/5.1.0-GCCcore-9.2.0
 12) gimpi/2020a                        31) tbb/2019_U9-GCCcore-9.2.0
 13) imkl/2020.0.166-gimpi-2020a        32) SuiteSparse/5.6.0-gimkl-2020a-METIS-5.1.0
 14) gimkl/2020a                        33) Tcl/8.6.10-GCCcore-9.2.0
 15) bzip2/1.0.8-GCCcore-9.2.0          34) Tk/8.6.10-GCCcore-9.2.0
 16) XZ/5.2.4-GCCcore-9.2.0             35) LLVM/10.0.1-GCCcore-9.2.0
 17) libpng/1.6.37-GCCcore-9.2.0        36) OpenSSL/1.1.1k-GCCcore-9.2.0
 18) freetype/2.10.1-GCCcore-9.2.0      37) Python/3.9.5-gimkl-2020a
 19) Szip/2.1.1-GCCcore-9.2.0

Key Points

  • Load software with module load softwareName.

  • Unload software with module unload.

  • The module system handles software versioning and package conflicts for you automatically.


Morning Break



Scheduler Fundamentals

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • What is a scheduler and why does a cluster need one?

  • How do I launch a program to run on a compute node in the cluster?

  • How do I capture the output of a program that is run on a node in the cluster?

Objectives
  • Run a simple script on the login node, and through the scheduler.

  • Use the batch system command line tools to monitor the execution of your job.

  • Inspect the output and error files of your jobs.

  • Find the right place to put large datasets on the cluster.

Job Scheduler

An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.

The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to wait in a queue to get into a popular restaurant, then you may understand why your jobs sometimes do not start instantly as they would on your laptop.

[Figure: Compare a job scheduler to a waiter in a restaurant]

The scheduler used in this lesson is Slurm. Although Slurm is not used everywhere, running jobs is quite similar regardless of what software is being used. The exact syntax might change, but the concepts remain the same.

Interactive vs Batch

So far, whenever we have entered a command into our terminals, we have received the response immediately in the same terminal. This is said to be an interactive session.

This is all well for doing small tasks, but what if we want to do several things one after another without waiting in between? Or what if we want to repeat a series of commands again later?

This is where batch processing becomes useful. Instead of entering commands directly into the terminal, we write them down in a text file or script. Then, the script can be executed by calling it with bash.

Let’s try this now. Create and open a new file in your current directory called example-job.sh (if you prefer a text editor other than nano, feel free to use that). We will put to use some things we have learnt so far.

[yourUsername@mahuika ~]$ nano example-job.sh
#!/bin/bash -e

module purge
module load R/4.3.1-gimkl-2022a
Rscript  array_sum.r 
echo "Done!"

shebang

A shebang (or shabang, also called a hashbang) is the character sequence #!, the number sign (aka hash) followed by an exclamation mark (aka bang), at the beginning of a script. It tells the system which interpreter will be used to run the script. In this case we will be using the Bash shell, which can be found at the path /bin/bash. The job scheduler will give you an error if your script does not start with a shebang.

We can now run this script using

[yourUsername@mahuika ~]$ bash example-job.sh
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
[1] "Using 1 cpus to sum [ 2.000000e+04 x 2.000000e+04 ] matrix."
[1] "0% done..."
...
[1] "99% done..."
[1] "100% done..."
[1] "Sum is '10403.632886'."
Done!

You will get the output printed to your terminal as if you had just run those commands one after another.

Cancelling Commands

You can kill a currently running task by pressing the keys ctrl + c. If you just want your terminal back, but want the task to continue running, you can ‘background’ it by pressing ctrl + z to suspend it and then running the command bg. Note, a backgrounded task is still attached to your terminal session, and will be killed when you close the terminal (if you need to keep running a task after you log out, have a look at tmux).
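
A task can also be started in the background from the outset by ending the command with &. The sketch below is illustrative only: sleep 30 stands in for a long-running task, and bg_pid is a name chosen for this example.

```shell
# Start a stand-in long-running task in the background with '&'.
sleep 30 &
bg_pid=$!                 # $! holds the process ID of the last background command

# The prompt returns immediately; the task keeps running.
echo "Background task running with PID ${bg_pid}"

# Stop the task when it is no longer needed (the equivalent of ctrl + c).
kill "$bg_pid"
wait "$bg_pid" 2>/dev/null || true   # collect the exit status of the killed task
```

Keeping the process ID in a variable means the task can be stopped later without having to look it up.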

Scheduled Batch Job

Up until now the scheduler has not been involved; our scripts were run directly on the login node (or Jupyter node).

First, let’s rename our batch script to clarify that we intend to run it through the scheduler.

mv example-job.sh example-job.sl

File Extensions

A file’s extension in this case does not in any way affect how a script is read; it is just another part of the name used to remind users what type of file it is. Some common conventions:
.sh: Shell Script.
.sl: Slurm Script, a script that includes a slurm header and is intended to be submitted to the cluster.
.out: Commonly used to indicate the file contains the stdout of some process.
.err: Same as .out but for stderr.

In order for the job scheduler to do its job we need to provide a bit more information about our script. This is done by specifying Slurm parameters in our batch script. Each of these parameters must be preceded by the special token #SBATCH and placed after the shebang, but before the content of the rest of your script.

[Figure: A Slurm script is a regular bash script with a Slurm header after the shebang]

These parameters tell Slurm how the script should be run, such as the memory, cores and time required.

All the parameters available can be found by checking man sbatch or on the online slurm documentation.

Option | Example | Description
--- | --- | ---
--job-name | #SBATCH --job-name=MyJob | The name that will appear when using squeue or sacct.
--account | #SBATCH --account=nesi99991 | The account your core hours will be 'charged' to.
--time | #SBATCH --time=DD-HH:MM:SS | Job max walltime.
--mem | #SBATCH --mem=1500M | Memory required per node.
--output | #SBATCH --output=%j_output.out | Path and name of standard output file.
--ntasks | #SBATCH --ntasks=2 | Will start 2 MPI tasks.
--cpus-per-task | #SBATCH --cpus-per-task=10 | Will request 10 logical CPUs per task. See Hyperthreading.
--partition | #SBATCH --partition=milan | Requests that your script run on a specific subsection of the cluster. The scheduler will generally try to determine this for you based on the resources requested, but you may need to set this manually.

Comments

Comments in UNIX shell scripts (denoted by #) are ignored by the bash interpreter. Why is it that we start our slurm parameters with # if it is going to be ignored?

Solution

Commented lines are ignored by the bash interpreter, but they are not ignored by slurm. The #SBATCH parameters are read by slurm when we submit the job. When the job starts, the bash interpreter will ignore all lines starting with #.

This is similar to the shebang mentioned earlier. When you run your script, the system looks at the #!, then uses the program at the subsequent path to interpret the script, in our case /bin/bash (the program ‘bash’ found in the ‘bin’ directory).
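
The first half of this answer can be checked without the scheduler. The sketch below writes a throwaway script containing #SBATCH lines and runs it with bash; the job name comment-demo and the variable names are made up for this illustration.

```shell
# Write a small script with a Slurm header to a temporary file.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash -e
#SBATCH --job-name=comment-demo
#SBATCH --time=00:01:00
echo "Only this line produces output"
EOF

# Run it directly with bash: the #SBATCH lines are treated as plain comments.
demo_output=$(bash "$demo_script")
echo "$demo_output"

# Tidy up the temporary file.
rm -f "$demo_script"
```

Run this way, none of the requested limits apply; they only take effect when the script is submitted to Slurm.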

Note that just requesting these resources does not make your job run faster, nor does it necessarily mean that you will consume all of these resources. It only means that these are made available to you. Your job may end up using less memory, or less time, or fewer tasks or nodes, than you have requested, and it will still run.

It’s best if your requests accurately reflect your job’s requirements. We’ll talk more about how to make sure that you’re using resources effectively in a later episode of this lesson.

Now, rather than running our script with bash we submit it to the scheduler using the command sbatch (slurm batch).

[yourUsername@mahuika ~]$ sbatch example-job.sl
Submitted batch job 23137702

And that’s all we need to do to submit a job. Our work is done – now the scheduler takes over and tries to run the job for us.

Checking on Running/Pending Jobs

While the job is waiting to run, it goes into a list of jobs called the queue. To check on our job’s status, we check the queue using the command squeue (slurm queue). We will need to filter to see only our jobs, by including either the flag --user <username> or --me.

[yourUsername@mahuika ~]$ squeue --me
JOBID   USER         ACCOUNT   NAME           CPUS MIN_MEM PARTITI START_TIME  TIME_LEFT STATE    NODELIST(REASON)
231964  yourUsername nesi99991 example-job.sl 1    512M     large   N/A        1:00     PENDING  (Priority)

We can see many details about our job, the most important being its STATE. The most common states you might see are..

Cancelling Jobs

Sometimes we’ll make a mistake and need to cancel a job. This can be done with the scancel command.

In order to cancel the job, we will first need its ‘JobId’. This can be found in the output of ‘squeue --me’.

[yourUsername@mahuika ~]$ scancel 231964

A clean return of your command prompt indicates that the request to cancel the job was successful.

Now checking squeue again, the job should be gone.

[yourUsername@mahuika ~]$ squeue --me
JOBID   USER         ACCOUNT   NAME           CPUS MIN_MEM PARTITI START_TIME  TIME_LEFT STATE    NODELIST(REASON)

(If it isn’t, wait a few seconds and try again.)

Cancelling multiple jobs

We can also cancel all of our jobs at once using the -u option. This will cancel all jobs for a specific user (in this case, yourself). Note that you can only cancel your own jobs.

Try submitting multiple jobs and then cancelling them all.

Solution

First, submit a trio of jobs:

[yourUsername@mahuika ~]$ sbatch  example-job.sl
[yourUsername@mahuika ~]$ sbatch  example-job.sl
[yourUsername@mahuika ~]$ sbatch  example-job.sl

Then, cancel them all:

[yourUsername@mahuika ~]$ scancel --user yourUsername

Checking Finished Jobs

There is another command, sacct (Slurm accounting), that includes jobs that have finished. By default sacct only includes jobs submitted by you, so there is no need to include additional flags at this point.

[yourUsername@mahuika ~]$ sacct
JobID           JobName          Alloc     Elapsed     TotalCPU  ReqMem   MaxRSS State      
--------------- ---------------- ----- ----------- ------------ ------- -------- ---------- 
31060451        example-job.sl       2    00:00:48    00:33.548      1G          CANCELLED  
31060451.batch  batch                2    00:00:48    00:33.547          102048K CANCELLED  
31060451.extern extern               2    00:00:48     00:00:00                0 CANCELLED  

Note that despite the fact that we have only run one job, there are three lines shown. This is because each job step is also shown. This can be suppressed using the flag -X.

Where’s the Output?

On the login node, when we ran the bash script, the output was printed to the terminal. Slurm batch job output is typically redirected to a file. By default this will be a file named slurm-<job-id>.out in the directory where the job was submitted; this can be changed with the Slurm parameter --output.
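
As a minimal sketch of changing where output goes, the header below names the output files after the job name and job ID. The job name output-demo is hypothetical; %x (job name) and %j (job ID) are filename patterns described in man sbatch, and --error sends standard error to its own file instead of mixing it into the output file.

```shell
#!/bin/bash -e

#SBATCH --job-name        output-demo
#SBATCH --account         nesi99991
#SBATCH --output          %x-%j.out
#SBATCH --error           %x-%j.err

# Written to standard output, so it lands in the .out file.
echo "This line goes to the output file"

# Written to standard error, so it lands in the .err file.
echo "This line goes to the error file" >&2
```

Including the job ID in the file name stops repeated submissions of the same script from overwriting each other's output.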

Hint

You can use the manual pages for Slurm utilities to find more about their capabilities. On the command line, these are accessed through the man utility: run man <program-name>. You can find the same information online by searching “man <program-name>”.

[yourUsername@mahuika ~]$ man sbatch

Job environment variables

When Slurm runs a job, it sets a number of environment variables for the job. One of these will let us check what directory our job script was submitted from. The SLURM_SUBMIT_DIR variable is set to the directory from which our job was submitted. Using the SLURM_SUBMIT_DIR variable, modify your job so that it prints out the location from which the job was submitted.

Solution

[yourUsername@mahuika ~]$ nano example-job.sh
[yourUsername@mahuika ~]$ cat example-job.sh
#!/bin/bash -e
#SBATCH --time 00:00:30

echo -n "This script is running on "
hostname

echo "This job was launched in the following directory:"
echo ${SLURM_SUBMIT_DIR}
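
When a script is tested with bash rather than submitted with sbatch, SLURM_SUBMIT_DIR is not set. A minimal sketch, assuming you want the script to work either way, falls back to the current directory; submit_dir is a name chosen for this example.

```shell
# Use the Slurm-provided directory when available, otherwise the current one.
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"

echo "This job was launched in the following directory: ${submit_dir}"
```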

Key Points

  • The scheduler handles how compute resources are shared between users.

  • A job is just a shell script.

  • Request slightly more resources than you will need.


What is Parallel Computing

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How do we execute a task in parallel?

  • What benefits arise from parallel execution?

  • What are the limits of gains from execution in parallel?

  • What is the difference between implicit and explicit parallelisation?

Objectives
  • Prepare a job submission script for the parallel executable.

Methods of Parallel Computing

To understand the different types of Parallel Computing we first need to clarify some terms.

[Figure: Node anatomy]

CPU: Unit that does the computations.

Task: One or more CPUs that share memory.

Node: The physical hardware. The upper limit on how many CPUs can be in a task.

Shared Memory: When multiple CPUs are used within a single task.

Distributed Memory: When multiple tasks are used.

Which methods are available to you is largely dependent on the nature of the problem and software being used.

Shared-Memory (SMP)

Shared-memory multiprocessing divides work among CPUs or threads; all of these threads require access to the same memory.

Often called Multithreading.

This means that all CPUs must be on the same node. Most Mahuika nodes have 72 CPUs.

Shared memory parallelism is what is used in our example script array_sum.r.

The number of threads to use is specified by the Slurm option --cpus-per-task.

Shared Memory Example

Create a new script called smp-job.sl

#!/bin/bash -e

#SBATCH --job-name        smp-job
#SBATCH --account         nesi99991
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     500
#SBATCH --cpus-per-task   8

echo "I am task #${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs"

then submit with

[yourUsername@mahuika ~]$ sbatch smp-job.sl

Solution

Checking the output should reveal

[yourUsername@mahuika ~]$ cat smp-job.out
I am task #0 running on node 'wbn224' with 8 CPUs

Distributed-Memory (MPI)

Distributed-memory multiprocessing divides work among tasks. A task may contain multiple CPUs (provided they all share memory, as discussed previously).

Message Passing Interface (MPI) is a communication standard for distributed-memory multiprocessing. While there are other standards, often ‘MPI’ is used synonymously with distributed parallelism.

Each task has its own exclusive memory, and tasks can be spread across multiple nodes, communicating via an interconnect. This allows MPI jobs to be much larger than shared memory jobs. It also means that memory requirements are more likely to increase proportionally with CPUs.

Distributed-memory multiprocessing predates shared-memory multiprocessing, and is more common with classical high performance applications (older computers had one CPU per node).

The number of tasks to use is specified by the Slurm option --ntasks. Because the number of tasks ending up on one node is variable, you should use --mem-per-cpu rather than --mem to ensure each task has enough.

Tasks cannot share cores. This means that in most circumstances leaving --cpus-per-task unspecified will get you 2 CPUs per task.

Distributed Memory Example

Create a new script called mpi-job.sl

#!/bin/bash -e

#SBATCH --job-name        mpi-job
#SBATCH --account         nesi99991 
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     500
#SBATCH --ntasks          4

srun bash -c "echo \"I am task #\${SLURM_PROCID} running on node '\$(hostname)' with \$(nproc) CPUs\""

then submit with

[yourUsername@mahuika ~]$ sbatch mpi-job.sl

Solution

[yourUsername@mahuika ~]$ cat mpi-job.out
I am task #1 running on node 'wbn012' with 2 CPUs
I am task #3 running on node 'wbn010' with 2 CPUs
I am task #0 running on node 'wbn009' with 2 CPUs
I am task #2 running on node 'wbn063' with 2 CPUs

Using a combination of Shared and Distributed memory is called Hybrid Parallel.

Hybrid Example

Create a new script called hybrid-job.sl

#!/bin/bash -e

#SBATCH --job-name        hybrid-job
#SBATCH --account         nesi99991 
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     500
#SBATCH --ntasks          2
#SBATCH --cpus-per-task   4

srun bash -c "echo \"I am task #\${SLURM_PROCID} running on node '\$(hostname)' with \$(nproc) CPUs\""

then submit with

[yourUsername@mahuika ~]$ sbatch hybrid-job.sl

Solution

[yourUsername@mahuika ~]$ cat hybrid-job.out

I am task #0 running on node 'wbn016' with 4 CPUs
I am task #1 running on node 'wbn022' with 4 CPUs

GPGPUs

GPUs compute large numbers of simple operations in parallel, making them well suited for graphics processing (hence the name), or any other large matrix operations.

On NeSI, GPUs are specialised pieces of hardware that you request in addition to your CPUs and memory.

You can find an up-to-date(ish) list of GPUs available on NeSI in our Support Documentation

GPUs can be requested using --gpus-per-node=<gpu_type>:<gpu_number>

Depending on the GPU type, we may also need to specify a partition using --partition.

GPU Job Example

Create a new script called gpu-job.sl

#!/bin/bash -e

#SBATCH --job-name        gpu-job
#SBATCH --account         nesi99991 
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     2G
#SBATCH --gpus-per-node   P100:1

module load CUDA
nvidia-smi  

then submit with

[yourUsername@mahuika ~]$ sbatch gpu-job.sl

Solution

[yourUsername@mahuika ~]$ cat gpu-job.out

Tue Mar 12 19:40:51 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Job Array

Job arrays are not “multiprocessing” in the same way as the previous two methods. They are ideal for embarrassingly parallel problems, where there are little to no dependencies between the different jobs.

A job array can be thought of less as running a single job in parallel and more as running multiple serial jobs simultaneously. Often this will involve running the same process on multiple inputs.

Embarrassingly parallel jobs should be able to scale without any loss of efficiency. If this type of parallelisation is an option, it will almost certainly be the best choice.

A job array can be specified using --array

If you are writing your own code, then this is something you will probably have to specify yourself.

Job Array Example

Create a new script called array-job.sl

#!/bin/bash -e

#SBATCH --job-name        array-job
#SBATCH --account         nesi99991
#SBATCH --output          %x_%a.out
#SBATCH --mem-per-cpu     500
#SBATCH --array           0-3

echo "I am task #${SLURM_PROCID} running on node '$(hostname)' with $(nproc) CPUs"

then submit with

[yourUsername@mahuika ~]$ sbatch array-job.sl

Solution

ls
array-job_0.out array-job_1.out array-job_2.out array-job_3.out

Each of which should contain,

 [yourUsername@mahuika ~]$ cat array-job*.out
I am task #0 running on node 'wbn*' with 2 CPUs
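
Each array member can tell itself apart from the others through the SLURM_ARRAY_TASK_ID environment variable, which Slurm sets to that member's index. A minimal sketch of using it to choose an input follows; the sample_<index>.csv naming is hypothetical, and the fallback to 0 is only there so the lines can be tried outside a job.

```shell
# Index of this array member (0-3 for '--array 0-3'); defaults to 0 outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical naming scheme: one input file per array index.
input_file="sample_${task_id}.csv"

echo "Array task ${task_id} would process ${input_file}"
```

Placed in array-job.sl, these lines would let the four array members each work on a different file while sharing one script.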

How to Utilise Multiple CPUs

Requesting extra resources through Slurm only means that more resources will be available, it does not guarantee your program will be able to make use of them.

Generally speaking, Parallelism is either implicit where the software figures out everything behind the scenes, or explicit where the software requires extra direction from the user.

Scientific Software

The first step when looking to run particular software should always be to read the documentation. On one end of the scale, some software may claim to make use of multiple cores implicitly, but this should be verified as the methods used to determine available resources are not guaranteed to work.

Some software will require you to specify the number of cores (e.g. -n 8 or -np 16), or even the type of parallelisation (e.g. -dis or -mpi=intelmpi).

Occasionally your input files may require rewriting/regenerating for every new CPU combination (e.g. domain based parallelism without automatic partitioning).

Writing Code

Occasionally requesting more CPUs in your Slurm job is all that is required and whatever program you are running will automagically take advantage of the additional resources. However, it’s more likely to require some amount of effort on your behalf.

It is important to determine this before you start requesting more resources through Slurm.

If you are writing your own code, some programming languages will have functions that can make use of multiple CPUs without requiring you to change your code. However, unless that function is where the majority of time is spent, this is unlikely to give you the performance you are looking for.

Python: Multiprocessing (not to be confused with threading, which is not really parallel).

MATLAB: Parpool

Key Points

  • Parallel programming allows applications to take advantage of parallel hardware; serial code will not ‘just work.’

  • There are multiple ways you can run


Afternoon Break



Using resources effectively

Overview

Teaching: 25 min
Exercises: 10 min
Questions
  • How can I review past jobs?

  • How can I use this knowledge to create a more accurate submission script?

Objectives
  • Understand how to look up job statistics and profile code.

  • Understand job size implications.

  • Understand problems and limitations involved in using multiple CPUs.

What Resources?

Last time we submitted a job, we did not specify a number of CPUs, and therefore we were provided the default of 2 (1 core).

As a reminder, our slurm script example-job.sl currently looks like this.

#!/bin/bash -e

#SBATCH --job-name      my_job
#SBATCH --account       nesi99991
#SBATCH --mem           300M
#SBATCH --time          00:15:00

module purge
module load R/4.3.1-gimkl-2022a
Rscript  array_sum.r 
echo "Done!"

We will now submit the same job again with more CPUs. We ask for more CPUs by adding #SBATCH --cpus-per-task 4 to our script. Your script should now look like this:

#!/bin/bash -e

#SBATCH --job-name      my_job
#SBATCH --account       nesi99991
#SBATCH --mem           300M
#SBATCH --time          00:15:00
#SBATCH --cpus-per-task 4

module purge
module load R/4.3.1-gimkl-2022a
Rscript   array_sum.r 
echo "Done!"

And then submit using sbatch as we did before.

[yourUsername@mahuika ~]$ sbatch example-job.sl
Submitted batch job 23137702

Watch

We can prepend any command with watch in order to periodically (default 2 seconds) run a command. For example, watch squeue --me will give us up-to-date information on our running jobs. Care should be used when using watch, as repeatedly running a command can have adverse effects. Exit watch with ctrl + c.

Note that in squeue, the number under CPUS should be ‘4’.

Checking on our job with sacct. Oh no!

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
27323464         my_job      large  nesi99991          4 OUT_OF_ME+    0:125 
27323464.ba+      batch             nesi99991          4 OUT_OF_ME+    0:125 
27323464.ex+     extern             nesi99991          4  COMPLETED      0:0 

To understand why our job failed, we need to talk about the resources involved.

Understanding the resources you have available and how to use them most efficiently is a vital skill in high performance computing.

Below is a table of common resources and issues you may face if you do not request the correct amount.

Resource | Not enough | Too much
--- | --- | ---
CPU | The job will run more slowly than expected, and so may run out of time and get killed for exceeding its time limit. | The job will wait in the queue for longer. You will be charged for CPUs regardless of whether they are used or not. Your fair share score will fall more.
Memory | Your job will fail, probably with an 'OUT OF MEMORY' error, segmentation fault or bus error (may not happen immediately). | The job will wait in the queue for longer. You will be charged for memory regardless of whether it is used or not. Your fair share score will fall more.
Walltime | The job will run out of time and be terminated by the scheduler. | The job will wait in the queue for longer.

Measuring Resource Usage of a Finished Job

Since we have already run a job (successful or otherwise), this is the best source of info we currently have. We can check the status of our finished job using the sacct command we learned earlier.

[yourUsername@mahuika ~]$ sacct
JobID           JobName          Alloc     Elapsed     TotalCPU  ReqMem   MaxRSS State      
--------------- ---------------- ----- ----------- ------------ ------- -------- ---------- 
31060451        example-job.sl       2    00:00:48    00:33.548      1G          CANCELLED  
31060451.batch  batch                2    00:00:48    00:33.547          102048K CANCELLED  
31060451.extern extern               2    00:00:48     00:00:00                0 CANCELLED  

With this information, we may determine a couple of things.

Memory efficiency can be determined by comparing ReqMem (requested memory) with MaxRSS (maximum used memory). MaxRSS is given in KB, so a unit conversion is usually required.

Memory Efficiency = ( MaxRSS / ReqMem ) x 100

So for the above example we see that 0.1 GB (102048K) of our requested 1 GB was used, meaning the memory efficiency was about 10%.

CPU efficiency can be determined by comparing TotalCPU (CPU time) with the maximum possible CPU time. The maximum possible CPU time is equal to Alloc (number of allocated CPUs) multiplied by Elapsed (walltime, actual time passed).

CPU Efficiency = ( TotalCPU / ( Alloc x Elapsed ) ) x 100

For the above example, 33 seconds of computation was done, where the maximum possible computation time was 96 seconds (2 CPUs multiplied by 48 seconds), meaning the CPU efficiency was about 35%.

Time efficiency is simply the Elapsed time divided by the time requested.

Time Efficiency = ( Elapsed / Time Requested ) x 100

48 seconds out of 15 minutes requested gives a time efficiency of about 5%.
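
The three calculations above can be reproduced on the command line. This is a rough sketch rather than a NeSI tool: percent is a helper function written for this example, and the numbers are copied by hand from the sacct output shown earlier (1G of requested memory is taken as 1048576K).

```shell
# percent USED AVAILABLE -> prints the percentage with one decimal place.
percent() {
    LC_ALL=C awk -v used="$1" -v avail="$2" 'BEGIN { printf "%.1f", 100 * used / avail }'
}

# Memory: MaxRSS of 102048K against ReqMem of 1G (1048576K).
mem_eff=$(percent 102048 1048576)

# CPU: TotalCPU of 33.548 s against 2 CPUs x 48 s of elapsed time.
cpu_eff=$(percent 33.548 $((2 * 48)))

# Time: 48 s elapsed against the 15 minutes requested.
time_eff=$(percent 48 $((15 * 60)))

echo "Memory: ${mem_eff}%  CPU: ${cpu_eff}%  Time: ${time_eff}%"
```

The exact figures (9.7%, 34.9% and 5.3%) agree with the rounded values quoted in the text.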

Efficiency Exercise

Calculate for the job shown below,

JobID           JobName          Alloc     Elapsed     TotalCPU  ReqMem   MaxRSS State
--------------- ---------------- ----- ----------- ------------ ------- -------- ----------
37171050        Example-job          8    00:06:03     00:23:04     32G           FAILED
37171050.batch  batch                8    00:06:03    23:03.999         14082672k FAILED
37171050.extern extern               8    00:06:03    00:00.001                0  COMPLETED

a. CPU efficiency.

b. Memory efficiency.

Solution

a. CPU efficiency is ( 23 / ( 8 * 6 ) ) x 100 or around 48%.

b. Memory efficiency is ( 14 / 32 ) x 100 or around 43%.

For convenience, NeSI has provided the command nn_seff <jobid> to calculate Slurm Efficiency (all NeSI commands start with nn_, for NeSI NIWA).

[yourUsername@mahuika ~]$ nn_seff <jobid>
Job ID: 27323570
Cluster: mahuika
User/Group: username/username
State: COMPLETED (exit code 0)
Cores: 1
Tasks: 1
Nodes: 1
Job Wall-time:  5.11%  00:00:46 of 00:15:00 time limit
CPU Efficiency: 141.30%  00:01:05 of 00:00:46 core-walltime
Mem Efficiency: 93.31%  233.29 MB of 250.00 MB

Knowing what we do now about job efficiency, let’s submit the previous job again but with more appropriate resources.

#!/bin/bash -e

#SBATCH --job-name      my_job
#SBATCH --account       nesi99991
#SBATCH --mem           300M
#SBATCH --time          00:15:00
#SBATCH --cpus-per-task 4

module purge
module load R/4.3.1-gimkl-2022a
Rscript   array_sum.r 
echo "Done!"

[yourUsername@mahuika ~]$ sbatch example-job.sl

Hopefully we will have better luck with this one!

A quick description of Simultaneous Multithreading - SMT (aka Hyperthreading)

Modern CPU cores have 2 threads of operation that can execute independently of one another. SMT is the technology that allows the 2 threads within one physical core to present as multiple logical cores, sometimes referred to as virtual CPUs (vCPUs).

Note: Hyperthreading is Intel’s marketing name for SMT. Both Intel and AMD CPUs have SMT technology.

Some types of processes can take advantage of multiple threads, and can gain a performance boost. Some software is specifically written as multi-threaded. You will need to check or test if your code can take advantage of threads (we can help with this).

However, because each thread shares resources on the physical core, there can be conflicts for resources such as onboard cache. This is why not all processes get a performance boost from SMT and in fact can run slower. These types of jobs should be run without multithreading. There is a Slurm parameter for this: --hint=nomultithread

SMT is why you are provided 2 CPUs instead of 1 as we do not allow 2 different jobs to share a core. This also explains why you will sometimes see CPU efficiency above 100%, since CPU efficiency is based on core and not thread.
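
As a sketch of how the --hint=nomultithread parameter mentioned above fits into a job script, the header below could be used for a job that runs slower with SMT. The job name and CPU count are hypothetical, and the final line only reports what the job can see; it is not a benchmark.

```shell
#!/bin/bash -e

#SBATCH --job-name=nomultithread-job
#SBATCH --account=nesi99991
#SBATCH --cpus-per-task=4
#SBATCH --hint=nomultithread

# Report how many CPUs are visible to the job.
echo "CPUs visible to this job: $(nproc)"
```

Comparing the run time of the same job with and without this line is the simplest way to find out whether your code benefits from SMT.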

For more details please see our documentation on Hyperthreading

Measuring the System Load From Currently Running Tasks

On Mahuika, we allow users to connect directly to compute nodes from the login node. This is useful to check on a running job and see how it’s doing, however, we only allow you to connect to nodes on which you have running jobs.

The most reliable way to check current system stats is with htop. htop is an interactive process viewer that can be launched from command line.

Finding job node

Before we can check on our job, we need to find out where it is running. We can do this with the command squeue --me, and looking under the ‘NODELIST’ column.

[yourUsername@mahuika ~]$ squeue --me
JOBID         USER     ACCOUNT   NAME        CPUS MIN_MEM PARTITI START_TIME     TIME_LEFT STATE    NODELIST(REASON)    
26763045      cwal219  nesi99991 test           2    512M large   May 11 11:35       14:46 RUNNING  wbn144 

Now that we know the location of the job (wbn144) we can use ssh to run htop on that node.

[yourUsername@mahuika ~]$ ssh wbn144 -t htop -u $USER

You may get a message:

ECDSA key fingerprint is SHA256:############################################
ECDSA key fingerprint is MD5:9d:############################################
Are you sure you want to continue connecting (yes/no)?

If so, type yes and press Enter.

You may also need to enter your cluster password.

If you cannot connect, it may be that the job has finished and you have lost permission to ssh to that node.
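If you would rather not copy the node name by hand, the lookup can be scripted. The helper below is a minimal sketch: the function name first_job_node is hypothetical, and it assumes you have one running job on a single node (for multi-node jobs, squeue prints a compressed node list that ssh cannot use directly).

```shell
# Print the node list of the first running job belonging to the current user.
first_job_node() {
    squeue --me --states=RUNNING --noheader --format=%N | head -n 1
}

# Usage on the cluster (commented out so the sketch has no side effects):
# ssh "$(first_job_node)" -t htop -u "$USER"
```

If the function prints nothing, you have no running jobs, and the ssh connection would be refused anyway.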

Reading Htop

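You may see something like the following. (The sample below is laid out like the output of top; htop shows similar per-process information in an interactive display.)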

top - 21:00:19 up  3:07,  1 user,  load average: 1.06, 1.05, 0.96
Tasks: 311 total,   1 running, 222 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.2 us,  3.2 sy,  0.0 ni, 89.0 id,  0.0 wa,  0.2 hi,  0.2 si,  0.0 st
KiB Mem : 16303428 total,  8454704 free,  3194668 used,  4654056 buff/cache
KiB Swap:  8220668 total,  8220668 free,        0 used. 11628168 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1693 jeff      20   0 4270580 346944 171372 S  29.8  2.1   9:31.89 gnome-shell
 3140 jeff      20   0 3142044 928972 389716 S  27.5  5.7  13:30.29 Web Content
 3057 jeff      20   0 3115900 521368 231288 S  18.9  3.2  10:27.71 firefox
 6007 jeff      20   0  813992 112336  75592 S   4.3  0.7   0:28.25 tilix
 1742 jeff      20   0  975080 164508 130624 S   2.0  1.0   3:29.83 Xwayland
    1 root      20   0  230484  11924   7544 S   0.3  0.1   0:06.08 systemd
   68 root      20   0       0      0      0 I   0.3  0.0   0:01.25 kworker/4:1
 2913 jeff      20   0  965620  47892  37432 S   0.3  0.3   0:11.76 code
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd

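Overview of the most important fields:

  • PID: the numerical ID of each process.

  • USER: who started the process.

  • RES: the amount of memory currently being used by the process (in KiB in the sample above).

  • %CPU: how much of a CPU the process is using. Values above 100 percent indicate that a process is using more than one CPU.

  • %MEM: the percentage of system memory the process is using.

  • TIME+: how much CPU time the process has used so far.

  • COMMAND: the command used to launch the process.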

To exit press q.

Running htop on its own, without ssh, will show information on tasks running on the login node (where we should not be running resource-intensive jobs anyway).

Running Test Jobs

As you may have to run several iterations before you get it right, you should choose your test job carefully. A test job should not run for more than 15 minutes. This could involve using a smaller input, coarser parameters, or a subset of the calculations. As well as being quick to run, you want your test job to be quick to start (e.g. to get through the queue quickly). The best way to ensure this is to keep the resources requested (memory, CPUs, time) small. The test job should be as similar as possible to your actual jobs (e.g. using the same functions) and should follow the same workflow. Most problems are caused by small issues such as typos or missing files, and your test job is a good chance to sort these out. Make sure outputs are going somewhere you can see them.

Serial Test

Often a good first test is to execute your job serially, i.e. using only 1 CPU. This not only saves you time by being fast to start, but serial jobs can often be easier to debug. If you confirm your job works in its simplest state, you can identify problems caused by parallelisation much more easily.

You should generally ask for 20% to 30% more time and memory than you think the job will use. Testing allows you to become more precise with your resource requests. We will cover a bit more on running tests in the last lesson.
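The headroom rule is simple arithmetic. As an illustration, the hypothetical helper below adds a chosen percentage to a measured value and rounds up; the 25% figure and the input values are assumptions picked from within the suggested range.

```shell
# pad_request VALUE PERCENT: print VALUE increased by PERCENT, rounded up.
pad_request() {
    local value=$1 percent=$2
    echo $(( (value * (100 + percent) + 99) / 100 ))
}

pad_request 1000 25   # test used 1000 MB of memory -> prints 1250
pad_request 40 25     # test ran for 40 minutes     -> prints 50
```

The results would then go into the --mem and --time directives of the next job.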

An Efficient Way to Run Test Jobs Using the Debug QOS (Quality of Service)

Before submitting a large job, first submit one as a test to make sure everything works as expected. Often, users discover typos in their submit scripts, incorrect module names or possibly an incorrect pathname after their job has queued for many hours. Be aware that your job is not fully scanned for correctness when you submit the job. While you may get an immediate error if your SBATCH directives are malformed, it is not until the job starts to run that the interpreter starts to process the batch script.

NeSI has an easy way for you to test your job submission. You can use the debug QOS to get a short, high-priority test job. Debug jobs have to run within 15 minutes and cannot use more than 2 nodes. To use the debug QOS, add or change the following in your batch submit script:

#SBATCH --qos=debug
#SBATCH --time=15:00

Adding these SBATCH directives will provide your job with the highest priority possible, meaning it should start to run within a few minutes, provided your resource request is not too large.

Initial Resource Requirements

As we have just discussed, the best and most reliable method of determining resource requirements is from testing, but before we run our first test there are a couple of things you can do to start yourself off in the right area.

Read the Documentation

NeSI maintains documentation that has some guidance on resource use for some software. However, as you noticed in the Modules lesson, we have a lot of software, so it is also worth searching the web for others who may have written up guidance on getting the most out of your specific software.

Ask Other Users

If you know someone who has used the software before, they may be able to give you a ballpark figure.

Next Steps

You can use this knowledge to set up the next job with a closer estimate of its load on the system. A good general rule is to ask the scheduler for 30% more time and memory than you expect the job to need.

Key Points

  • As your task gets larger, so does the potential for inefficiencies.

  • The smaller your job (time, CPUs, memory, etc), the faster it will schedule.


Scaling

Overview

Teaching: 5 min
Exercises: 30 min
Questions
  • How do we go from running a job on a small number of CPUs to a larger one?

Objectives
  • Understand scaling procedure.

The aim of these tests is to establish how a job’s requirements change with size (CPUs, inputs) and ultimately to figure out the best way to run your jobs. Unfortunately, we cannot assume speedup will be linear (e.g. doubling the CPUs won’t usually halve the runtime, and doubling the size of your input data won’t necessarily double the runtime), so more testing is required. This is called scaling testing.

In order to establish an understanding of the scaling properties we may have to repeat this test several times, giving more resources each iteration.

Scaling Behavior

Amdahl’s Law

Most computational tasks will have a certain amount of work that must be computed serially.

Larger fractions of parallel code will have closer to linear scaling performance.

Eventually your performance gains will plateau.

The fraction of the task that can be run in parallel determines the point of this plateau. Code that has no serial components is said to be “embarrassingly parallel”.

It is worth noting that Amdahl’s law assumes all other elements of scaling happen with 100% efficiency. In reality there are additional computational and communication overheads.
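For reference, Amdahl’s law is usually written as follows, where the symbols are introduced here for illustration: p is the fraction of the work that can run in parallel, N is the number of CPUs, and S(N) is the speedup relative to one CPU.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

As a worked example, with p = 0.9 and N = 10 the speedup is 1 / (0.1 + 0.09) ≈ 5.26, and no number of CPUs can push it past 1 / 0.1 = 10. That limit is the plateau described above.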

Scaling Exercise

  1. Find your name in the spreadsheet and modify your example-job.sl to request “x” --cpus-per-task. For example #SBATCH --cpus-per-task 10.
  2. Estimate the memory requirement based on our previous runs and the CPUs requested. Memory is specified with the --mem flag. It does not accept decimal values, but you may specify a unit (K|M|G); if no unit is specified, M is assumed. For example #SBATCH --mem 1200.
  3. Now submit your job. We will include an extra argument, --acctg-freq 1. By default Slurm records job data every 30 seconds, which means any job running for less than 30 seconds will not have its memory use recorded. Submit the job with sbatch --acctg-freq 1 example-job.sl.
  4. Watch the job with squeue --me or watch squeue --me.
  5. On completion of the job, use nn_seff <job-id>.
  6. Record the job’s “Elapsed”, “TotalCPU”, and “Memory” values in the spreadsheet. (Hint: They are the first numbers after the percentage efficiency in the output of nn_seff.) Make sure you have entered the values in the correct format and there is a tick next to each entry.
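Once a few runs are recorded, the elapsed times can be turned into a speedup and a parallel efficiency. The helper below is a sketch under stated assumptions: the function name and the timings are hypothetical, elapsed times are given in seconds, speedup is taken as the 1-CPU time divided by the N-CPU time, and efficiency is that speedup divided by N.

```shell
# scaling_summary BASE_ELAPSED ELAPSED CPUS
# BASE_ELAPSED is the elapsed time on 1 CPU; ELAPSED is the time on CPUS CPUs.
scaling_summary() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        speedup = t1 / tn
        printf "speedup=%.2f efficiency=%.2f\n", speedup, speedup / n
    }'
}

# Hypothetical timings: 100 s on 1 CPU, 30 s on 4 CPUs.
scaling_summary 100 30 4   # prints speedup=3.33 efficiency=0.83
```

An efficiency that drops steadily as CPUs are added suggests the job is approaching the plateau where extra CPUs stop paying off.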

Solution

spreadsheet

Key Points

  • Start small.

  • Test one thing at a time (unit tests).

  • Record everything.