Profiling an OpenMP program with MAP
- Objectives
- Code example
- Using MAP to profile an OpenMP executable
- Exercises
- Interpreting the profiling information
- Exercises
Objectives
You will:
- learn how to use MAP to profile an OpenMP code
- learn how to interpret MAP multithreaded profiling data
Code example
We’ll use the scatter.py code in the openmp directory of the solutions branch. Start by running:
git fetch --all
git checkout solutions
cd openmp
Using MAP to profile an OpenMP executable
To use MAP we need to load the forge module in our batch script and insert map --profile between srun and the executable. See for example the Slurm script scatter.sl:
ml forge
srun map --profile python scatter.py
Note: the command map --profile must follow srun in the case of a serial/threaded program. (For MPI programs, map --profile should precede srun.)
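For comparison, a hedged sketch of the MPI ordering (the executable name mpi_program and the task count are placeholders, not part of this exercise):
map --profile srun -n 4 ./mpi_program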
Exercises
- edit the script scatter.sl:
  - apply 8 OpenMP threads
  - load the module forge
  - prepend the command map --profile to the executable (a sketch of the edited script follows this list)
- submit the job
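For orientation, here is a minimal sketch of what the edited scatter.sl could look like; the #SBATCH directives are placeholders that will differ on your system, and only the last three active lines correspond to the edits above.
#!/bin/bash -e
#SBATCH --job-name=scatter        # placeholder job name
#SBATCH --time=00:10:00           # placeholder wall-clock limit
#SBATCH --cpus-per-task=8         # reserve 8 CPUs for the 8 OpenMP threads

# apply 8 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# load the module forge (provides the map command)
ml forge

# profile the threaded Python program: map --profile follows srun
srun map --profile python scatter.py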
Interpreting the profiling information
Upon execution, a file with the suffix .map will be generated. The results can be viewed with the map command, for example
map python_scatter_py_1p_1n_8t_2019-05-24_00-00.map
(the .map file name will vary with each run.)
Below is an example of the profiling data obtained by running python scatter.py -nx 256 -ny 256 with 8 OpenMP threads.
At the top, the activity window shows how the time is split between I/O (orange), serial computation (dark green) and parallel computation (light green). The orange parts amount to initialisation, a one-off cost that does not increase with problem size and is therefore not of great interest here. (Loading shared libraries such as numpy is responsible for the orange I/O activity.)
More interestingly, we see that 73 percent of the time is spent in the serial part of the code and 9 percent in the parallel part. The parallel part is the one that decreases as we throw more threads at the problem. This suggests that we are close to the maximum scalability of the program with 8 threads: adding more threads can reduce the execution time by at most 9 percent.
Also of interest, we observe that more than 50 percent of the execution time is spent in four lines of code (96, 101, 102 and 105). Lines 96, 101 and 102 are all purely serial and involve casting a numpy array into a C pointer that can be passed to a C function. Together these lines consume 36 percent of the execution time.
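The actual scatter.py source is not reproduced here, but the pattern MAP points at looks roughly like the following hypothetical sketch (the array names kvec, xg and yg, the helper as_double_ptr and the library handle wavelib are illustrative assumptions, not the real code):
import ctypes
import numpy as np

# stand-in arrays for the ones scatter.py passes to the C kernel
kvec = np.array([1.0, 0.0], dtype=np.float64)
xg = np.linspace(0.0, 1.0, 256)
yg = np.linspace(0.0, 1.0, 256)

def as_double_ptr(a):
    # cast a contiguous float64 numpy array into a C 'double *'
    return a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

for it in range(256):                 # stand-in for the loop in scatter.py
    kvec_ptr = as_double_ptr(kvec)    # like line 96: kvec is constant, so this can be hoisted
    xg_ptr = as_double_ptr(xg)        # like lines 101 and 102: these casts are
    yg_ptr = as_double_ptr(yg)        # superfluous if done once outside the loop
    # the pointers are then handed to the C/OpenMP routine, roughly as in line 105:
    # wavelib.computeScatterWave(kvec_ptr, xg_ptr, yg_ptr, ...)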
Exercises
- edit scatter.py (a hypothetical sketch of the edited loop follows this list):
  - remove lines 101 and 102 (which are superfluous)
  - move line 96 out of the loop (kvec is constant)
- regenerate the profiling data and compare the new profiling data with the previously obtained data:
  - how did the contributions of the parallel and serial execution times relative to the total time change?
  - how did the contribution of the function computeScatterWave to the total execution time change?
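Under the same illustrative assumptions as the sketch above (the real scatter.py will differ in detail), the edited code could look like this, with the loop-invariant cast hoisted out of the loop and the superfluous casts removed:
# casts performed once, before the loop
kvec_ptr = as_double_ptr(kvec)   # formerly inside the loop (line 96); kvec is constant
xg_ptr = as_double_ptr(xg)       # replaces the superfluous per-iteration casts
yg_ptr = as_double_ptr(yg)       # (lines 101 and 102)

for it in range(256):
    # only the call into the C/OpenMP kernel remains inside the loop, e.g.
    # wavelib.computeScatterWave(kvec_ptr, xg_ptr, yg_ptr, ...)
    pass
Hoisting loop-invariant work out of a loop is a standard serial optimisation; the point of the profiler here is simply to show which lines are worth that effort.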