What you have covered
You have learned how to profile Python code to identify performance hot spots. You also learned that coding style can affect performance. Vectorisation, for instance, brought a 7-8x speedup over the original code.
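As a reminder of what the profiling step looks like, here is a minimal sketch using the standard library's `cProfile` and `pstats` modules. The functions `slow_sum_of_squares` and `main` are hypothetical stand-ins, not the code used in this lesson.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    # Hypothetical driver that calls the hot spot several times.
    return [slow_sum_of_squares(50_000) for _ in range(5)]


# Collect timing data only while main() runs.
profiler = cProfile.Profile()
profiler.enable()
results = main()
profiler.disable()

# Sort by cumulative time and keep at most the ten most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

The functions at the top of the report are the candidates for vectorisation or for migration to compiled code.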
Additional improvements were achieved by migrating some parts of the code to C/C++. If the computational kernels are simple enough to be compiled automatically, then numba is an attractive option (16-17x speedup). Writing the C/C++ extension by hand brings a 20-30x speedup over the original code.
Throwing more resources at the problem is another way to reduce wall-clock time. For our problem, we obtained a 3-4x speedup with 8 threads and a 5-6x speedup with MPI using 8 processes.
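One way to put these numbers in perspective is parallel efficiency, defined as the speedup divided by the number of workers. The short sketch below applies that definition to the ranges quoted above; the helper name `parallel_efficiency` is introduced here for illustration only.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal (linear) scaling achieved: speedup / workers."""
    return speedup / workers


# Speedup ranges quoted in the text, each obtained with 8 workers.
cases = [("8 threads", 3, 4), ("8 MPI processes", 5, 6)]

for label, low, high in cases:
    eff_low = parallel_efficiency(low, 8)
    eff_high = parallel_efficiency(high, 8)
    print(f"{label}: {eff_low:.1%} to {eff_high:.1%} efficiency")
```

An efficiency well below 100% indicates that part of the run time does not scale with the number of workers, for example serial sections or communication overhead.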
The above strategies can be combined. For our test problem, the best results were obtained by applying OpenMP threading to the loops coded in C/C++. With a little additional tuning, we obtained a 110x speedup over the original code with 8 threads.
Your mileage may vary: all optimisation techniques presented here depend on the problem type and size. You should not expect the same speedup values for other problems.
Want to learn more? Here is some material which we have found useful: