NPRG042 Programming in Parallel Environment
Labs 03 - OpenMP
Matrix multiplication
The first task is to accelerate matrix multiplication using OpenMP. The initial source code with the serial solution is ready in the /home/_teaching/para/labs/omp/matrix-mul directory on parlab. Open matrix-mul.cpp and parallelize the code in the para_matrix_mul() function.
Hint: Let's use parallel for! 🎉
Adjust the N constant in the main() function to select the matrix size: 256 is fine for debugging, 1024 is a good size for performance testing, and 2048 takes about 28 seconds to compute sequentially on w201. Do not use larger matrices; they stall the worker for too long.
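As one possible illustration of the parallel for hint, the sketch below distributes the outer loop of a naive triple-loop multiplication among threads. The function name matmul_sketch, the float element type, and the flat row-major std::vector layout are assumptions made for this example; the actual para_matrix_mul() in matrix-mul.cpp may use a different signature and storage.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT the signature used in matrix-mul.cpp.
// Computes c = a * b for n x n matrices stored in row-major order.
void matmul_sketch(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, std::size_t n)
{
    // Each iteration of the outer loop writes a distinct row of c,
    // so rows can be handed to different threads without synchronization.
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f; // declared inside the loop, so private to the thread
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

With GCC or Clang the pragma only takes effect when the code is compiled with -fopenmp; without that flag it is ignored and the loop runs serially, which is handy for checking correctness first.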
Searching for minimum
The second task is to parallelize the search for the minimum in an array of numbers. The initial source code with the serial solution is ready in /home/_teaching/para/labs/omp/minimum; edit the para_min() function in the minimum.cpp file.
You may adjust the value passed to generate_data() in the main() function to experiment with different array sizes. The default value is 1024*1024 (1M numbers).
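A minimal sketch of one way to approach this task is the OpenMP reduction clause with the min operator (available since OpenMP 3.1). The name min_sketch, the int element type, and the std::vector argument are assumptions for illustration only; para_min() in minimum.cpp may look different.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, NOT the signature used in minimum.cpp.
// Returns the smallest element, or INT_MAX for an empty input.
int min_sketch(const std::vector<int>& data)
{
    int result = std::numeric_limits<int>::max();

    // Every thread gets a private copy of result initialized to the
    // largest int; the copies are combined with min when the loop ends.
    #pragma omp parallel for reduction(min : result)
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] < result) {
            result = data[i];
        }
    }
    return result;
}
```

The reduction avoids a shared variable that every thread would otherwise have to update under a lock or atomic operation.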
Stretch goal: Try to parallelize the search for the k smallest values in the array, where k is much smaller than the size of the array (tens or hundreds at most).
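One way the stretch goal could be approached, sketched under assumptions: each thread keeps its own max-heap of at most k candidates while scanning its share of the array, and the small per-thread heaps are merged at the end. The name k_smallest_sketch, the int element type, and the choice of std::priority_queue are illustrative, not part of the lab sources, and other strategies may well perform better.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical helper: returns the k smallest values in ascending order.
std::vector<int> k_smallest_sketch(const std::vector<int>& data, std::size_t k)
{
    std::vector<int> merged; // candidates collected from all threads
    if (k == 0) {
        return merged;
    }

    #pragma omp parallel
    {
        // Max-heap with the k smallest values this thread has seen so far.
        std::priority_queue<int> local;

        #pragma omp for nowait
        for (std::size_t i = 0; i < data.size(); ++i) {
            if (local.size() < k) {
                local.push(data[i]);
            } else if (data[i] < local.top()) {
                local.pop();          // drop the largest candidate
                local.push(data[i]);  // and keep the smaller value instead
            }
        }

        // At most k values per thread reach this point, so the
        // serialized merge is cheap compared with the scan above.
        #pragma omp critical
        {
            while (!local.empty()) {
                merged.push_back(local.top());
                local.pop();
            }
        }
    }

    // Keep only the k smallest of the collected candidates.
    std::sort(merged.begin(), merged.end());
    if (merged.size() > k) {
        merged.resize(k);
    }
    return merged;
}
```

Because k is small, the final sort touches at most k times the number of threads elements, so almost all of the work stays in the parallel scan.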
Compilation and execution on parlab
For compilation, you may use the provided Makefile and optionally the build.sh sbatch script. Note that build.sh is not an ordinary shell script; it must be submitted via Slurm:
$> sbatch ./build.sh
It will enqueue the job and execute it once the resources become available. The job ID is returned by the sbatch command. You can check the job status using the squeue command or the sacct -j <jobID> command. Once the job is finished, the output is stored in the ./build.log file.
If you prefer to run make directly from the terminal, use srun:
$> srun -A nprg042s -p mpi-homo-short -c 1 -n 1 make
Similarly, you can run the compiled program using:
$> sbatch ./run.sh
It writes the output (measured times) to run.log. Alternatively, you can run it interactively as:
$> srun -A nprg042s -p mpi-homo-short -c 32 -n 1 ./<executable-name> [ args... ]
Feel free to experiment with the number of allocated processors (-c argument) and the size of the input data. For debugging, it is recommended to keep the input small (up to 256 in the case of matmul, a few thousand for minimum) and to use only 4-8 CPUs so that resources can be shared better in the class. 32 cores is a reasonable maximum; 64 cores are unlikely to bring any improvement in the case of minimum, since that task is memory-bound.
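When experimenting with the -c argument, it can help to confirm how many threads the OpenMP runtime is actually prepared to use. The sketch below is a hypothetical helper (the names omp_thread_budget and report_thread_budget are not part of the lab sources). Whether the reported value follows the Slurm allocation depends on how the cluster binds jobs to CPUs, so treat it as a diagnostic rather than a guarantee.

```cpp
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: upper bound on the number of threads that a
// parallel region without a num_threads clause would get.
int omp_thread_budget()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1; // compiled without OpenMP support, so everything runs serially
#endif
}

// Prints the value; can be called at the start of main() while experimenting.
void report_thread_budget()
{
    std::cout << "OpenMP threads available: " << omp_thread_budget() << '\n';
}
```

The _OPENMP guard keeps the code compilable even without -fopenmp, in which case the helper simply reports one thread.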