NPRG058: Advanced Programming in Parallel Environment
Lab practice 03 - CUDA blur stencil
The objective is to implement a blur stencil in CUDA. The stencil produces a blurred image from the input image by simple weighted averaging. Each output pixel is a weighted average of all pixels within radius r of its position (i.e., from [x-r, y-r] to [x+r, y+r], a window of (2r+1)^2 pixels). The weight of each pixel is 1/d, where d is its Manhattan distance from the center (x, y); the center pixel itself uses an explicit weight of 5 (to avoid division by zero).
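For concreteness, here is a minimal kernel sketch of this weighting scheme. It is an illustration only: it assumes a float image in row-major layout, skips out-of-range neighbours at the borders, and normalizes by the sum of weights; check the actual types, signatures, and border handling against serial_blur in the starter pack.

__global__ void blur_kernel(const float *in, float *out, int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f, weights = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;  // skip pixels outside the image
            int d = abs(dx) + abs(dy);                 // Manhattan distance from the center (x, y)
            float w = (d == 0) ? 5.0f : 1.0f / d;      // explicit weight 5 for the center pixel
            sum += w * in[ny * width + nx];
            weights += w;
        }
    }
    out[y * width + x] = sum / weights;
}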
Use the starter pack at /home/_teaching/advpara/labs/03-cuda-blur. Copy it to your home directory and modify the cuda_blur function (you will also need to create a new kernel for the computation). Read the serial_blur function (the reference implementation) first. Use the attached Makefile to build and the run.sh script for the baseline measurement. Sample images are included in the starter pack (the data subdir).
Note the CUCH macro and how it is used to wrap CUDA calls (for better error handling).
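If you are curious what such a macro typically looks like, here is a sketch; the actual CUCH in the starter pack may be defined differently, so read its definition rather than rely on this one.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUCH(call) do {                                            \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
        fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                cudaGetErrorString(err_), __FILE__, __LINE__);     \
        exit(1);                                                   \
    }                                                              \
} while (0)

// Usage: wrap every CUDA runtime call, e.g.
// CUCH(cudaMalloc(&d_in, bytes));
// CUCH(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));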
Evaluation: You may have noticed that the CUDA implementation running on the GPU is somewhat slower than the serial version on the CPU. Find out why. Print out the times of the individual steps/CUDA calls (do not forget that a kernel launch is asynchronous).
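One way to time the individual steps is with CUDA events. The snippet below is a sketch meant to live inside cuda_blur; the names blur_kernel, d_in, d_out, grid, and block are placeholders, not identifiers from the starter pack.

cudaEvent_t start, stop;
CUCH(cudaEventCreate(&start));
CUCH(cudaEventCreate(&stop));

CUCH(cudaEventRecord(start));
blur_kernel<<<grid, block>>>(d_in, d_out, width, height, radius);
CUCH(cudaGetLastError());             // catches launch errors
CUCH(cudaEventRecord(stop));
CUCH(cudaEventSynchronize(stop));     // the launch returns immediately, so wait before reading the timer

float ms = 0.0f;
CUCH(cudaEventElapsedTime(&ms, start, stop));
printf("kernel: %.3f ms\n", ms);

Time the cudaMalloc and cudaMemcpy steps the same way (or with std::chrono, since those calls block the host by default); comparing them with the kernel time should reveal where the GPU version loses to the CPU.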
Stretch goal: One issue with this stencil is that it is quite data-bound. Consider how to improve data caching and re-use (recall what you learned in the basic parallel programming course); one possible direction is sketched below.
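One possible direction, not a prescribed solution: tile the input into shared memory so each pixel is fetched from global memory once per block instead of once per neighbouring thread. The sketch below assumes a fixed compile-time RADIUS, a 16x16 thread block, and clamp-at-border handling, all of which a real solution would have to reconcile with the reference implementation.

#define RADIUS 4
#define TILE 16    // must match the blockDim used for the launch (TILE x TILE threads)

__global__ void blur_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    // Cooperatively load the block's tile plus a RADIUS-wide halo.
    for (int ty = threadIdx.y; ty < TILE + 2 * RADIUS; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < TILE + 2 * RADIUS; tx += blockDim.x) {
            int gx = min(max((int)(blockIdx.x * TILE + tx) - RADIUS, 0), width - 1);   // clamp at borders
            int gy = min(max((int)(blockIdx.y * TILE + ty) - RADIUS, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // The averaging loop now reads neighbours from shared memory.
    float sum = 0.0f, weights = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            int d = abs(dx) + abs(dy);
            float w = (d == 0) ? 5.0f : 1.0f / d;
            sum += w * tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
            weights += w;
        }
    out[y * width + x] = sum / weights;
}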
Remarks and hints about gpulab usage
- Use the gpulab.ms.mff.cuni.cz server as the head node (not parlab, which was used in previous labs).
- Do not use the VS Code remote SSH extension (it consumes too much memory on the head node and may cause crashes).
- Use the gpu-short-teach partition (queue) with your nprg058s account for submitting the jobs. Also do not forget the --gpus (or --gres) parameter to allocate the GPU (allocate it also for compilation, otherwise srun may hang). I.e.,
srun -A nprg058s -p gpu-short-teach --gpus=V100:1 make
srun -A nprg058s -p gpu-short-teach --gpus=V100:1 ./cuda-blur-stencil args...
We will mainly use the Volta GPUs (V100 on workers volta01-05). However, the Volta architecture is a little dated now (it is no longer supported by the newest CUDA toolkit). If you wish to experiment with other GPUs (Ampere, Ada, Hopper, or Blackwell), you may, but you need to fix the Makefile and adjust the --gpus argument value accordingly (and recompile the solution).