NPRG058: Advanced Programming in Parallel Environment

Assignment 2 - CUDA Histogram I

Write a CUDA-accelerated implementation that computes a histogram (frequency analysis) of a text file. The input is plain text in which one character is one byte (we do not address encoding details). The histogram holds the number of occurrences for each character. The range of characters in the histogram may be adjusted via algorithm parameters. The range 0-127 is computed by default.

The initial source code is in /home/_teaching/advpara/ha2-cuda-histogram. The CUDA kernels, along with the C function(s) that invoke them (runners), should be in kernels/kernels.cu. You also need to copy the runner header into headers/kernels.cuh, and don't forget to explicitly instantiate its C++ template with the appropriate parameters. It is sufficient if the kernel is executable with uint8 as the input element type and uint32 as the result type.

Each algorithm has to be wrapped in a class that inherits from IHistogramAlgorithm. Place your CUDA algorithm wrappers into headers/cuda.hpp. You will also need to register your algorithm wrapper (create a unique pointer instance) in the getAlgorithm function of histogram.cpp. See the serial algorithm and add your algorithms analogously.

In the first episode of the Histogram saga, you need to implement two simple approaches:

A trivial sync-free approach in which one thread computes one bin of the final histogram (iterating over the entire input). This way, no explicit synchronization is required; however, a low number of bins will not saturate the GPU.
A straightforward atomic solution in which one thread handles one input character and increments the corresponding histogram bin (in global memory) using atomic instructions.

Check out all the input files and compare the performance of the two approaches.

Stretch Goal: Try to use a warp-opportunistic approach to aggregate atomic updates within one warp (see the lectures for how to achieve that).