NPRG058: Advanced Programming in Parallel Environment
Assignment 2 - CUDA Histogram II
We continue with the Histogram saga (assuming you have completed the first assignment).
In this episode, we will try to improve the performance of the atomic-update kernel using three optimizations:
- Privatization: In the case of a small histogram, atomic update collisions are frequent. Make multiple copies of the histogram and aggregate them at the end. It is not feasible to have a copy for every thread, but we can make a fixed number of copies and assign threads (or thread blocks) to these copies in a round-robin manner. Note that there is a prepared application argument called --privCopies that you can use to set the number of copies.
- Shared Memory Caching: Using shared memory is another specialized case of privatization. If the histogram fits in shared memory, we can aggregate updates from one thread block there and merge the block-local copy of the histogram with the global copy at the end. For small-enough histograms, it may even be beneficial to create multiple copies inside shared memory (ideally, one copy per memory bank).
- Work Aggregation: To fully utilize shared memory, it is better to aggregate as many updates as possible; that is, one CUDA thread should process more than one input element. Make this amount (the number of processed items per thread) a tunable parameter and try to find the right value empirically. Note that there is a prepared application argument called --itemsPerThread that you can use to set this parameter.
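The global-memory privatization idea might be sketched as follows. This is an illustrative sketch, not the assignment's prepared code: the kernel names, the `copies` buffer layout (privCopies contiguous histograms), and the choice to assign whole thread blocks round-robin are all assumptions you may implement differently.

```cuda
#include <cstddef>
#include <cstdint>

// Illustrative sketch: each thread block updates one of `privCopies`
// histogram replicas, chosen round-robin by block index, so atomic
// collisions are spread across the copies instead of one shared histogram.
__global__ void histogramPrivatized(const uint8_t *input, std::size_t n,
                                    unsigned *copies, unsigned bins,
                                    unsigned privCopies)
{
    // All blocks with the same (blockIdx.x % privCopies) share one copy.
    unsigned *myCopy = copies + (std::size_t)(blockIdx.x % privCopies) * bins;

    // Grid-stride loop over the input elements.
    for (std::size_t i = (std::size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (std::size_t)gridDim.x * blockDim.x)
        atomicAdd(&myCopy[input[i]], 1u);
}

// Second pass: merge the private copies into the final histogram (copy 0).
__global__ void mergeCopies(unsigned *copies, unsigned bins,
                            unsigned privCopies)
{
    unsigned bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= bins) return;
    unsigned sum = 0;
    for (unsigned c = 0; c < privCopies; ++c)
        sum += copies[(std::size_t)c * bins + bin];
    copies[bin] = sum;
}
```

Note the trade-off: more copies mean fewer collisions but a larger merge pass and more memory traffic, which is exactly what the --privCopies parameter lets you explore.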
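Shared memory caching and work aggregation naturally combine into one kernel. The sketch below is one possible shape under assumed names (histogramShared, localHist); it keeps a single block-local copy in dynamic shared memory (sized to bins counters at launch) and lets each thread consume itemsPerThread elements before the merge.

```cuda
#include <cstddef>
#include <cstdint>

// Illustrative sketch: block-local histogram in shared memory plus
// work aggregation. Launch with bins * sizeof(unsigned) bytes of
// dynamic shared memory.
__global__ void histogramShared(const uint8_t *input, std::size_t n,
                                unsigned *globalHist, unsigned bins,
                                unsigned itemsPerThread)
{
    extern __shared__ unsigned localHist[]; // one counter per bin

    // Zero the block-local histogram cooperatively.
    for (unsigned b = threadIdx.x; b < bins; b += blockDim.x)
        localHist[b] = 0;
    __syncthreads();

    // Each thread aggregates itemsPerThread updates in shared memory,
    // where atomics are much cheaper than in global memory.
    std::size_t base = (std::size_t)blockIdx.x * blockDim.x * itemsPerThread;
    for (unsigned k = 0; k < itemsPerThread; ++k) {
        std::size_t i = base + (std::size_t)k * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&localHist[input[i]], 1u);
    }
    __syncthreads();

    // Merge: one global atomic per (block, bin) instead of one per element,
    // which is where aggregating more items per thread pays off.
    for (unsigned b = threadIdx.x; b < bins; b += blockDim.x)
        if (localHist[b] != 0)
            atomicAdd(&globalHist[b], localHist[b]);
}
```

Extending this to multiple shared-memory copies (one per memory bank) would mean allocating copies side by side and indexing by threadIdx.x modulo the copy count; the merge loop then also sums across the local copies.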
Experiment! Implement all approaches (or their combinations) and empirically determine which approaches are better and when (i.e., test them for different inputs and different histogram sizes).