NPRG042 Programming in Parallel Environment

Assignment 5


Duplicate citizens
Apache Spark
Assigned: 4.5.2026
Deadline: 15.5.2026 23:59 (CEST)
ReCodEx: assignment
Results:
speedup       points
2× or less    0
2× to 3.5×    1
3.5× to 5×    2
5× to 7×      3
7× or more    4
The speedup baseline (a reference run) is 180 s.

The task is to count the number of people in a given population list with identical first and last names who live in the same region. The region can be identified by taking the highest-order digit of the ZIP code.

Use the Spark instance in /home/_teaching/para/spark (DO NOT COPY this directory). It is a self-contained Spark with Hadoop, so you do not need to install anything else. It can be started using the spark-slurm.sh sbatch script (to run the cluster under SLURM) or by local invocation (on one node), like this:

$> srun -p mpi-homo-short --mem=50G -c 4 /home/_teaching/para/spark/bin/spark-submit --master local[*] \
   ./your-solution.py /home/_teaching/para/05-spark/data/debug/small.csv

(adjust the memory and number of cores for your task accordingly).

The data and the initial scripts are available in /home/_teaching/para/05-spark. There are multiple datasets, several GiB in size, so please DO NOT COPY them. The measurements will use the large.csv dataset only; the other datasets in the debug subdir are for debugging and additional experimentation. The path to the input data file is given as the only command-line argument to your solution.

The input file has 4 columns; the first two are the first and last name, respectively, and the last column is the area (ZIP) code.

Output

The result must be stored in the ./output.csv file in the current working directory. As the name suggests, the file is in CSV format without a header, where the columns are separated by a comma (,) and the lines are separated by a simple newline (LF). The first column is the region number (i.e., the highest-order digit of the postcode). The second column is the number of collisions (duplicates found) in that region. The file must be sorted by the region number in ascending order.
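The counting rule is easy to get wrong: every person in a colliding group counts, not just the "extra" ones. The following pure-Python sketch of the expected semantics (not a Spark solution; the function names are illustrative) may help when checking your results on the debug datasets:

```python
# Pure-Python reference of the output semantics (illustrative, not a Spark
# solution): region = first digit of the ZIP code; a region's result is the
# total number of people whose (first name, last name) pair occurs more than
# once within that region.
from collections import Counter

def count_duplicates(rows):
    """rows: iterable of (first, last, id, zip) tuples.
    Returns (region, count) pairs sorted by region."""
    groups = Counter((z[0], f, l) for f, l, _i, z in rows)
    regions = Counter()
    for (region, _f, _l), n in groups.items():
        if n >= 2:                # only names that actually collide
            regions[region] += n  # count all people in the colliding group
    return sorted(regions.items())

def write_output(pairs, path="./output.csv"):
    """Write the required CSV: no header, comma-separated, LF-terminated lines."""
    with open(path, "w", newline="\n") as fh:
        for region, n in pairs:
            fh.write(f"{region},{n}\n")
```

This runs in a single process, so it is only practical on the small debug inputs.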

Example

Let us have the following input file:

Jan,Novák,269235752,12345
Jakub,Yaghob,996583646,12345
Jan,Novák,786940673,28167
Vladan,Majerech,302273606,29876
Jan,Novák,217963156,13456
Jakub,Yaghob,778990574,12345
Jan,Novák,804096211,28002
Jan,Novák,252349303,15678

Upon closer inspection, there are 2 areas (ZIP groups), starting with 1 and 2. In the first area, there are 3 duplicates of Jan Novák and 2 duplicates of Jakub Yaghob (i.e., 5 people in total). In the second area, there are only two people named Jan Novák. So the output file should look like this (the first column is the area number, and the second is the number of duplicates):

1,5
2,2

Submission

Submit your solution to ReCodEx. You are expected to submit a single Python script (with the .py extension). For practical reasons, ReCodEx uses only the tiny.csv input dataset to perform a smoke test of your solution. Furthermore, the environment in ReCodEx uses only spark-submit --master local[1] to evaluate the solutions, which works slightly differently from the spark-slurm.sh script. Most notably, it executes only a single worker.

You are expected to perform your own thorough testing using the large.csv dataset on the entire parlab cluster.

Final evaluation

Your solution will be tested using the spark-slurm.sh script with identical #SBATCH parameters (i.e., using 8 nodes and 16 workers). We are using a slight modification of the script, which saves the measured time in a file rather than printing it out, but that should not affect the measurements. You can easily test your solutions as follows:

$> sbatch /home/_teaching/para/05-spark/spark-slurm.sh ./your-solution.py /home/_teaching/para/05-spark/data/large.csv

(use datasets in debug subdir for debugging and additional testing).

The baseline is a reference solution executed on a single node with a single worker. Since the measured times have larger variance in this assignment, we performed several evaluations and set the baseline time to 180 s (as a rounded average of multiple runs after cutting off outliers).