>> PARALLEL APPLICATIONS ON A RASPBERRY PI CLUSTER
"Too many chefs in the kitchen"?
- the same applies to parallel programming if not done right.
Now that we have our Raspberry Pi
cluster
all set up - we have to put it to use, doing more than the boring "Hello World"
examples that seem to be the only thing other tutorials offer. We should apply
it to a real-world problem and see how the cluster performs; such as
brute-forcing
the Pokémon GO magic numbers that we did in a previous entry.
In order to make comparisons on the efficiency of the execution time, we need
a benchmark reference. For this, we will run the application as a single
core process on the master node of the Raspberry Pi - a few tweaks
were required to the code (integrating the
uint128_t library)
as the platform does not support the
Tetra Integer
machine mode, having only a 32-bit CPU.
In order to run parallel applications on the cluster, we must make sure that
there is passwordless SSH access in both directions between the
master node and all slave nodes. MPI also requires that the application reside
in the same location on all nodes; hence we will utilize the /mnt/nfs/
share to execute our application - this will save having to copy files across
devices every time.
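Setting this up is just a matter of generating a key pair on each node and
copying the public key over to every other node - the pi user and the node01
hostname below are placeholders for whatever your cluster actually uses:
$ ssh-keygen -t rsa
$ ssh-copy-id pi@node01
$ ssh pi@node01
Repeat for every node (and in the reverse direction) until ssh between any
pair of nodes no longer prompts for a password.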
Running our benchmark reference, we get:
$ cd /mnt/nfs/brute-force/single
$ make
$ time ./pogohash-brute
:: 13 12 17 16 41 42 47 48
ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
ROUND_MAGIC_LO = 0x081570afdd535ec3
FINAL_MAGIC0 = 0xce7c4801d683e824
FINAL_MAGIC1 = 0x6823775b1daad522
attempts: 29378208
done
real 3m7.600s
user 3m7.560s
sys 0m0.010s
A time of 3 minutes and 7.6 seconds to perform just over 29 million hashes -
roughly 6.4 microseconds per hash.
In order to start writing parallel applications for our Raspberry Pi
cluster, we need to install the
MPI
(Message Passing Interface) libraries on all of our nodes. A native package
exists in the Raspberry Pi repositories, so we simply need to use the package
manager to install it on every node.
$ sudo apt-get install mpich
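mpiexec is pointed at the nodes through a machine file. The HOSTS file used in
the runs below could look something like this - the hostnames and the
four-cores-per-node split are assumptions for illustration, adjust them to
match your own cluster:
node01:4
node02:4
node03:4
node04:4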
After reviewing the MPI library documentation and working through the basic
tutorials, I managed to build a parallel version of the brute-force algorithm
to find the Pokémon GO magic numbers. After defining the appropriate
host files, executing the application on sixteen cores resulted in:
$ cd /mnt/nfs/brute-force/mpi
$ make
$ time mpiexec -machinefile HOSTS -np 16 ./main
:: 13 12 17 16 41 42 47 48
ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
ROUND_MAGIC_LO = 0x081570afdd535ec3
FINAL_MAGIC0 = 0xce7c4801d683e824
FINAL_MAGIC1 = 0x6823775b1daad522
mpi_call: 29378217
done
real 3m12.324s
user 8m10.040s
sys 4m19.210s
Wait... it took longer than the single core version?
That definitely was not expected - but, of course, there is an explanation
as to why. The bottom line is that while each worker core only generated
a fifteenth of the hashes (the master generates none), we must also take into
consideration the overhead of the MPI interface implementation.
The hash function itself only takes a few microseconds to execute (about 6.4
microseconds, going by the single core timing above); if the MPI overhead per
call is of the same order, we really would not see any improvement in
performance. In this MPI application we are using the MPI interface over
29 million times - once per hash - so we definitely need to reduce this if we
are to improve performance.
A few quick modifications to the program and we get:
$ cd /mnt/nfs/brute-force/mpi-optimized
$ make
$ mpiexec -machinefile HOSTS -np 16 ./main
:: 13 12 17 16 41 42 47 48
ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
ROUND_MAGIC_LO = 0x081570afdd535ec3
FINAL_MAGIC0 = 0xce7c4801d683e824
FINAL_MAGIC1 = 0x6823775b1daad522
mpi_call: 169
done
real 0m3.744s
user 0m12.910s
sys 0m0.540s
Now, that is more like it - just under four seconds!
The original brute-force algorithm had eight nested loops - by splitting
that work between the master and the slave nodes, we were able to reduce
the number of MPI interface uses down to only 169. The trade-off is that we
calculated over 36 million hashes across the sixteen different CPU cores.
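A rough sketch of this style of decomposition is shown below. It is not the
article's exact brute-force code: check_candidate() is a stand-in for the real
hash test, only two of the eight loops are shown, and the candidate packing is
purely illustrative. The point is the structure - each rank scans its own
slice of the outermost loop locally, and MPI is only touched once at the end
to combine the results.
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder for the real hash test: returns 1 if (hi, lo) reproduces the
 * known hash outputs, 0 otherwise. */
static int check_candidate(uint64_t hi, uint64_t lo)
{
    (void)hi; (void)lo;
    return 0;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const uint64_t min = 12, max = 48;      /* scanning range, as above */
    uint64_t found[2] = {0, 0};
    int hit = 0;

    /* Split the outermost loop across ranks; everything inside runs with no
     * MPI traffic at all, so communication does not grow with the hash count. */
    for (uint64_t a = min + rank; a <= max && !hit; a += nprocs) {
        for (uint64_t b = min; b <= max && !hit; b++) {
            uint64_t hi = (a << 32) | b;    /* illustrative candidate only */
            uint64_t lo = (b << 32) | a;
            if (check_candidate(hi, lo)) {
                found[0] = hi;
                found[1] = lo;
                hit = 1;
            }
        }
    }

    /* One collective call to gather the winning candidate, instead of one
     * message per hash. */
    uint64_t winner[2] = {0, 0};
    MPI_Allreduce(found, winner, 2, MPI_UINT64_T, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0 && (winner[0] || winner[1]))
        printf("MAGIC_HI = 0x%016llx  MAGIC_LO = 0x%016llx\n",
               (unsigned long long)winner[0],
               (unsigned long long)winner[1]);

    MPI_Finalize();
    return 0;
}
Compiled with mpicc and launched exactly as above (mpiexec -machinefile HOSTS
-np 16 ./sketch), this keeps the number of MPI calls independent of how many
hashes each core grinds through.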
So how do the different versions compare?
                     min: 12, max: 48    min: 12, max: 49
single               187.670 seconds     3923.156 seconds
multi (16 cores)       3.744 seconds       86.195 seconds
speed-up               50.12x              45.51x
By distributing the workload across sixteen different CPU cores, the magic
numbers were found fifty times faster than with a single CPU core
- definitely a noteworthy improvement that demonstrates the power of
running parallel applications. Since we knew the min and max
values exactly, we may have tripped an early exit condition - to get
more accurate performance factors we would need to increase the scanning range
(on-going tests, results to be updated).
There are theoretical speed-up laws, namely
Amdahl's law and
Gustafson's law,
which suggest that the maximum speed-up from using sixteen processes would be
around ten times that of running with a single process. With larger ranges, it
would be natural to see such a scaling factor - as we can see with the revised
timings above for the new min and max values.
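For reference, Amdahl's law puts the speed-up at S = 1 / ((1 - p) + p / N),
where p is the parallelisable fraction of the work and N the number of
processes. Taking p = 0.96 (an illustrative assumption, not a measured value)
with N = 16 gives S = 1 / (0.04 + 0.96/16) = 10, which is where a "ten times"
ceiling would come from.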
The full source code for the single core and the different MPI versions is
available: