>> PARALLEL APPLICATIONS ON A RASPBERRY PI CLUSTER
"Too many chefs in the kitchen"?
- the same applies to parallel programming if not done right.
Now that we have our Raspberry Pi
cluster
all set up - we have to put it to use, doing more than the boring "Hello World"
examples that seem to be the only thing other tutorials offer. We should apply
it to a real-world problem and see how the cluster performs; such as
brute-forcing
the Pokémon GO magic numbers that we did in a previous entry.
In order to make comparisons on the efficiency of the execution time, we need
a benchmark reference. For this, we will run the application as a single
core process on the master node of the Raspberry Pi - a few tweaks
were required to the code (integrating the
uint128_t library)
as the platform does not support the
Tetra Integer
machine mode, having only a 32-bit CPU.
In order to run parallel applications on the cluster, we must make sure that
there is passwordless SSH access in both directions between the
master node and all slave nodes. MPI also requires that the application reside
in the same location on all nodes; hence we will utilize the /mnt/nfs/
share to execute our application - this will save having to copy files across
devices every time.
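Setting this up is just a matter of generating a key pair on each node and
copying the public key over to every other node - the pi user and the node01
hostname below are placeholders for whatever your cluster actually uses:
$ ssh-keygen -t rsa
$ ssh-copy-id pi@node01
$ ssh pi@node01
Repeat for every node (and in the reverse direction) until ssh between any
pair of nodes no longer prompts for a password.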
Running our benchmark reference, we get:
$ cd /mnt/nfs/brute-force/single
$ make
$ time ./pogohash-brute
:: 13 12 17 16 41 42 47 48
ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
ROUND_MAGIC_LO = 0x081570afdd535ec3
FINAL_MAGIC0 = 0xce7c4801d683e824
FINAL_MAGIC1 = 0x6823775b1daad522
attempts: 29378208
done
real 3m7.600s
user 3m7.560s
sys 0m0.010s
A time of 3 minutes and 7.6 seconds to perform just over 29 million hashes -
roughly 6.4 microseconds per hash.
In order to start writing parallel applications for our Raspberry Pi
cluster, we need to install the
MPI
(Message Passing Interface) libraries on all of our nodes. A native package
exists in the Raspberry Pi repositories, so we simply need to use the package
manager to install it on every node.
$ sudo apt-get install mpich
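mpiexec is pointed at the nodes through a machine file. The HOSTS file used in
the runs below could look something like this - the hostnames and the
four-cores-per-node split are assumptions for illustration, adjust them to
match your own cluster:
node01:4
node02:4
node03:4
node04:4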
After reviewing the MPI library documentation and working through the basic
tutorials, I managed to build a parallel version of the brute-force algorithm
to find the Pokémon GO magic numbers. After defining the appropriate
host files, executing the application on sixteen cores resulted in:
$ cd /mnt/nfs/brute-force/mpi
$ make
$ time mpiexec -machinefile HOSTS -np 16 ./main
:: 13 12 17 16 41 42 47 48
ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
ROUND_MAGIC_LO = 0x081570afdd535ec3
FINAL_MAGIC0 = 0xce7c4801d683e824
FINAL_MAGIC1 = 0x6823775b1daad522
mpi_call: 29378217
done
real 3m12.324s
user 8m10.040s
sys 4m19.210s
Wait... it took longer than the single core version?
That definitely was not expected - but, of course, there is an explanation
as to why. The bottom line is that while each worker core only generated
a fifteenth of the hashes (the master generates none), we must also take into
consideration the overhead of the MPI interface implementation.
The hash function itself only takes a few microseconds to execute (about 6.4
microseconds, going by the single core timing above); if the MPI overhead per
call is of the same order, we really would not see any improvement in
performance. In this MPI application we are using the MPI interface over
29 million times - once per hash - so we definitely need to reduce this if we
are to improve performance.
A few quick modifications to the program and we get:
$ cd /mnt/nfs/brute-force/mpi-optimized
$ make
$ mpiexec -machinefile HOSTS -np 16 ./main
:: 13 12 17 16 41 42 47 48
ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
ROUND_MAGIC_LO = 0x081570afdd535ec3
FINAL_MAGIC0 = 0xce7c4801d683e824
FINAL_MAGIC1 = 0x6823775b1daad522
mpi_call: 169
done
real 0m3.744s
user 0m12.910s
sys 0m0.540s
Now, that is more like it - just under four seconds!
The original brute-force algorithm had eight nested loops - by splitting
that work between the master and the slave nodes, we were able to reduce
the number of MPI interface uses down to only 169. The trade-off is that we
calculated over 36 million hashes across the sixteen different CPU cores.
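A rough sketch of this style of decomposition is shown below. It is not the
article's exact brute-force code: check_candidate() is a stand-in for the real
hash test, only two of the eight loops are shown, and the candidate packing is
purely illustrative. The point is the structure - each rank scans its own
slice of the outermost loop locally, and MPI is only touched once at the end
to combine the results.
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder for the real hash test: returns 1 if (hi, lo) reproduces the
 * known hash outputs, 0 otherwise. */
static int check_candidate(uint64_t hi, uint64_t lo)
{
    (void)hi; (void)lo;
    return 0;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const uint64_t min = 12, max = 48;      /* scanning range, as above */
    uint64_t found[2] = {0, 0};
    int hit = 0;

    /* Split the outermost loop across ranks; everything inside runs with no
     * MPI traffic at all, so communication does not grow with the hash count. */
    for (uint64_t a = min + rank; a <= max && !hit; a += nprocs) {
        for (uint64_t b = min; b <= max && !hit; b++) {
            uint64_t hi = (a << 32) | b;    /* illustrative candidate only */
            uint64_t lo = (b << 32) | a;
            if (check_candidate(hi, lo)) {
                found[0] = hi;
                found[1] = lo;
                hit = 1;
            }
        }
    }

    /* One collective call to gather the winning candidate, instead of one
     * message per hash. */
    uint64_t winner[2] = {0, 0};
    MPI_Allreduce(found, winner, 2, MPI_UINT64_T, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0 && (winner[0] || winner[1]))
        printf("MAGIC_HI = 0x%016llx  MAGIC_LO = 0x%016llx\n",
               (unsigned long long)winner[0],
               (unsigned long long)winner[1]);

    MPI_Finalize();
    return 0;
}
Compiled with mpicc and launched exactly as above (mpiexec -machinefile HOSTS
-np 16 ./sketch), this keeps the number of MPI calls independent of how many
hashes each core grinds through.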
So how do the different versions compare?
                     min: 12, max: 48    min: 12, max: 49
single               187.670 seconds     3923.156 seconds
multi (16 cores)       3.744 seconds       86.195 seconds
speed-up               50.12x              45.51x
By distributing the workload across sixteen different CPU cores, the magic
numbers were found fifty times faster than with a single CPU core
- definitely a noteworthy improvement that demonstrates the power of
running parallel applications. Since we knew the min and max
values exactly, we may have tripped an early exit condition - to get
more accurate performance factors we would need to increase the scanning range
(on-going tests, results to be updated).
There are theoretical speed-up laws, namely
Amdahl's law and
Gustafson's law,
which suggest that the maximum speed-up from using sixteen processes would be
around ten times that of running with a single process. With larger ranges, it
would be natural to see such a scaling factor - as we can see with the revised
timings above for the new min and max values.
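For reference, Amdahl's law puts the speed-up at S = 1 / ((1 - p) + p / N),
where p is the parallelisable fraction of the work and N the number of
processes. Taking p = 0.96 (an illustrative assumption, not a measured value)
with N = 16 gives S = 1 / (0.04 + 0.96/16) = 10, which is where a "ten times"
ceiling would come from.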
The full source code for the single core and the different MPI versions is
available: