Aaron Ardiri

 

Internet of Things (IoT)
   


2016-11-29
>> PARALLEL APPLICATIONS ON A RASPBERRY PI CLUSTER

"Too many chefs in the kitchen"? - the same applies to parallel programming if not done right.

Now that we have our Raspberry Pi cluster all set up, we have to put it to use - doing more than the boring "Hello World" examples that seem to be the only thing other tutorials offer. We should apply a real-world problem and see how the cluster performs; such as brute-forcing the Pokémon GO magic numbers, as we did in a previous entry.

In order to make comparisons on the efficiency of the execution time, we need a benchmark reference. For this, we will run the application as a single-core process on the master node of the Raspberry Pi - a few tweaks were required to the code (integrating the uint128_t library), as the platform only has a 32-bit CPU and does not support the compiler's Tetra Integer (128-bit) machine mode.
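
As the integration itself is specific to the original source, here is just a rough sketch of what such an emulation involves - multiplying two 64-bit values into a 128-bit result using 32-bit halves. The type and function names (u128, mul64x64) are placeholders, not the actual uint128_t library API:

    /* sketch: 64x64 -> 128-bit multiply without compiler __int128 support */
    #include <stdint.h>

    typedef struct { uint64_t hi, lo; } u128;   /* placeholder 128-bit type */

    static u128 mul64x64(uint64_t a, uint64_t b)
    {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t p0 = a_lo * b_lo;   /* low  x low  */
        uint64_t p1 = a_lo * b_hi;   /* low  x high */
        uint64_t p2 = a_hi * b_lo;   /* high x low  */
        uint64_t p3 = a_hi * b_hi;   /* high x high */

        /* carry out of the low 64 bits when the partial products are summed */
        uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;

        u128 r;
        r.lo = p0 + (p1 << 32) + (p2 << 32);
        r.hi = p3 + (p1 >> 32) + (p2 >> 32) + carry;
        return r;
    }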

In order to run parallel applications on the cluster, we must make sure that there is password-less SSH access in both directions between the master node and all slave nodes. MPI also requires that the application reside in the same location on all nodes; hence we will execute our application from the /mnt/nfs/ share - this saves having to copy files across devices every time.

Running our benchmark reference, we get:

    $ cd /mnt/nfs/brute-force/single
    $ make
    $ time ./pogohash-brute
    :: 13 12 17 16 41 42 47 48
    ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
    ROUND_MAGIC_LO = 0x081570afdd535ec3
    FINAL_MAGIC0   = 0xce7c4801d683e824
    FINAL_MAGIC1   = 0x6823775b1daad522
    attempts: 29378208
    done
    
    real    3m7.600s
    user    3m7.560s
    sys     0m0.010s

A time of 3 minutes and 7.6 seconds to perform over 29 million hashes - roughly 157,000 hashes per second on a single core.

In order to start writing parallel applications for our Raspberry Pi cluster, we need to install the MPI (Message Passing Interface) libraries on all of our nodes. A native package exists for the Raspberry Pi, so we simply need to use the platform tools to install it.

    $ sudo apt-get install mpich
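
The basic structure of an MPI program is straightforward. As a minimal sketch (a generic hello-rank example, not the brute-force code itself), each process asks the runtime for its rank and the total process count, and can then decide which part of the work is its own:

    /* minimal MPI skeleton - compile with mpicc, launch with mpiexec */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process' id (0..size-1) */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes    */

        printf("process %d of %d reporting in\n", rank, size);

        MPI_Finalize();                        /* shut the MPI runtime down    */
        return 0;
    }

Such a program is compiled with mpicc and launched across the cluster with mpiexec -machinefile HOSTS -np 16, just like the runs shown below.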

After reviewing the MPI library documentation and working through the basic tutorials, I managed to build a parallel version of the brute-force algorithm to find the Pokémon GO magic numbers. After defining the appropriate host files, executing the application on sixteen cores resulted in:

    $ cd /mnt/nfs/brute-force/mpi
    $ make
    $ time mpiexec -machinefile HOSTS -np 16 ./main
    :: 13 12 17 16 41 42 47 48
    ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
    ROUND_MAGIC_LO = 0x081570afdd535ec3
    FINAL_MAGIC0   = 0xce7c4801d683e824
    FINAL_MAGIC1   = 0x6823775b1daad522
    mpi_call: 29378217
    done
    
    real    3m12.324s
    user    8m10.040s
    sys     4m19.210s

Wait... it took longer than the single core version?

That definitely was not expected - but, of course, there is an explanation as to why. The bottom line is that while each slave core may have generated fifteen times fewer hashes (the master node generates none), we must also take into consideration the overhead of the MPI interface implementation.

The hash function itself only takes a few milliseconds to execute; if the MPI overhead is of the same order, we really would not see any improvement in performance. In this MPI application we are using the MPI interface over 29 million times - we definitely need to reduce this if we are to improve performance.
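
As a rough sketch of the kind of restructuring this calls for (not the actual brute-force code - try_candidate() below is a hypothetical stand-in for the hash test, and the search size is just the attempt count from the single-core run, used for illustration), each rank walks its own stride of the search space locally and only talks to MPI once at the end:

    /* sketch: batch the work per rank instead of one MPI call per hash */
    #include <mpi.h>
    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical stand-in for the hash test - returns 1 on a match */
    static int try_candidate(uint64_t i) { (void)i; return 0; }

    int main(int argc, char **argv)
    {
        int rank, size;
        unsigned long long local_hit = 0, global_hit = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* rank r tests candidates r, r+size, r+2*size, ... locally */
        for (uint64_t i = (uint64_t)rank; i < 29378208ULL; i += (uint64_t)size)
            if (try_candidate(i)) local_hit = i;

        /* a single collective call replaces millions of per-hash calls */
        MPI_Reduce(&local_hit, &global_hit, 1, MPI_UNSIGNED_LONG_LONG,
                   MPI_MAX, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("matching candidate (if any): %llu\n", global_hit);

        MPI_Finalize();
        return 0;
    }

The real code splits the outer loops of the original algorithm between the master and the slave nodes in a similar spirit.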

A few quick modifications to the program and we get:

    $ cd /mnt/nfs/brute-force/mpi-optimized
    $ make
    $ time mpiexec -machinefile HOSTS -np 16 ./main
    :: 13 12 17 16 41 42 47 48
    ROUND_MAGIC_HI = 0xe3f0d44988bcdfab
    ROUND_MAGIC_LO = 0x081570afdd535ec3
    FINAL_MAGIC0   = 0xce7c4801d683e824
    FINAL_MAGIC1   = 0x6823775b1daad522
    mpi_call: 169
    done
    
    real    0m3.744s
    user    0m12.910s
    sys     0m0.540s

Now, that is more like it - just under four seconds!

The original brute-force algorithm had eight nested loops - by splitting the work between the master and the slave nodes, we were able to reduce the number of MPI interface calls down to only 169, yet we still calculated over 36 million hashes across the sixteen different CPU cores.

So how do the different versions compare?

     -- max: 48, min: 12 --            |  -- max: 49, min: 12 --
    single:  187.670 seconds           | single: 3923.156 seconds
    multi:     3.744 seconds // 50.12x | multi:    86.195 seconds // 45.51x

By distributing the workload across sixteen CPU cores, we were able to find the magic numbers fifty times faster than with a single core - definitely a noteworthy improvement that demonstrates the power of running parallel applications. Since we knew the min and max values exactly, we may have tripped an early exit condition - to get more accurate performance factors we would need to increase the scanning range (on-going tests, results to be updated).

There are theoretical speed-up laws - namely Amdahl's law and Gustafson's law - which suggest that the maximum speed-up from using sixteen processes would be around ten times that of running with a single process. With larger ranges, it would be natural to see such a scaling factor - as we can see with the revised timings above using the new min and max values.
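
For reference, Amdahl's law bounds the speed-up S achievable with N processes when a fraction p of the program can be parallelised:

    S(N) = 1 / ((1 - p) + p / N)

A cap of roughly ten times with sixteen processes corresponds to a serial fraction of about four percent: with p = 0.96, S(16) = 1 / (0.04 + 0.96/16) = 10.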

The full source code for the single-core and the different MPI versions is available:


 


 



