>> ANATOMY OF A PARALLEL APPLICATION (MPI BASICS)
It is one thing to just run an application; but another to understand how it
works internally.
In an
earlier blog
we demonstrated the execution of the Pokémon GO brute-force
hash generator algorithm running as a parallel application on our Raspberry Pi
cluster. While we were quick to jump into seeing the performance of our
cluster; one thing we overlooked was explaining how parallel applications
are written and walk through the code to understand how it works.
Up until a few days ago; the only remnants of parallel programming I had
in my head was dating way back to my university years when we had one course
on the subject and we had access to a 64-CPU 1Mhz computer offering
vector processing (run the same command across multiple nodes). The first
step was to study the tutorials over at
mpitutorial.com
to get acquainted with MPI.
Thankfully; I am a quick learner - here are my findings and how I designed
the parallel application.
DECLARATIONS AND CONSTANTS
First of all; we need to define that we wish to use the MPI library -
a simple #include does this.
#include <mpi.h>
#define NODE_MASTER 0
#define MIN_NODES 2
#define MAX_NODES 16
#define TAG_SEED 0x01 // master to slave - are you alive?
#define TAG_WORK 0x02 // master to slave - here is a job for you
#define TAG_DIE 0x03 // master to slave - ok, you can shutdown
#define TAG_RESULT 0x04 // slave to master - here is the result
We also need to define a few constants to help with the flow of the
application. MPI messaging allows for the tagging of messages, which
are great for logic decisions; we have created a few in our case -
the first to seed the slaves; the second to do some work and a third
to tell the slaves they can shutdown. We also need another to send results
from the slave back to the master.
MPI INITIALIZATION
MPI is more than just a library; there is a lot that happens behind
the scenes in regards to inter process communication between different
nodes (utilizing ssh) and the spawning of the various processes.
The good news is that initialization is a one liner; we then need to
identify our position in the MPI space and determine what type of node
we need to act as.
int rank, size;
char name[MPI_MAX_PROCESSOR_NAME];
int name_len;
// initialize MPI
MPI_Init(&argc, &argv);
// get the environment information from MPI
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// get the name of the worker
MPI_Get_processor_name(name, &name_len);
fprintf(stdout, ":: '%s' - process %d of %d\n", name, rank, size);
// do we have enough processes: we need at least two nodes
if ((size < MIN_NODES) || (size > MAX_NODES)) exit(0);
At this point, we have two variables rank and size.
While it is common practice (as we will do) to make the master the node
that has the rank of 0 - it doesn't have to be this way. You
may wish to utilize a specific machine for the master and you could
check the processor name to do this.
switch (rank)
{
// MASTER
case NODE_MASTER:
// put the master logic here
break;
// SLAVE
default:
// put the slave logic here
break;
}
Based on the rank variable - we switch the application into
acting as a master or slave. MPI deploys the same application across
all the nodes where it spawns the process, so we need to manage this
check ourselves and act appropriately, a simple switch or
if-then-else statement will be sufficient to separate the
logic between being a master and a slave.
LOGIC: MASTER NODE
The master node is the brains of the application as it is responsible for
issuing jobs to all the slave nodes until a result it found. The master
will issue three types of messages to the slaves - the first to see if
the slave is alive; the second to issue a job to perform and the last
to shutdown the slave.
int node, job, result;
// seed the slaves
for (node=1; node<size; node++)
MPI_Send(NULL, 0, MPI_INT, node, TAG_SEED, MPI_COMM_WORLD);
// keep passing out tasks to slaves until a solution is found
for (;;)
{
// let's receive the status of the slave computation
MPI_Recv(&result, 1, MPI_INT,
MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &status);
if (result == 1) break;
// if we didn't get a result, issue a new tasks to the same slave
node = status.MPI_SOURCE;
MPI_Send(&job, 1, MPI_INT, node, TAG_WORK, MPI_COMM_WORLD);
}
// inform all slaves to shutdown
for (node=1; node<size; node++)
MPI_Send(NULL, 0, MPI_INT, node, TAG_DIE, MPI_COMM_WORLD);
Seeing the slaves isn't technically required; but it does make the
flow of main processing loop simpler. The basic idea of the master
is to wait until a response is received by a slave and then issue
the slave that responded with a new task. This ensures maximum
use of all the slaves, the last thing we want is slaves waiting for
their turn - as I can try to explain below with two slaves:
{node} {execution time}
----------------------------------------------------------
slave 1: xxxxxxxxxx
slave 2: xxxxxx
slave 2: xxxxxxxxxx
slave 1: xxxxxxxxxxx
salve 2: xxxxxxxxxxxxxxxxxxx
slave 1: xxxxxxxxxxx
slave 1: xxxxxxxxxxx
salve 2: xxxxxxxxxxxxxx
In the above example; we have two slave nodes executing in parallel. We
issue a new task to the one that responds first, as not all tasks may
take the same amount of time. If we were to assume linear assignment -
we would have a number of dead spots, for example (same data):
{node} {execution time}
----------------------------------------------------------
slave 1: xxxxxxxxxx
slave 2: xxxxxx....
slave 1: xxxxxxxxxxx
slave 2: xxxxxxxxx.
slave 1: xxxxxxxxxxx.......
salve 2: xxxxxxxxxxxxxxxxxxx
slave 1: xxxxxxxxx
salve 2: xxxxxxxxx
It should be clear here that points where a dot ('.') exist are
times where the slave will sit idle; waiting for its turn to be issued a
new command, it is great that we know the id of the node that responded
so we can avoid this type of dead time in the parallel execution process.
LOGIC: SLAVE NODE
The slave node just needs to execute the job it has been asked to do do
and send back a result.
// the slaves run until told to shutdown
for (;;)
{
MPI_Status status;
int job, result;
// receive the work request from the master
MPI_Recv(&job, 1, MPI_INT,
NODE_MASTER, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
// attempt to break the code
switch (status.MPI_TAG)
{
case TAG_SEED:
result = 0; // no result, let the master know we are here
break;
case TAG_WORK:
result = /** do the job, based on 'job' params passed **/
break;
case TAG_DIE:
goto SLAVE_DONE;
}
// send the result of the work request back to the master
MPI_Send(&result, 1, MPI_INT, NODE_MASTER, TAG_RESULT, MPI_COMM_WORLD);
}
SLAVE_DONE:;
The core logic is to run in an infinite loop until a TAG_WORK
or TAG_DIE message is received from the master. 99.9% of the
time, the slave will be working way doing some form of complex processing
based on the parameters that have been sent from the master. When it
received the TAG_DIE message, it should break from the infinite
loop and shutdown nicely.
MPI TERMINATION
The final step of the parallel application is to tell the MPI environment
that it can shut itself down.
// shutdown MPI
MPI_Finalize();
In turn, this will clean up all the resources that were utilized to run the
application in parallel across the various nodes that were defined. It will
also bring down any slaves that may not have shutdown correctly and collate
any error messages that may have incurred as a result of the shutdown.
Hopefully; with this knowledge available - you can go back to the
previous blog entry and download the source code and get a better
understanding of how the single process was parallelized using MPI, in
the mpi-sources.zip I have included a number of versions, to
do some time comparisons.
>> max=50/min=12 split=4_4 >> max=50/min=12 split=5_3
mpi_call: 30958 mpi_call: 323521
attempts: 1435441080 attempts: 1428129840
real 2m24.406s real 2m59.222s
If we run the numbers, we can approximate that the overhead of the
MPI messaging framework compared to the time it takes to calculate a
hash lets us know that we can do approximately 1200 hash calculations
in the same time a single MPI request is made. It makes sense to
identify this ratio to ensure you can maximize the benefit of making
the application parallel in the first place.