Parallel neural network: poor performance across nodes
Posted: Mon Apr 07, 2014 12:46 pm
Hi all.
Together with co-authors, I implemented the neural network model published on ModelDB (http://senselab.med.yale.edu/ModelDB/Sh ... del=151126), and we then developed a parallel version of it.
During performance tests, I observed poor scaling.
In particular, when all the cores used for the simulation are on the same node, the model scales well (e.g., 8 cores on a node with an 8-core processor).
When the cores belong to different nodes (e.g., 16 cores, 8 on the first node and 8 on the second), performance is worse than in the 8-core case.
From the paper "Parallel Network Simulations with NEURON" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655137/), I noted the following:

"A distant second in terms of likely benefit for performance optimization is the arrangement of gids that minimizes the number of spikes that have to be communicated (i.e. cell groups that are densely interconnected should as much as possible be on the same machine)."

So, the communications are the problem!
I downloaded Listing 5 ("Parallel implementation of the network with random connectivity") from the paper "Translating network models to parallel hardware in NEURON" and modified the source code to obtain a neural network with 64 cells:
cells with "ID" between 0 and 30 are each connected to 31 other cells (those with "ID" between 0 and 31, excluding themselves);
the cell with "ID" = 31 is connected to 32 cells (those with "ID" between 0 and 32, excluding itself);
the cell with "ID" = 32 is connected to 32 cells (those with "ID" between 31 and 63, excluding itself);
cells with "ID" between 33 and 63 are each connected to 31 other cells (those with "ID" between 32 and 63, excluding themselves).
In this way, I created two densely interconnected halves of the network, linked to each other only through 2 cells (31 and 32).
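The connectivity described above can be sketched in plain Python (a minimal, hypothetical reconstruction — the cell and NetCon machinery of Listing 5 is omitted; this only builds the source-to-targets map to show the two-half structure):

```python
def build_connectivity(ncell=64):
    """Build the gid -> target-gids map for the two-half network.

    Cells 0-31 form one densely connected half, cells 32-63 the other;
    only the 'bridge' cells 31 and 32 reach into the opposite half.
    """
    targets = {}
    for gid in range(ncell):
        if gid <= 30:                      # first half, non-bridge
            group = range(0, 32)
        elif gid == 31:                    # bridge cell of the first half
            group = range(0, 33)           # also reaches cell 32
        elif gid == 32:                    # bridge cell of the second half
            group = range(31, 64)          # also reaches cell 31
        else:                              # second half, non-bridge
            group = range(32, 64)
        targets[gid] = [t for t in group if t != gid]
    return targets

targets = build_connectivity()
# The only cross-half edges are 31 -> 32 and 32 -> 31.
assert 32 in targets[31] and 31 in targets[32]
```

With this layout, only spikes from cells 31 and 32 ever need to cross the node boundary when each half lives on its own node.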
I then created two versions of this network: the first one (called "network A") with a round-robin assignment of the cells across all ranks; the second one (called "network B") with the following scheme:
cells with "ID" between 0 and 31 on the 8 cores of the first node, assigned round-robin;
cells with "ID" between 32 and 63 on the 8 cores of the second node, assigned round-robin.
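The two gid-distribution schemes can be sketched as follows (hypothetical helper names; the actual Listing 5 code registers each gid with ParallelContext, and which ranks land on which physical node depends on how mpiexec maps ranks to hosts):

```python
def assign_round_robin(ncell, nhost):
    """Network A: gid i goes to rank i % nhost, ignoring node topology."""
    return {gid: gid % nhost for gid in range(ncell)}

def assign_split(ncell, nhost):
    """Network B: first half of the gids round-robin over the first node's
    ranks (0 .. nhost/2 - 1), second half over the second node's ranks.
    Assumes ranks 0..nhost/2-1 are on node 1 and the rest on node 2.
    """
    half_ranks = nhost // 2
    ranks = {}
    for gid in range(ncell):
        if gid < ncell // 2:
            ranks[gid] = gid % half_ranks               # node 1 ranks
        else:
            ranks[gid] = half_ranks + gid % half_ranks  # node 2 ranks
    return ranks
```

On each rank, a cell would then be instantiated only when its assigned rank equals the local rank id, so network B keeps each densely connected half on one node.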
Finally, I ran both "network A" and "network B" with 8 and 16 cores.
I observed that "network B" with 16 cores performs better than "network A" with 16 cores.
However, "network B" with 8 cores still performs better than "network B" with 16 cores.
Looking at the timings, the "step_time" scales with the number of cores, but the "wait_time" grows drastically and dominates the total run time.
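One way to see why the total run time gets worse is to look at the fraction of the run spent waiting for spike exchange. A minimal sketch with made-up numbers (the timings are illustrative, not my measured values; "wait_time" stands for whatever spike-exchange wait your instrumentation reports):

```python
def wait_fraction(step_time, wait_time):
    """Fraction of the integration loop spent blocked in spike exchange.

    step_time: time spent integrating (computation);
    wait_time: time spent waiting for other ranks' spikes.
    """
    return wait_time / (step_time + wait_time)

# Illustrative (made-up) numbers: computation halves going from 8 to 16
# cores, but inter-node waiting grows, so total time increases anyway.
f8  = wait_fraction(step_time=10.0, wait_time=1.0)   # waiting ~9% of the loop
f16 = wait_fraction(step_time=5.0,  wait_time=7.0)   # waiting ~58% of the loop
assert f16 > f8
```

If step_time + wait_time is larger on 16 cores than on 8, adding the second node costs more in communication than it saves in computation.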
Could this decrease in performance be due to an incorrect MPI configuration?