Parallel neural network: poor performance across nodes
Posted: Mon Apr 07, 2014 12:46 pm
Hi all.
Together with co-authors, I implemented the neural network model published on ModelDB (http://senselab.med.yale.edu/ModelDB/Sh ... del=151126), and we then developed a parallel version of it.
During performance tests, I observed poor scaling.
In particular, when all the cores used for the simulation are on the same node, the model scales well (e.g., 8 cores on a node with an 8-core processor).
When the cores belong to different nodes (e.g., 16 cores, 8 on the first node and 8 on the second), performance is worse than in the 8-core case.
From the paper "Parallel Network Simulations with NEURON" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655137/), I noted the following:

"A distant second in terms of likely benefit for performance optimization is the arrangement of gids that minimizes the number of spikes that have to be communicated (i.e. cell groups that are densely interconnected should as much as possible be on the same machine)."

So, the communications are the problem!
I downloaded Listing 5 ("Parallel implementation of the network with random connectivity") from the paper "Translating network models to parallel hardware in NEURON" and modified the source code to obtain a neural network with 64 cells:
cells with "ID" between 0 and 30 are each connected to 31 other cells (those with "ID" between 0 and 31, excluding themselves);
the cell with "ID" = 31 is connected to 32 cells (those with "ID" between 0 and 32, excluding itself);
the cell with "ID" = 32 is connected to 32 cells (those with "ID" between 31 and 63, excluding itself);
cells with "ID" between 33 and 63 are each connected to 31 other cells (those with "ID" between 32 and 63, excluding themselves).
In this way, I created two densely interconnected halves of the network, linked to each other only through 2 cells (31 and 32).
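The connectivity described above can be sketched in plain Python (a minimal, hypothetical reconstruction — the cell and NetCon machinery of Listing 5 is omitted; this only builds the source-to-targets map to show the two-half structure):

```python
def build_connectivity(ncell=64):
    """Build the gid -> target-gids map for the two-half network.

    Cells 0-31 form one densely connected half, cells 32-63 the other;
    only the 'bridge' cells 31 and 32 reach into the opposite half.
    """
    targets = {}
    for gid in range(ncell):
        if gid <= 30:                      # first half, non-bridge
            group = range(0, 32)
        elif gid == 31:                    # bridge cell of the first half
            group = range(0, 33)           # also reaches cell 32
        elif gid == 32:                    # bridge cell of the second half
            group = range(31, 64)          # also reaches cell 31
        else:                              # second half, non-bridge
            group = range(32, 64)
        targets[gid] = [t for t in group if t != gid]
    return targets

targets = build_connectivity()
# The only cross-half edges are 31 -> 32 and 32 -> 31.
assert 32 in targets[31] and 31 in targets[32]
```

With this layout, only spikes from cells 31 and 32 ever need to cross the node boundary when each half lives on its own node.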
I then created two versions of this network: the first one (called "network A") with a round-robin assignment of the cells across all ranks; the second one (called "network B") with the following scheme:
cells with "ID" between 0 and 31 on the 8 cores of the first node, assigned round-robin;
cells with "ID" between 32 and 63 on the 8 cores of the second node, assigned round-robin.
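The two gid-distribution schemes can be sketched as follows (hypothetical helper names; the actual Listing 5 code registers each gid with ParallelContext, and which ranks land on which physical node depends on how mpiexec maps ranks to hosts):

```python
def assign_round_robin(ncell, nhost):
    """Network A: gid i goes to rank i % nhost, ignoring node topology."""
    return {gid: gid % nhost for gid in range(ncell)}

def assign_split(ncell, nhost):
    """Network B: first half of the gids round-robin over the first node's
    ranks (0 .. nhost/2 - 1), second half over the second node's ranks.
    Assumes ranks 0..nhost/2-1 are on node 1 and the rest on node 2.
    """
    half_ranks = nhost // 2
    ranks = {}
    for gid in range(ncell):
        if gid < ncell // 2:
            ranks[gid] = gid % half_ranks               # node 1 ranks
        else:
            ranks[gid] = half_ranks + gid % half_ranks  # node 2 ranks
    return ranks
```

On each rank, a cell would then be instantiated only when its assigned rank equals the local rank id, so network B keeps each densely connected half on one node.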
Finally, I ran both "network A" and "network B" with 8 and 16 cores.
I observed that "network B" with 16 cores performs better than "network A" with 16 cores.
However, "network B" with 8 cores still performs better than "network B" with 16 cores.
Looking at the timings, the "step_time" scales with the number of cores, but the "wait_time" grows drastically and dominates the total run time.
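One way to see why the total run time gets worse is to look at the fraction of the run spent waiting for spike exchange. A minimal sketch with made-up numbers (the timings are illustrative, not my measured values; "wait_time" stands for whatever spike-exchange wait your instrumentation reports):

```python
def wait_fraction(step_time, wait_time):
    """Fraction of the integration loop spent blocked in spike exchange.

    step_time: time spent integrating (computation);
    wait_time: time spent waiting for other ranks' spikes.
    """
    return wait_time / (step_time + wait_time)

# Illustrative (made-up) numbers: computation halves going from 8 to 16
# cores, but inter-node waiting grows, so total time increases anyway.
f8  = wait_fraction(step_time=10.0, wait_time=1.0)   # waiting ~9% of the loop
f16 = wait_fraction(step_time=5.0,  wait_time=7.0)   # waiting ~58% of the loop
assert f16 > f8
```

If step_time + wait_time is larger on 16 cores than on 8, adding the second node costs more in communication than it saves in computation.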
Could this decrease in performance be due to an incorrect MPI configuration?