Parallel NEURON/H'ware - batch of individual runs

General issues of interest both for network and
individual cell parallelization.

Moderator: hines

Malbolge

Parallel NEURON/H'ware - batch of individual runs

Post by Malbolge »

Hi...

First I'll describe the problem, or set of simulations that I'm looking to run, and then I'll outline the approach I am considering... but I'm a little unsure whether it is the preferred or desirable approach. I'm largely new to parallel NEURON (and parallel computing), but have been using serial NEURON for about 15 months or so, and have a reasonable grounding in OO programming.

The field I'm looking at is the effects of noise and heterogeneity on the response of MVN neurons. Currently I'm just looking at individual neurons (we collect simulation data from X individual runs, using the same input and experimental conditions but different initial voltages and different random noise added to the input, and consider that a population response). I have HOC code for running these simulations serially... My RUN.HOC loads some globals, an MVN neuron and an analysis.HOC (for recording spike times), and adds a point process for providing input (a sine wave current of a given frequency and amplitude)... it then runs X simulations, one after the other. Each simulation uses analysis.HOC to record the spike times for that run to a vector, which (after each run) is appended to a list of vectors. After all the simulations are run, the spike time list is written to a file for later analysis... so I end up with a .txt file containing X vectors of spike times. However, the number of individual runs I'm looking to perform is in the 10,000s (with different experimental models, volumes of noise, input frequencies/amplitudes etc.), and so I'm looking to utilise a ~60 node UNIX cluster to reduce the time it takes to collect all the data.
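
For illustration, a stripped-down sketch of that serial batch (the single-compartment hh cell, IClamp and the names here are just stand-ins for the MVN model and the sinusoidal/noisy stimulus, not my actual code):

Code: Select all

load_file("nrngui.hoc")

create soma
soma {
    L = 20
    diam = 20
    insert hh
}

objref stim, nc, nil, spkVec, spikeList, vInitRng
soma stim = new IClamp(0.5)
stim.del = 0
stim.dur = 200
stim.amp = 0.1

soma nc = new NetCon(&v(0.5), nil)   // used only to record spike times
vInitRng = new Random(1)
vInitRng.uniform(-70, -60)
spikeList = new List()

tstop = 200
NRUNS = 10                           // placeholder; the real batch would be X runs
for i = 0, NRUNS - 1 {
    spkVec = new Vector()
    nc.record(spkVec)                // spike times for this run end up here
    v_init = vInitRng.repick()       // a different initial voltage each run
    run()
    spikeList.append(spkVec)         // one Vector of spike times per run
}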

The approach I am considering is to use the cluster to run the above serial sequence, but on X hosts. That is, I would have a main.HOC which, given a number of processors (say 50), would make those processors/hosts run through a similar sequence to the RUN.HOC described above (they'll each create their own analysis.HOC and an MVN neuron, attach the point process and run Y simulations, storing each simulation's spike times in a list of vectors). After that, the main.HOC would collect each host's spike time list (each with Y vectors, one for each run) and append them all together to give a list of X*Y vectors, which would be written to a file. My population size currently is 500, so I would hope to use 50 hosts, each running 10 simulations (one after the other).

Is this a worthwhile approach to take? Is it even possible? From what I've read on parallel NEURON I assume it is... by templating my MVN.HOC, analysis.HOC and RUN.HOC (anything of which multiple instances will be created), and having the main.HOC create a new RUN.HOC on each host (which then creates its own MVN.HOC and analysis.HOC), each instance will be unique to that host. For example, the 'spikelist' list (holding the vectors of spike times for a given host's runs) will be unique to that host (other hosts won't write spike times to that list), but can be collected/addressed by the main.HOC (using each host's ID or similar).

Would there be any difficulties with this approach (if it is even possible)? I use 3 or 4 random streams, and am aware that each will have to be unique for a given host (with each host given a unique integer for seeding, or for producing seeds for its streams). If any further information is required, I'd be happy to supply it.
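
For example, the kind of per-host seeding I have in mind would be something like this (baseSeed is just a placeholder value):

Code: Select all

objref pc, noiseRng
pc = new ParallelContext()
baseSeed = 12345                         // placeholder base seed
noiseRng = new Random(baseSeed + pc.id)  // each rank seeds its own stream
noiseRng.normal(0, 1)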

Thanks in advance for any help or advice that can be given. My apologies if this is a very obvious and apparent question.

James
ted
Site Admin
Posts: 6300
Joined: Wed May 18, 2005 4:50 pm
Location: Yale University School of Medicine

Re: Parallel NEURON/H'ware - batch of individual runs

Post by ted »

In principle, what you describe is certainly doable. It falls into the class of "embarrassingly parallel problems" for which NEURON has the tools to do the job. In case you have not already read the Programmer's Reference entry on ParallelContext, here it is:
http://www.neuron.yale.edu/neuron/stati ... arcon.html

The one thing that might slow you down is that the strategy you describe would have each processor pass a bunch of numbers back to the master. This is certainly doable with the submit() method
http://www.neuron.yale.edu/neuron/stati ... tml#submit
but the only way to determine whether this introduces intolerable overhead is to try it.
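
For orientation, a bare-bones sketch of the submit/working pattern (onerun here is just a placeholder task, not your model code):

Code: Select all

objref pc
pc = new ParallelContext()

func onerun() {     // placeholder task: returns a value derived from its argument
    return $1 * $1
}

pc.runworker()      // from here on, ranks > 0 only execute submitted tasks
for i = 0, 9 {
    pc.submit("onerun", i)
}
while (pc.working) {
    print pc.retval // the value returned by whichever task just finished
}
pc.done()
quit()
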
Malbolge

Re: Parallel NEURON/H'ware - batch of individual runs

Post by Malbolge »

Hi Ted,

Thanks for the quick response. I figured the problem was embarrassingly parallel, and I had come across the ParallelContext (during an abortive attempt to parallelise the simulations about 10 months ago). You've reassured me that it is worthwhile going with this design... I have done some distributed/parallel computing, but not much and none in NEURON, so I was a little unsure whether this approach would work. As for the overhead of passing the numbers back to the master, if it turns out to be too much, I do have a few ideas... but I'll look at them when the time comes.

Thanks again.

James.
Malbolge

Re: Parallel NEURON/H'ware - batch of individual runs

Post by Malbolge »

Hello Again...

Further to my initial post(s), I have a couple more enquiries. I'm still working on the same problem, as detailed above, but have realised that my mental model for how parallel NEURON works was a little off. I have since modified the design of my implementation, and just want to check that I'm still on the right track.

I've realised now that if I start "example.hoc" on the cluster I'm using (using something like "mpiexec -np 10 (or n) -wdir `pwd` nrniv -mpi example.hoc" ) it starts and runs the code in "example.hoc" on all n processors. With my old mental model, I thought it would start example.hoc just on the top node (making it a 'Master'), with x processors available, which could then be assigned slave.hocs. I've realised this was wrong... So, with my new mental model, I'm assuming I'll run A.hoc, which will run/execute on the top node AND all x processors. If A.hoc were something like this...

Code: Select all

// Define globals, experimental variables, etc.
// Initialise random streams, objects, lists, etc.
load_file("MVNB.hoc")    // neuron to be simulated
// Instrumentation: attach stimulation point process, set up Vector for recording spike times, etc.

// procedure to do 1 run: runs through the simulation, records spike times in a Vector
proc singleRun() {
    // ...
}

// procedure to do a batch of runs: do 1 run, numberOfRuns times
proc batchRun() {
    for j = 0, numberOfRuns - 1 {
        singleRun()
        // add this run's spike time Vector to a spike time List
    }
}
I'm assuming I would end up with each processor creating its own MVNB cell, its own random streams, objects, lists etc., then running through multiple runs and storing its own List of Vectors. After this, could I use something like...

Code: Select all

if (pc.id == 0) {    // i.e. the master
    // for each host:
    //     get its spike time List and append it onto a FinalResults List
}
Essentially, using the batchRun { do singleRun() x times } as I would in a serial setup, with the if (pc.id == 0) { collect lists } added on at the end.

Or... should I be using the submit function from ParallelContext? I.e., define a function to perform 1 run and return a Vector of spike times, then submit that function X times (in a for loop)... so the singleRun function will be executed X times and I'll end up with X Vectors, using n processors, each run returning a Vector which is appended onto a list. I've never had any experience with bulletin-board-style systems, and am only just coming to grips with submit, working etc. from ParallelContext.

I have a feeling I should really be using the submit function, rather than just running a pure serial implementation on X processors and then having the 'master' collect spike lists... but if I can do it that way, how exactly would I go about having the master collect the spike lists from each processor?

Thanks, again, in advance for any help.

James.

Edit: I've had another thought... Would it be possible to keep the serial-style implementation (each host does a batchRun of X singleRuns), then write a function (returnSpikesList) to return the spikeList, and at the end use submit to go through each host, call its returnSpikesList function, and append the returned Lists onto allResultsList? That is, could I get the master to collate the spike lists at the end by using the submit method? Can I use the submit function to call a function on a specific host, returning its unique spikeList, and append each list (in order or not) into the allResultsList? Or, by using submit, will the returnSpikesList functions be called on whichever hosts answer the call (so one host may have its function called more than once, and I'll end up with an incomplete set of spike times, with some duplicated)?
Malbolge

Re: Parallel NEURON/H'ware - batch of individual runs

Post by Malbolge »

I've been playing around with the ParallelContext submit, working etc. methods, and I think I have the hang of the bulletin-board-style system. I've decided I'll use the BBS to submit and run all of the simulations, returning a vector of spike times each time, which the master host will unpack and append to a list. Then, after all the tasks have completed, the master will do a little work on the list and write it to a file. I assume this would be the more efficient method, as opposed to performing all the simulations and then using the BBS submit to return/collate the spike times, because while the hosts are running simulations the master can be collating the returned results... so simulations and result collection can occur concurrently, rather than simulations THEN collection happening in sequence.

I do have a couple of questions, however.

Firstly, when I submit tasks, they are functions... is it acceptable for a func to run a simulation? I.e. could the submitted function be something like...

Code: Select all

func doRun() {
    // code for attaching the point process, stim
    stim.randNoiseSeed = $1
    V_INIT = $2
    stim.pacemakerAmp = $3
    finitialize(V_INIT)
    run()
    // code for recording spike times to rsVec
    id = hoc_ac_           // userid of this submitted task
    pc.post(id, rsVec)
    return 0               // a func must return a value
}
with the submit loop something along the lines of...

Code: Select all

for i = 0, x {
    a = vInitRandomStream.repick()
    b = pacemakerRandomStream.repick()
    pc.submit("doRun", i, a, b)
}
My second question was about the difference between "while (pc.working)" and "while ((id = pc.working) != 0)", but I think I've figured that out for myself.
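
For reference, the master-side collection loop I have in mind would be roughly the following (assuming pc, spikeList and the pc.submit("doRun", ...) loop above already exist):

Code: Select all

// assumes pc, spikeList, and the pc.submit("doRun", ...) loop above are already in place
while ((id = pc.working) != 0) {   // id is the userid of whichever task just finished
    pc.take(id)                    // retrieve the Vector that doRun posted under that id
    spikeList.append(pc.upkvec)    // unpack it into a new Vector and store it
}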

Thanks.

James.
hines
Site Admin
Posts: 1691
Joined: Wed May 18, 2005 3:32 pm

Re: Parallel NEURON/H'ware - batch of individual runs

Post by hines »

It seems likely the bulletin board is the right choice for your problem.
The lower-level MPI style is close to the machine, and a principal requirement
for communication is synchronization of all processes. Also, things have to
be mapped so that knowledge of pc.id (and pc.nhost) is all that is necessary for
a process to carry out its tasks. By the time you get MPI working for your problem,
all the code you write is likely to be reminiscent of a bulletin board. Remember that
efficiency for MPI requires load balance.

I'd recommend not putting everything on the bulletin board at once. To keep the master
and workers at 100% it is merely necessary to keep the bulletin board todo list nonempty.
The idiom is shown in nrn/src/parallel/test1.hoc, where 10 jobs are submitted and
every time the master retrieves a result, the master submits one more. Initially submitting 2*pc.nhost
of them is where I would start.
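
Roughly, the idiom looks like this (runJob here is just a placeholder task, not test1.hoc itself; the points are the 2*pc.nhost priming and the one-for-one resubmission):

Code: Select all

objref pc
pc = new ParallelContext()

func runJob() {          // placeholder task; $1 is just the job index
    return $1
}

NJOBS = 500
pc.runworker()

nsubmitted = 0
// prime the todo list with a couple of jobs per host
for i = 0, 2*pc.nhost - 1 {
    if (nsubmitted < NJOBS) {
        pc.submit("runJob", nsubmitted)
        nsubmitted += 1
    }
}
// each time a result comes back, hand out one more job
while (pc.working) {
    // ... use pc.retval (or pc.take) on the finished job's result here ...
    if (nsubmitted < NJOBS) {
        pc.submit("runJob", nsubmitted)
        nsubmitted += 1
    }
}
pc.done()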

Remember that you cannot tell which process will execute which submitted item so there must be
enough info in each submit call for the process to discard any out of date aspects of the model and
run the new model and return the result (a complicated result can be returned using post, take, etc.).
The most typical usage is exactly the same model but a small set of different parameters. But in principle
a function can completely clear and recreate an entirely different model involving the reading of
files (but you cannot load the same templates multiple times). A function could write a file as long
as it had an identifying name that does not conflict with any other submit job's writing. I think it is
easier though to communicate through the bulletin board if there is enough memory.

Take some measurements of total run time with different numbers for mpiexec -n <nhost> ...
to verify that the performance scaling is good.
Malbolge

Re: Parallel NEURON/H'ware - batch of individual runs

Post by Malbolge »

Hi...

Just trying to get a look at "nrn/src/parallel/test1.hoc", but I can't seem to get to it. On the UNIX/cluster installation I'm using, I don't think I have access to those directories (I'm a research postgrad; the cluster belongs to the dept.)... also, I'm assuming the src/parallel folder isn't included in Windows installations, as it isn't in either nrn62 or nrn70. Would it be possible for someone to put that code up here, just so I can get a look at the process for keeping the todo list non-empty? In the meantime, I'll try to approach our sys admin and see if he can help.

While I'm asking about this... what will be the effect of putting all my tasks on the board at the same time, versus adding them over time? I'll likely only be submitting 500, or maybe 5,000, tasks with any run of the code. Currently, I'm adding around 5,000 tasks (each task being to produce a vector of 60 random numbers, mirroring the type/volume of data I'll be collecting) and not seeing any errors or slowdown apparent to me. These are a lot simpler than the simulations I'll be running.

Also, on the subject of run time, I'm setting a variable (st) to pc.time just after the call to pc.runworker, and then calling print pc.time - st just before I close the ParallelContext with pc.done. I'm assuming this will be a good measure of the run time after setup / after all processors have the hoc code etc. However, I seem to get a lower value for pc.time - st with lower numbers of nhost. Running the code below with nhost = 6 I get times of around 7 seconds; running with nhost = 60 gives me times around 12-14. The lowest for nhost = 6 was ~6 and the highest ~12; for nhost = 60 the lowest time was ~8 and the highest ~14. I'll look into this a little more over the weekend, but could it be that, just now when I'm simply picking random numbers, the time benefits of spreading the tasks over more hosts are outweighed by the overheads of dealing with the larger number of hosts? So, if I increase the complexity of the submitted tasks, I'll begin to see an improvement in the times for more hosts?

Code: Select all

// Test for BBS-style systems using ParallelContext

{load_file("nrngui.hoc")}
objref bbsRand
objref rsVec
objref pc
pc = new ParallelContext()
//st = pc.time

func bbsTest() {
    id = hoc_ac_                  // userid of this submitted task
    rsVec = new Vector()
    bbsRand = new Random($1)
    bbsRand.normal(0, 5)
    val = 0
    for i = 0, 60 {               // 0..60 inclusive, i.e. 61 values
        val = bbsRand.repick()
        rsVec.append(val)
        //print val
    }
    pc.post(id, rsVec)
    //return rsVec
    return 0
}

myID = pc.id

objref res
pc.runworker()
st = pc.time
objref randVec
objref randList
objref vec
//randVec = new Vector()
randList = new List()

// If only 1 host, cannot run BBS, so exit, do nothing.
if (pc.nhost == 1) {
    print "serial Run... ending"

// If more than 1 host, BBS can be started, so submit the task y times, i.e. have the
// task performed y times. Pass the rank of the task, to be used for random seeds etc.
} else {
    for i = 0, 49999 {
        pc.submit("bbsTest", i + pc.nhost + 49999)
    }

    while ((id = pc.working) != 0) {
        pc.take(id)
        vec = new Vector()
        vec = pc.upkvec
        randList.append(vec)
        print id
        //print pc.retval
        //randList.append(pc.retval)
    }
}

// Final steps: compile all results together and write to file.
res = new File()
res.wopen("rsltshuge.txt")
for c = 0, randList.count() - 1 {
    for d = 0, randList.o[c].size - 1 {
        res.printf("%f\t", randList.o[c].x[d])
    }
    res.printf("\n")
}
res.close()
print pc.time - st
pc.done
quit()
Edit: Answered my above question about the run times. Increasing the complexity of the task leads to a faster run time with larger values of nhost.
hines
Site Admin
Posts: 1691
Joined: Wed May 18, 2005 3:32 pm

Re: Parallel NEURON/H'ware - batch of individual runs

Post by hines »

See the mercurial repository
http://www.neuron.yale.edu/hg
specifically
http://www.neuron.yale.edu/hg/neuron/nr ... /test1.hoc
"What will be the effect of putting all my tasks on the Board at the same time?"
The searching for keys uses binary trees, so it has a complexity of order log(number of items).
Your comment about not seeing slowdown is pertinent. The only thing that counts is the experimental
question of whether the easy way works just as well as stratagems for saving space.

By the way, the BBS is supposed to work with one host the same way as with many. I'd hope there wouldn't be much
slowdown between a serial run and a run on one host using pc.submit and doing all the items itself.

Saving your code to temp.hoc, avoiding the print "serial Run... ending" branch, and running it on my desktop with
nrniv temp.hoc
takes 11 seconds, printing 50000 lines to the terminal and creating a 29025906 byte rsltshuge.txt.
Removing the printing and file writing reveals a computation time of 4.6 seconds, and running that (I have 4 cores) with
mpiexec -n 4 nrniv -mpi temp.hoc
takes 3 s. Not using the bulletin board at all and modifying it to be a serial program (no use of pc.submit) also takes 3 s.

I recommend that you make use of the local statement in functions for local variables.
func bbsTest() {local id, i, val