parallel save/restore state

General issues of interest both for network and
individual cell parallelization.

Moderator: hines

evan

Post by evan »

I'm using Michael Hines' parallel version of the Traub model with NEURON 7.1. My code looks roughly like this:


proc prun() {
	pc.setup_transfer()
	pnm.set_maxstep(10)
	runtime = startsw()
	tdat_.x[0] = pnm.pc.wait_time
	stdinit()

	restorestate($s1)	// load the previously saved state after initialization

	pnm.psolve(tstop)
	tdat_.x[0] = pnm.pc.wait_time - tdat_.x[0]
	runtime = startsw() - runtime
	tdat_.x[1] = pnm.pc.step_time
	tdat_.x[2] = pnm.pc.send_time
	tdat_.x[3] = pc.vtransfer_time(0) // for gaps
	tdat_.x[4] = pc.vtransfer_time(1) // for splitcells
	//      printf("%d wtime %g\n", pnm.myid, waittime)

	savestate($s1)	// write the state reached at tstop
}

// Save the model state, followed by the sequence position of every Random object.
proc savestate() {local i  localobj s, ss, f, rl
	s = new String()
	sprint(s.s, $s1)
	f = new File(s.s)
	ss = new SaveState()
	ss.save()
	ss.fwrite(f, 0)	// second arg 0 leaves the file open for the Random data

	rl = new List("Random")
	f.printf("Random %d\n", rl.count)
	for i=0, rl.count-1 {
		f.printf("%d\n", rl.object(i).seq())
	}
	f.close()
}

// Read the state and the Random sequence positions back, then restore.
proc restorestate() {local i  localobj s, ss, f, rl
	s = new String()
	sprint(s.s, $s1)
	f = new File(s.s)
	ss = new SaveState()
	ss.fread(f, 0)	// second arg 0 leaves the file open for the Random data
	rl = new List("Random")
	print rl.count
	if (f.scanvar() != rl.count) {
		execerror("Random count unexpected", "")
	}
	for i=0, rl.count-1 {
		rl.object(i).seq(f.scanvar())
	}
	f.close()
	ss.restore()
}

prun("savefile.dat")	// called only after the procs above are defined

When I restore in parallel mode, I get the following on each of the nodes:

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[mumbler:04219] *** Process received signal ***
[mumbler:04219] Signal: Aborted (6)
[mumbler:04219] Signal code: (-6)
[mumbler:04219] [ 0] /lib/libpthread.so.0 [0x7f91425ec190]
[mumbler:04219] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f91422a14b5]
[mumbler:04219] [ 2] /lib/libc.so.6(abort+0x180) [0x7f91422a4f50]
[mumbler:04219] [ 3] /usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x115) [0x7f9142ad8cc5]
[mumbler:04219] [ 4] /usr/lib/libstdc++.so.6 [0x7f9142ad70f6]
[mumbler:04219] [ 5] /usr/lib/libstdc++.so.6 [0x7f9142ad7123]
[mumbler:04219] [ 6] /usr/lib/libstdc++.so.6 [0x7f9142ad721e]
[mumbler:04219] [ 7] /usr/lib/libstdc++.so.6(_Znwm+0x7d) [0x7f9142ad76ad]
[mumbler:04219] [ 8] /usr/lib/libstdc++.so.6(_Znam+0x9) [0x7f9142ad7769]
[mumbler:04219] [ 9] /home/evan/neuron/x86_64/lib/libnrniv.so.0(_ZN9SaveState4readEP6OcFilej+0x1ae) [0x7f914593426e]
[mumbler:04219] [10] /home/evan/neuron/x86_64/lib/libnrniv.so.0 [0x7f9145932935]
[mumbler:04219] [11] /home/evan/neuron/x86_64/lib/libnrnoc.so.0(call_ob_proc+0x233) [0x7f9145e1e5b3]
[mumbler:04219] [12] /home/evan/neuron/x86_64/lib/libnrnoc.so.0(hoc_object_component+0x4d7) [0x7f9145e1fd97]
[mumbler:04219] [13] /home/evan/neuron/x86_64/lib/libnrnoc.so.0(hoc_execute+0x56) [0x7f9145e16ef6]
[mumbler:04219] [14] /home/evan/neuron/x86_64/lib/libnrnoc.so.0(hoc_call+0x149) [0x7f9145e1b3a9]
[mumbler:04219] [15] /home/evan/neuron/x86_64/lib/libnrnoc.so.0(hoc_execute+0x56) [0x7f9145e16ef6]
[mumbler:04219] [16] /home/evan/neuron/x86_64/lib/libnrnoc.so.0(hoc_call+0x149) [0x7f9145e1b3a9]
[mumbler:04219] [17] /home/evan/neuron/x86_64/lib/libnrnoc.so.0(hoc_execute+0x56) [0x7f9145e16ef6]
[mumbler:04219] [18] /home/evan/neuron/x86_64/lib/liboc.so.0 [0x7f9145bcc141]
[mumbler:04219] [19] /home/evan/neuron/x86_64/lib/liboc.so.0(hoc_main1+0xc7) [0x7f9145bcd257]
[mumbler:04219] [20] /home/evan/neuron/x86_64/bin/nrniv(ivocmain+0x21f) [0x4022af]
[mumbler:04219] [21] /home/evan/neuron/x86_64/bin/nrniv(main+0x4c) [0x401e2c]
[mumbler:04219] [22] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f914228cabd]
[mumbler:04219] [23] /home/evan/neuron/x86_64/bin/nrniv [0x401c29]
[mumbler:04219] *** End of error message ***


TIA,
evan.
hines
Site Admin
Posts: 1691
Joined: Wed May 18, 2005 3:32 pm

Re: parallel save/restore state

Post by hines »

I'm also experiencing an error with the save/restore process. In my case the error message refers to a format inconsistency between reading and writing. When I resolve this I'll let you know.
hines

Re: parallel save/restore state

Post by hines »

Several bugs in SaveState have been fixed that relate to not saving enough info in the thread version and a problem saving the event queue when the bin queue is used. See:
http://www.neuron.yale.edu/hg/neuron/nr ... 4baaffdc57
Since SaveState needs to know almost all internal structure implementation details, it tends to get broken easily when improvements are made in the rest of NEURON. I'm grateful when people point out the problems.

To make it easier to verify that the nrntraub code is recent enough to have all the parallel extensions, I've made the associated mercurial repository public at:
http://www.neuron.yale.edu/hg/z/models/nrntraub/
The ModelDB version will tend to lag behind this experimental version. The repository now also includes a script that tests the SaveState functionality in the context of the nrntraub implementation.

sh savestatetest.sh

will run the model to 20 ms with 4 processes (see the script to change these values) and save the state at tstop/2. Then the model is run again from tstop/2 to tstop and the spike output is compared. Of course, all spikes after tstop/2 should be identical.

Let me know if you experience any problems. I never saw any error message referring to std::bad_alloc, so it may be that something else is also going wrong.
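
The save-at-tstop/2 check described above can be sketched on a much smaller scale. This is not the savestatetest.sh script from the repository, only a serial illustration under assumed conditions: a single hypothetical hh compartment named soma with an IClamp, fixed time step, and a comparison of the final membrane potential instead of spike output.

```hoc
load_file("stdrun.hoc")	// provides stdinit(), continuerun(), tstop

// Hypothetical one-compartment test model (not part of nrntraub).
create soma
objref stim, ss
soma {
	L = 20
	diam = 20
	insert hh
	stim = new IClamp(0.5)
}
stim.del = 1
stim.dur = 100
stim.amp = 0.3
tstop = 20

// Reference run: save the state at tstop/2, then continue to tstop.
stdinit()
continuerun(tstop/2)
ss = new SaveState()
ss.save()
continuerun(tstop)
vref = soma.v(0.5)

// Second run: initialize, restore the saved state, continue to tstop.
stdinit()
ss.restore()
continuerun(tstop)

if (abs(soma.v(0.5) - vref) < 1e-12) {
	print "restore reproduces the reference run"
} else {
	print "mismatch after restore: ", soma.v(0.5), " vs ", vref
}
```

In a parallel network the same idea applies, with pnm.psolve() in place of continuerun() and the spike output after tstop/2 as the quantity to compare.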
JBall
Posts: 18
Joined: Tue Jun 15, 2010 8:47 pm

Re: parallel save/restore state

Post by JBall »

I am bumping this thread because I have a problem similar to that of the user in the first post. I have a network with plasticity, and I would like to put it through a series of training sessions and test it between sessions. Thinking that the SaveState class would work perfectly for this, I set out to try it. I found I could save the states easily by creating a unique state file for each core. Upon trying to restore the state, however, I get many errors along the same lines as those shown above. Based on your response, do I need to find and edit the SaveState source file, and then recompile NEURON? Thanks in advance for any help you can give.
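
For readers following along, the "unique state file for each core" idea might look like the sketch below. The proc names savestate_rank and restorestate_rank and the "basename.rank" file naming are assumptions made for illustration, not code from this thread, and nothing here establishes that shared files caused the crash in the first post.

```hoc
load_file("stdlib.hoc")	// defines the String class

objref pc
pc = new ParallelContext()

// Hypothetical helper: write this rank's state to <basename>.<rank>.
proc savestate_rank() {localobj s, ss, f
	s = new String()
	sprint(s.s, "%s.%d", $s1, pc.id())	// e.g. state.dat.0, state.dat.1, ...
	f = new File(s.s)
	ss = new SaveState()
	ss.save()
	ss.fwrite(f)	// opens the file, writes the state, and closes it
}

// Hypothetical helper: read this rank's file and restore from it.
// Call it after stdinit() or finitialize(), as in the first post.
proc restorestate_rank() {localobj s, ss, f
	s = new String()
	sprint(s.s, "%s.%d", $s1, pc.id())
	f = new File(s.s)
	ss = new SaveState()
	ss.fread(f)	// opens the file, reads the state, and closes it
	ss.restore()
}
```

Each rank then calls, for example, savestate_rank("state.dat") at the end of a training session and restorestate_rank("state.dat") before the next one. Restoring requires the same number of ranks and the same cell distribution as when the files were written.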
hines

Re: parallel save/restore state

Post by hines »

"do I need to find and edit the savestate source file, and then re-compile neuron?"

Yes. However, diagnosis can be tricky, so, if you prefer, you can send me a zip file of your model that exhibits the error and I'll update the savestate.cpp file. I presume you are using the hg repository version or at least the latest 7.2 alpha version. Also, the save and restore should be set up so that, when they work, it is reasonably clear that a test run after the save and the same test run after the restore give identical results. If you want to go this route, send the zip file to me at michael dot hines at yale dot edu.