[Halld-offline] production issue #2: irreproducibility of runs

David Lawrence davidl at jlab.org
Mon Dec 10 11:54:34 EST 2012


Hi All,

   This is a feature we may want to change, but let me describe how it 
works in mcsmear:

- The random numbers are all generated using ROOT's TRandom2. We derive 
our own class DRandom2 from it just to get access to the seeds, which 
are protected members. See the comments at the top of DRandom2.h for 
details.

- ALL random numbers used in mcsmear should come from the globally 
defined gDRandom object.

- The initial seeds of the generator are derived from a UUID, a number 
built from the current time (in 100 ns units) and the node id. This 
means that by default mcsmear derives a new, all-but-guaranteed-unique 
seed each time it is run. However ...

- At the start of each event, the HDDM file is checked to see if a set 
of seeds is defined. If so, mcsmear replaces the seeds in gDRandom with 
those. Regardless of where the seeds came from, the ones actually used 
are written to the output HDDM file. That way, if the output file is 
later used as input to mcsmear again (e.g. to re-smear), the same seeds 
will be used. A rough sketch of this whole scheme follows.
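
Roughly, the scheme looks like the sketch below. This is only an 
illustration: the accessor names (GetSeeds/SetSeeds), the SeedFromUUID 
helper, and the EventSeeds struct are placeholders, not the actual 
DRandom2/HDDM interface (DRandom2.h is the real reference).

   #include <TRandom2.h>
   #include <TUUID.h>

   class DRandom2 : public TRandom2 {
    public:
      // TRandom2 keeps its state in protected members (fSeed inherited
      // from TRandom, plus fSeed1/fSeed2), so a thin subclass can expose
      // them for saving and restoring.
      void GetSeeds(UInt_t &s1, UInt_t &s2, UInt_t &s3) const {
         s1 = fSeed; s2 = fSeed1; s3 = fSeed2;
      }
      void SetSeeds(UInt_t s1, UInt_t s2, UInt_t s3) {
         fSeed = s1; fSeed1 = s2; fSeed2 = s3;
      }
   };

   // Single global generator that all of the smearing code draws from.
   DRandom2 gDRandom;

   // Default seeding: fold a TUUID (100 ns timestamp + node id) into
   // three 32-bit words so each invocation of mcsmear starts differently.
   void SeedFromUUID() {
      UChar_t u[16];
      TUUID().GetUUID(u);
      UInt_t w[4] = {0, 0, 0, 0};
      for (int i = 0; i < 16; i++) w[i/4] = (w[i/4] << 8) | u[i];
      gDRandom.SetSeeds(w[0] ^ w[3], w[1], w[2]);
   }

   // Stand-in for the seed block stored with each event in the HDDM record.
   struct EventSeeds {
      bool   present;
      UInt_t seed1, seed2, seed3;
   };

   // Per-event bookkeeping: restore any seeds found in the input record,
   // otherwise keep whatever the generator currently holds, and always
   // write the seeds actually used back to the output record.
   void SetSeedsForEvent(const EventSeeds &in, EventSeeds &out) {
      if (in.present)
         gDRandom.SetSeeds(in.seed1, in.seed2, in.seed3); // re-smearing case
      gDRandom.GetSeeds(out.seed1, out.seed2, out.seed3); // record what was used
      out.present = true;
   }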


There are a couple of ways to get the behavior Richard suggests (and I 
think it is a good suggestion):

1.) Have hdgeant fill the random->seed_mcsmear1, 2, 3 ... tags in the 
HDDM file with some seed values

2.) Add a command-line option to mcsmear so a seed can be set by the 
user (or a script) at run time (a rough sketch of this follows)
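
For option 2.), something like the sketch below would do it (the "-r" 
flag name and the seed1:seed2:seed3 format are made up for 
illustration, and it assumes the gDRandom/SetSeeds sketch above):

   #include <cstdio>
   #include <cstring>

   // Scan the command line for a hypothetical "-r seed1:seed2:seed3"
   // option and, if found, override the UUID-based default seeds so that
   // repeated runs with the same value produce identical output.
   void ParseSeedArgs(int narg, char *argv[]) {
      for (int i = 1; i < narg; i++) {
         if (strcmp(argv[i], "-r") == 0 && i + 1 < narg) {
            unsigned int s1, s2, s3;
            if (sscanf(argv[++i], "%u:%u:%u", &s1, &s2, &s3) == 3)
               gDRandom.SetSeeds(s1, s2, s3);
         }
      }
   }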


There is another caveat here: mcsmear can now be run with multiple 
processing threads, though by default only one is used. If more than 
one thread is used, however, it will not be possible to maintain 
reproducibility without rewriting much of the code to use 
thread-specific random number generator objects.
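
Just to make that idea concrete (nothing like this exists or is 
planned, it is only one possible shape for it): each worker thread 
would own its own generator, and reproducibility would come from 
seeding each event's smearing deterministically, e.g. from a base seed 
plus the event number, rather than from a single shared stream.

   #include <TRandom2.h>

   // One generator per thread (C++11 thread_local used purely for
   // illustration; the actual threading code would need its own solution).
   thread_local TRandom2 gThreadRandom;

   // Same (base_seed, event_number) -> same smearing, no matter which
   // thread happens to pick up the event.
   void SeedForEvent(UInt_t base_seed, UInt_t event_number) {
      gThreadRandom.SetSeed(base_seed + event_number);
   }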


Regards,
-David


On 12/9/12 10:28 PM, Richard Jones wrote:
> Hello dc1.1 watchers,
>
> Another issue has emerged in examining the problems with hd_ana 
> hangs: running the same job twice does not produce the same files. 
> This seems like a real defect; too bad we didn't realize it until 
> now. It is essential for chasing down rare events (segfaults, 
> memleaks, bad events) that running a given job a second time leads to 
> the same outcome. I found this is not the case at the level of the 
> output rest files, so I followed it back.
>
>   * the bggen.hddm files are identical
>   * the hdgeant.hddm files are identical
>   * the hdgeant_smeared.hddm files are different
>   * the rest.hddm files are different
>
> It seems that mcsmear is getting entropy from who-knows-where and 
> injecting it into the data stream. There are all sorts of 
> quick-and-dirty ways to do that (mangle the date/time, read from 
> /dev/random, etc.), none of them any good for real production.
>
> This is something to come back to after the run.  For the present 
> purposes, if you were hoping to be able to do triage on jobs that are 
> failing in hd_ana during the first pass and study them later, that is 
> not going to work.
>
> -Richard J.
