[Halld-offline] production issue #2: irreproducibility of runs

Richard Jones richard.t.jones at uconn.edu
Sun Dec 9 22:28:51 EST 2012


Hello dc1.1 watchers,

Another issue has emerged while examining the problems with the hd_ana hangs: running the same job twice does not produce the same files. This looks like a real defect; too bad we didn't notice it until now. For chasing down rare events (segfaults, memory leaks, bad events) it is essential that running a given job a second time leads to the same outcome. I found that this is not the case at the level of the output rest files, so I traced it back through the production chain:

  * the bggen.hddm files are identical
  * the hdgeant.hddm files are identical
  * the hdgeant_smeared.hddm files are different
  * the rest.hddm files are different

It seems that mcsmear is picking up entropy from who-knows-where and injecting it into the data stream.  There are all sorts of quick-and-dirty ways to do that (mangling the date/time into a seed, reading from /dev/random, etc.), none of them any good for real production, where the smearing has to be reproducible.
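
To illustrate the point (this is only a sketch, not the actual mcsmear code; the function names and the idea of deriving the seed from the run and event numbers are my own invention): seeding the smearing generator from the wall clock gives a different sequence on every pass, while deriving the seed from numbers already present in the data stream makes a rerun regenerate exactly the same smearing.

// seed_demo.cc -- illustration only, not taken from mcsmear
#include <cstdint>
#include <ctime>
#include <iostream>
#include <random>

// Non-reproducible: every invocation of the job gets a different sequence.
std::mt19937 make_rng_from_clock() {
    return std::mt19937(static_cast<std::uint32_t>(std::time(nullptr)));
}

// Reproducible: the seed is a deterministic function of the run and event
// numbers, so a second pass over the same job regenerates the same smearing.
std::mt19937 make_rng_from_event(std::uint32_t run, std::uint32_t event) {
    std::seed_seq seq{run, event};
    return std::mt19937(seq);
}

int main() {
    std::mt19937 pass1 = make_rng_from_event(9001, 42);
    std::mt19937 pass2 = make_rng_from_event(9001, 42);
    std::normal_distribution<double> smear1(0.0, 0.1), smear2(0.0, 0.1);
    std::cout << std::boolalpha
              << "same smearing on both passes: "
              << (smear1(pass1) == smear2(pass2)) << std::endl;
    return 0;
}

Something along those lines (or recording the seed in the output record so a failing event can be replayed) would let us rerun a job and hit exactly the same smearing.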

This is something to come back to after the run.  For present purposes, if you were hoping to triage jobs that fail in hd_ana during the first pass and then reproduce them later for study, that is not going to work.

-Richard J.
