[Halld-offline] fix for ZFATAL crashes in HDGeant

Justin Stevens jrsteven at mit.edu
Thu Mar 6 21:10:22 EST 2014


I have a report similar to Paul's from testing the ZFATAL fix at MIT.  I ran 86 jobs of 5K events each with BGRATE 5.50 and BGGATE +/- 800.0, and haven't seen any crashes.
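
For reference, those background settings would correspond to control.in cards roughly like the following.  This is only a sketch based on the card names quoted above; check your own control.in for the exact syntax and units.

    BGRATE  5.50
    BGGATE  -800.  800.
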
-Justin

On Mar 6, 2014, at 3:11 PM, Paul Mattione wrote:

> I've tested this here at CMU: 96 jobs, 5k events each, 2 GB memory, BGRATE 5.50, BGGATE +/- 800.0, REST compression enabled.  The jobs took about 18 hours, had no crashes, and there was only 1 small REST file (of course, it was only 5k events).  Output file sizes are ~820 MB for hdgeant_smeared, 17 MB for REST, and 1.5 MB for hdroot.
> 
> - Paul
> 
> On Mar 5, 2014, at 5:44 PM, Richard Jones wrote:
> 
>> Hello all,
>> 
>> I have checked in a fix for the ZFATAL crashes in HDGeant.  My initial tests show that it is robust.  Please check it out for yourselves at high bg rate.
>> 
>> During my investigations, I also found a subtle bug that was introduced in some but not all of the hit-generation functions in HDGeant, called hitXXX.c, where XXX is one of the subsystems.  If you care to read about it, look at the svn log comments attached to the latest update to hitFDC.c or one of the others.  Simon might want to respond to this.  The desired outcome can be achieved by applying the cut on the maximum hits per channel somewhere else in the simulation chain, either in mcsmear or at hits input to hd_ana.  Once the hits have been ordered in time, blind to the truth information about which track produced them, truncation is perfectly fine.
>> 
>> The above fixes have been applied to both the main trunk and the dc-2 branch.  Please check them out and squeeze them for bugs.  I am still working to reproduce the short hddm output file effect.  What I am getting instead is a few percent of the jobs crashing with the famous "thread killed after XX seconds" error.  I assume we are excluding these cases from the study of jobs producing short files; otherwise it makes little sense to be looking at compression as a possible culprit.
>> 
>> -Richard J.
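
To illustrate Richard's point about applying the per-channel hit cap after time-ordering, here is a minimal sketch in C.  The hit struct, field names, and the cap value are hypothetical stand-ins rather than the actual HDGeant/mcsmear data structures; the sketch only shows the order of operations he describes: sort by time first, blind to truth information, then truncate.

/* Hypothetical sketch: truncate the hits on one channel AFTER sorting by
 * time, without consulting any truth (track) information.  hit_t and
 * MAX_HITS_PER_CHANNEL are illustrative, not the real HDGeant/mcsmear types. */
#include <stdlib.h>

#define MAX_HITS_PER_CHANNEL 100   /* illustrative cap */

typedef struct {
    double t;       /* hit time (ns) */
    double dE;      /* energy deposition */
    int    itrack;  /* truth track id: present, but never consulted below */
} hit_t;

/* qsort comparator: order hits by time only */
static int cmp_hit_time(const void *a, const void *b)
{
    double ta = ((const hit_t *)a)->t;
    double tb = ((const hit_t *)b)->t;
    return (ta > tb) - (ta < tb);
}

/* Sort the channel's hits by time, then keep only the earliest ones.
 * Returns the truncated hit count. */
int truncate_channel_hits(hit_t *hits, int nhits)
{
    qsort(hits, (size_t)nhits, sizeof(hit_t), cmp_hit_time);
    return nhits > MAX_HITS_PER_CHANNEL ? MAX_HITS_PER_CHANNEL : nhits;
}

Applied in mcsmear or at the hits input to hd_ana, as suggested, this keeps the truncation unbiased with respect to which track produced each hit.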