[Halld-offline] fix for ZFATAL crashes in HDGeant
jrsteven at mit.edu
Thu Mar 6 16:11:02 EST 2014
Hi Richard, All,
About the small REST files... I ran some jobs (55 finished so far) where I used the danarest plugin to make REST files with and without compression from the same hdgeant_smeared.hddm file. Below are the file sizes for two of these jobs, one of which is an example of a small REST file.
Compression turned off:
-rw-r--r-- 1 jrsteven jrsteven 110M Mar 6 14:34 rest_102210.hddm
-rw-r--r-- 1 jrsteven jrsteven 110M Mar 6 15:33 rest_102220.hddm
Compression turned on:
-rw-r--r-- 1 jrsteven jrsteven 65M Mar 6 14:34 rest_compressed_102210.hddm
-rw-r--r-- 1 jrsteven jrsteven 43M Mar 6 15:33 rest_compressed_102220.hddm
102210 is the expected result, where the compressed file is a bit under 2x smaller (65M vs. 110M) than the one written with compression turned off.
102220 is an example of a "small REST file" (43M), even though the hd_root job finished as expected, processing all 25K events (as seen in the monitoring histograms and jana output) without any observable crashes.
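So I think we have some evidence that compression is related to the small REST files: the uncompressed file for 102220 is full size (110M), while the compressed file made from the same input is not.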
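For anyone who wants to scan a larger batch of jobs for the same effect, the size comparison above can be sketched as follows. This is only an illustration, not part of danarest or hd_root: the function names, the choice of 102210 as the reference run, and the 20% tolerance are all assumptions, and the sizes are the rounded values from the listing above.

```python
# Sizes copied from the ls listing above, as (uncompressed, compressed).
SIZES = {
    "102210": ("110M", "65M"),
    "102220": ("110M", "43M"),
}

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert an 'ls -h' style size such as '110M' to a number of bytes."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)


def compressed_fraction(uncompressed, compressed):
    """Return compressed size divided by uncompressed size."""
    return parse_size(compressed) / parse_size(uncompressed)


def flag_small(sizes, reference_run, tolerance=0.2):
    """List runs whose compressed fraction is well below the reference run's.

    reference_run and tolerance are hypothetical knobs for this sketch:
    a run is flagged when its fraction is more than `tolerance` (relative)
    below the fraction seen in a run believed to be good.
    """
    ref = compressed_fraction(*sizes[reference_run])
    return [
        run
        for run, pair in sorted(sizes.items())
        if compressed_fraction(*pair) < ref * (1.0 - tolerance)
    ]


if __name__ == "__main__":
    for run, pair in sorted(SIZES.items()):
        print(run, round(compressed_fraction(*pair), 2))
    print("suspect runs:", flag_small(SIZES, "102210"))
```

With the two jobs above the fractions come out to about 0.59 and 0.39, so only 102220 is flagged. Real byte counts (e.g. from ls -l without -h) would be better than the rounded sizes used here.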
PS. These jobs were run with BGRATE 1.10 and BGGATE +/-800 for 25K events each.
On Mar 5, 2014, at 5:44 PM, Richard Jones wrote:
> Hello all,
> I have checked in a fix for the ZFATAL crashes in HDGeant. My initial tests show that it is robust. Please check it out for yourselves at high bg rate.
> During my investigations, I also found a subtle bug that was introduced in some but not all of the hits generation functions in HDGeant, called hitXXX.c, where XXX is one of the subsystems. If you care to read about it, look at the svn log comments that I attached to the latest update to hitFDC.c or one of the others. Simon might want to respond to this. The desired outcome can be achieved by applying the cut on maximum hits per channel somewhere else in the simulation chain, either in mcsmear or at hits input to hd_ana. Once the hits have been ordered in time, blind to the truth information about the track that produced them, then truncation is perfectly fine.
> The above fixes have been applied to both the main trunk code and the dc-2 branch. Please check it out and squeeze it for bugs. I am still working to reproduce the short hddm output file effect. What I am getting instead is a few percent of the jobs crashing with the famous "thread killed after XX seconds" error. I assume that we are excluding these cases from the study of jobs producing short files, otherwise, it kind of makes no sense to be looking at compression as a possible culprit.
> -Richard J.
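The point in the quoted message about truncating only after time ordering can be illustrated with a minimal sketch. This is not code from HDGeant, mcsmear, or hd_ana: the (channel, time, payload) tuple layout, the function name, and the per-channel limit are hypothetical and chosen only for illustration. The selection looks at the channel and the hit time alone, never at any truth information carried in the payload.

```python
from collections import defaultdict


def truncate_hits(hits, max_hits_per_channel):
    """Keep at most max_hits_per_channel earliest hits in each channel.

    hits is an iterable of (channel, time, payload) tuples (a hypothetical
    layout). Hits are grouped by channel and sorted by time before the cut,
    so which hits survive does not depend on the order in which they were
    generated or on anything stored in payload.
    """
    by_channel = defaultdict(list)
    for hit in hits:
        by_channel[hit[0]].append(hit)

    kept = []
    for channel in sorted(by_channel):
        ordered = sorted(by_channel[channel], key=lambda h: h[1])
        kept.extend(ordered[:max_hits_per_channel])
    return kept


if __name__ == "__main__":
    example = [
        (1, 30.0, "track A"),
        (1, 10.0, "background"),
        (1, 20.0, "track B"),
        (2, 5.0, "track A"),
    ]
    for hit in truncate_hits(example, max_hits_per_channel=2):
        print(hit)
```

In this example the hit at 30.0 in channel 1 is dropped because it is the latest in its channel, regardless of which track produced it.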