[Halld-offline] pythia-geant.map, back to basics

Fri Dec 20 10:15:00 EST 2013

Just to follow up on the problem of REST not finishing,
I ran 3000 runs (10k events/run, 30M events total) on
the stanley cluster at IU, where the failure rate was
261 / 5000 = 5.22% when I ran 2 months ago,
mostly due to REST not finishing.

For this round (rep. ver. 12242), I found 3000 REST files,
with no runs failing to produce a REST file. The issue,
which I had suspected to be some kind of network connection failure,
seems to have been completely resolved, thanks to the fix
that Simon put in TRACKING/DTrackFitterKalmanSIMD.cc
for rep. ver. 11918.

I ran hd_dump with no libraries specified over the 3000 REST files
to figure out how many events they contained, and the results are:
1. 2932 runs had  10k events
2.   15 runs had 9999 events
3.   33 runs segfault
4.   18 don't finish
5.    2 finish hd_dump with no more events, but have <10k events
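
For anyone who wants to repeat this kind of bookkeeping, here is a
minimal sketch of how a single run could be sorted into the
"segfault", "don't finish", and "finish" categories above. It is not
the script I used: the function name classify_run, the timeout, and
the category labels are all made up for illustration, and it does not
count events (that would require parsing the hd_dump output, which is
not attempted here). It assumes a POSIX system, where a process killed
by SIGSEGV reports a negative return code to Python.

```python
import signal
import subprocess


def classify_run(cmd, timeout_s):
    """Run cmd and classify how it ended (hypothetical helper).

    Returns one of "hung", "segfault", "failed", "finished".
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The process was still running at the time limit; subprocess.run
        # kills it before raising, so nothing is left behind.
        return "hung"
    # On POSIX, a child killed by signal N has returncode == -N.
    if result.returncode == -signal.SIGSEGV:
        return "segfault"
    if result.returncode != 0:
        return "failed"
    return "finished"
```

The command to pass in would be whatever invocation is used on each
REST file, together with a time limit long enough for a healthy run.
If the program is launched through a shell wrapper instead of
directly, a segfault typically shows up as exit code 139 rather than
a negative return code, so the check would need adjusting.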

In my talk at the offline meeting on November 13, I discussed
finding the message
"ZFATAL called from MZPUSH" for hdgeant runs that did not
process all events, and the message
"Caught HUP signal for thread 0x..."
for mcsmear runs that did not process all events.

These occurred for 10 runs for hdgeant, and 5 runs for mcsmear
out of 5000 total runs. This time around I was not able to
find a single case for either of these in 3000 runs.
This does not seem like a fluke, and I wonder if anybody checked
anything in to fix these problems...

Currently the largest software issue in running bggen events
seems to be items 3. and 4. in the list above.
These are probably the same issues I found before, where in
some cases isolating the single event will give a reproducible
failure, while in others the pile-up of previous events
is necessary to reproduce the result.

Our current success rate is 97.7%, up from 94.8%
two months ago.
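
As a cross-check of the arithmetic, the short sketch below recomputes
both rates from the numbers quoted in this message. It assumes that
"success" means a run with the full 10k events (which is what
reproduces 97.7%), and that the earlier rate is simply one minus the
261 / 5000 failure rate. The dictionary keys are just labels chosen
for illustration.

```python
# Outcome counts for the 3000 REST files, taken from the list above.
counts = {
    "10k events": 2932,
    "9999 events": 15,
    "segfault": 33,
    "did not finish": 18,
    "finished with <10k events": 2,
}

total = sum(counts.values())  # should be 3000

# Assumption: only runs with the full 10k events count as successes.
success_now = counts["10k events"] / total

# Two months ago: 261 failures out of 5000 runs.
success_before = 1 - 261 / 5000

print(f"total runs:     {total}")
print(f"success now:    {100 * success_now:.1f}%")     # 97.7%
print(f"success before: {100 * success_before:.1f}%")  # 94.8%
```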

	Kei

On 12/18/13 2:12 PM, Mark Ito wrote:
> Kei,
>
> On 12/18/2013 11:53 AM, Kei Moriya wrote:
>> I assume this proposed change affects the geantId code that you
>> checked in?
> This does not affect that change. Also note that geantId is a misnomer
> for the change checked in. That was appropriate for the previous
> abandoned approach.
>> At the very least we can assume that the REST not stopping
>> problem is solved.
> Good news!
>>
>> In previous batch runs most of the failures I saw were from
>> either REST getting stuck at one event or not finishing,
>> which I assumed to be a database connection problem,
>> but that seems to be gone.
> A database connection problem would stop a job at the very beginning. It
> would also be obvious in the log file; there would be a complaint about
> getting some sort of constants.
>>
>> I can start analyzing the bggen events I currently have,
>> or if there will be significant changes to the geantId
>> due to the new changes, I can wait to generate/simulate
>> a new batch.
> I would wait for me to check in the new changes. Alternatively, you can
> use the revision of pythia-geant.map attached to my last message, or
> check out the version from revision 3195.
>
>     -- Mark
>