[Halld-offline] jana hangs

David Lawrence davidl at jlab.org
Sat Dec 15 23:23:21 EST 2012


Hi Richard,

   event.xml has the following tag defined:

<random minOccurs="0" maxOccurs="1" seed1="int" seed2="int" 
seed_mcsmear1="int" seed_mcsmear2="int" seed_mcsmear3="int"/>

The three "seed_mcsmearX" values are filled in by mcsmear at the 
beginning of the event to record the seeds used by it. If I had the 
hdgeant.hddm file and just the first event from hdgeant_smeared.hddm, I 
could set the mcsmear seeds and run through the hdgeant.hddm file with 
the same random number sequence that caused the issue when it ran on the 
grid. As I mentioned in my earlier e-mail, the unique seeds used by 
mcsmear when it starts are a feature that frees the user from having to 
supply unique seeds to each job. The seeds are recorded, though, so we 
can replay the same sequence to identify the cause of erroneous 
behavior such as you're seeing. I wouldn't need the entire 
hdgeant_smeared.hddm file, just the first event, which is presumably intact.
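
For illustration, the replay amounts to re-seeding the generator from 
the recorded triple. A minimal sketch of the idea (the struct and 
seeding scheme here are illustrative, not mcsmear's actual interface):

#include <cstdint>
#include <iostream>
#include <random>

// Stand-in for the seed_mcsmear1..3 values read back from the
// "random" tag of the first smeared event.
struct MCSmearSeeds { uint32_t s1, s2, s3; };

// Build a generator from the explicit seed triple. Two generators
// seeded identically produce identical sequences, which is what
// makes the failing pass replayable.
std::mt19937 MakeGenerator(const MCSmearSeeds &s) {
    std::seed_seq seq{s.s1, s.s2, s.s3};
    return std::mt19937(seq);
}

int main() {
    MCSmearSeeds recorded{123456789u, 987654321u, 5551212u};
    std::mt19937 original = MakeGenerator(recorded);
    std::mt19937 replay   = MakeGenerator(recorded);
    std::cout << (original() == replay() ? "sequences match"
                                         : "mismatch") << std::endl;
    return 0;
}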

It sounds like these files are just not available. In hindsight, we 
should probably have added the "random" tag to the rest format as well. 
That would have allowed us to reproduce the full simulation and smearing 
for jobs where problems were identified.
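
For example, the rest template could have carried the same tag 
(hypothetical; it is not there today):

<random minOccurs="0" maxOccurs="1" seed1="int" seed2="int" 
seed_mcsmear1="int" seed_mcsmear2="int" seed_mcsmear3="int"/>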

I went ahead and generated 50k events on Friday and ran them through 
hdgeant. I started a loop to run mcsmear on the hdgeant.hddm file 
repeatedly so hopefully the problem will eventually be reproduced and 
the mcsmear program will hang. I'll have to check it on Monday.
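
For reference, the loop driver is just a retry-until-hang pattern, 
along these lines (a sketch; the mcsmear command line is illustrative, 
not the real production invocation):

#include <cstdio>
#include <ctime>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Re-run mcsmear on the same input until one pass stalls past the
// time limit, i.e. until the hang is reproduced.
int main() {
    const int timeout_s = 3600;   // generous wall-clock limit per pass
    for (int pass = 1; ; ++pass) {
        pid_t pid = fork();
        if (pid == 0) {
            // Command line is illustrative only.
            execlp("mcsmear", "mcsmear", "hdgeant.hddm", (char*)NULL);
            _exit(127);           // exec failed
        }
        time_t start = time(NULL);
        int status = 0;
        while (waitpid(pid, &status, WNOHANG) == 0) {
            if (time(NULL) - start > timeout_s) {
                fprintf(stderr, "pass %d: mcsmear appears hung "
                        "(pid %d); attach a debugger\n", pass, (int)pid);
                return 1;
            }
            sleep(10);
        }
        printf("pass %d finished, status %d\n", pass, WEXITSTATUS(status));
    }
}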

Regards,
-Dave


On 12/15/12 3:40 PM, Richard Jones wrote:
> Dave,
>
> I don't see how to recover that information because the runs are not 
> reproducible from mcsmear forward.  If I rerun that job, I will get 
> the same hdgeant file but then running mcsmear on that hdgeant.hddm 
> file will not retrace the steps that led to the hang.  We are not 
> keeping the hdgeant and mcsmear files, just the rest.hddm output file 
> (which was not generated in this case because mcsmear hung), but I can 
> regenerate the hdgeant.hddm file for you.  I am doing that now.
>
> You also could do it yourself on the ifarm, using Mark's production 
> scripts.  It is file index 1999891.
>
> However, you will probably find that mcsmear runs through it without 
> any problems on a second pass.  In my mind, this is a 
> perfect illustration of why pulling entropy from the environment of a 
> running process is a bad idea in any code one intends to put into 
> production.
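>
> To make the contrast concrete, here is a sketch of the two seeding 
> styles (illustrative, not the actual mcsmear code):
>
> #include <iostream>
> #include <random>
>
> int main() {
>     // Irreproducible: entropy pulled from the running environment,
>     // so a second pass over the same input takes a different path.
>     std::mt19937 env_seeded(std::random_device{}());
>
>     // Reproducible: an explicit seed recorded with the job, so a
>     // failing pass can be replayed exactly.
>     std::mt19937 job_seeded(123456789u);
>
>     std::cout << env_seeded() << " " << job_seeded() << std::endl;
>     return 0;
> }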
>
> -Richard J.
>
>
>
>
> On 12/13/2012 10:13 AM, David Lawrence wrote:
>>
>> Hi Richard,
>>
>>   Is it possible to send or make available the hdgeant.hddm file and 
>> the truncated hdgeant_smeared.hddm file for run 891? I can pull the 
>> seeds for mcsmear from the first event and try reproducing the hang 
>> to see if I can track down the cause. I could actually do it for any 
>> of the hung jobs where you have the files available. Run 891 just 
>> hung sooner than some others, so it might be quicker to debug.
>>
>> Regards,
>> -Dave
>>
>> On 12/12/12 6:01 PM, Richard Jones wrote:
>>> Dave,
>>>
>>> I have collected a bunch of log files for you to browse through.   
>>> Look at
>>>
>>> http://zeus.phys.uconn.edu/halld/gridwork/badlogs/dana-hangs
>>>
>>> These logs come in stderr.N and stdout.N pairs.  They contain all of 
>>> the output from the start of the job, including some inane messages 
>>> from the job setup scripts. The jobs attempt to run the analysis 
>>> chain twice, so you can see what happens when the same thing is 
>>> attempted two times in a row on the same machine, to look for 
>>> reproducibility.
>>>
>>> Generically what you see is the following: hd_ana is running along 
>>> merrily, and then into the error log pops a message like:
>>>
>>> JANA ERROR>> Thread 0 hasn't responded in 30 seconds. 
>>> (run:event=9000:5926) Cancelling ...
>>> JANA ERROR>>Caught HUP signal for thread 0x2aaab3b55940 thread 
>>> exiting...
>>> JANA ERROR>> Launching new thread ...
>>>
>>> Then, after the "Launching new thread" message, everything goes 
>>> dark. The 
>>> process just sits dormant, neither writing to disk nor consuming 
>>> cpu.  You can see the command line used to start hd_ana in the 
>>> stdout file.  As far as I know, there is just one processing thread 
>>> in operation.  In most of these logs, you see the signal sent by the 
>>> batch job monitor after the wall clock has passed the quota for the 
>>> job.  In some of the later jobs, you might also see evidence of the 
>>> signal 15 that my watchdog script uses to try and clear out hung jobs.
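>>>
>>> For reference, the mechanism behind those messages is a heartbeat 
>>> monitor, roughly like this sketch (not JANA's actual code):
>>>
>>> #include <atomic>
>>> #include <chrono>
>>> #include <cstdio>
>>> #include <thread>
>>>
>>> std::atomic<long> heartbeat{0};  // bumped by the worker per event
>>>
>>> void worker() {                  // stand-in for the processing thread
>>>     for (;;) {
>>>         std::this_thread::sleep_for(std::chrono::milliseconds(100));
>>>         heartbeat.fetch_add(1);  // "one event processed"
>>>     }
>>> }
>>>
>>> void monitor() {                 // stand-in for the 30 s watchdog
>>>     long last = heartbeat.load();
>>>     for (;;) {
>>>         std::this_thread::sleep_for(std::chrono::seconds(30));
>>>         long now = heartbeat.load();
>>>         if (now == last) {
>>>             fprintf(stderr, "thread hasn't responded in 30 "
>>>                     "seconds, cancelling ...\n");
>>>             // JANA then signals the stalled thread and launches a
>>>             // replacement; the failure mode above is the
>>>             // replacement never making progress.
>>>         }
>>>         last = now;
>>>     }
>>> }
>>>
>>> int main() {
>>>     std::thread w(worker), m(monitor);
>>>     w.join();                    // runs until the job is killed
>>>     m.join();
>>> }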
>>>
>>> You can also see evidence of an occasional problem reported by the 
>>> bzlib compression library: an invalid compression code has been 
>>> requested.  I should probably look into that after all of this is 
>>> over.  That only happens on certain runs, but from the logs it 
>>> appears to be reproducible.
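>>>
>>> (When someone does look at it, the place to catch that is the read 
>>> call. A minimal bzlib sketch, not the actual hddm I/O code, and the 
>>> file name is illustrative:)
>>>
>>> #include <bzlib.h>
>>> #include <cstdio>
>>>
>>> int main() {
>>>     FILE *f = fopen("hdgeant_smeared.hddm", "rb");
>>>     if (!f) return 1;
>>>     int bzerror;
>>>     BZFILE *bzf = BZ2_bzReadOpen(&bzerror, f, 0, 0, NULL, 0);
>>>     char buf[4096];
>>>     while (bzerror == BZ_OK) {
>>>         BZ2_bzRead(&bzerror, bzf, buf, sizeof buf);
>>>         if (bzerror == BZ_DATA_ERROR || bzerror == BZ_DATA_ERROR_MAGIC) {
>>>             fprintf(stderr, "corrupt bzip2 stream detected\n");
>>>             break;               // report it instead of dying quietly
>>>         }
>>>     }
>>>     BZ2_bzReadClose(&bzerror, bzf);
>>>     fclose(f);
>>>     return 0;
>>> }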
>>>
>>> If you are interested, you can also see logs from the other general 
>>> class of failed jobs, where hdgeant segfaulted. That is reproducible 
>>> and, happily, it appears upstream of mcsmear, where irreproducible 
>>> behavior begins, so it should be quick to find and fix.  A sampling 
>>> of logs of this type is found in the web folder:
>>>
>>> http://zeus.phys.uconn.edu/halld/gridwork/badlogs/hdgeant-segfaults
>>>
>>> -Richard J.
>>>
>>>
>>>
>>>
>>>
>>> On 12/12/2012 5:03 PM, David Lawrence wrote:
>>>>
>>>> Hi Richard,
>>>>
>>>>    At this point, there shouldn't be anything in JANA that looks 
>>>> for things on the web. I know both gxtwist and the hdparsim plugin 
>>>> used curl to grab files from the web, but I don't believe you are 
>>>> using either of those for the data challenge. You probably are also 
>>>> not using the CCDB since that is not yet the default. I'm at a loss 
>>>> as to what might cause a network access if that is what's going on.
>>>>
>>>>   I would really like to get to the bottom of the JANA hangs. A 
>>>> long-term solution that requires a watcher program seems 
>>>> unacceptable to me, at least before we make a strong attempt to 
>>>> track down the cause. Can you give me a little more information:
>>>>
>>>> - What JANA program is hanging?
>>>> - How many processing threads are you using?
>>>> - Is the output just a continuous cycling of the "thread X has not 
>>>> responded ..." messages?
>>>>
>>>> Regards,
>>>> -David
>>>>
>>>>
>>>> On 12/12/12 3:20 PM, Richard Jones wrote:
>>>>> Mark and David,
>>>>>
>>>>> Ok, sounds good.  I was late for the meeting because I triggered a 
>>>>> security flag at Fermilab.  Apparently there were too many offsite 
>>>>> connections per hour to be physics.  Hmmm, what could he be 
>>>>> doing??  It is being worked out.
>>>>>
>>>>> With regard to this, something looks suggestive about the way 
>>>>> hd_ana is hanging in streaks.  It seems to run fine for a while, 
>>>>> then a wave of hangs seems to set in across a site.  Another thing 
>>>>> that looks odd: I have seen a message like "DParticle: loading 
>>>>> constants from the data base", and then - lights out!  Is it 
>>>>> possible that there are secret web accesses going on from inside 
>>>>> the application, something like on-the-fly fetches of data or 
>>>>> documents from remote web servers?  I know it is a long shot, but 
>>>>> this security flag got me to thinking about what I could be doing 
>>>>> that I don't realize.
>>>>>
>>>>> -Richard J.
>>>>>
>>>>
>>>
>>
>
