[Halld-offline] hd_ana hanging after variable number of bggen events

Wed Mar 16 10:32:31 EDT 2011

Hi Richard,

     That's a good theory. However, if that were the case, printing the
two values being compared (Ra and Rb) would show that one of them is NaN.
When we looked at this before (and I repeated the test just now), Matt
and I both found that printing the values avoided the hang/crash, and
there did not seem to be a NaN anywhere, at least none that I saw. I'll
let you confirm that this is what you see too.

I'll keep you posted on any progress I make.

Regards,
-David

On 3/16/11 10:18 AM, Richard Jones wrote:
> David,
>
> Yes, I have reproduced it as well, and as you point out, it can be
> reproduced with just a single event.  I have found the instructions
> that cause the problem, and am checking into why the compare fails to
> return false before the sort runs off the end of the list.  One possibility is
> that there is a NaN in one of the registers, but I have to remember 
> how to dump the floating point stack in the IA32 floating point unit.  
> This compare is being performed by the FUCOMPP instruction, which sets 
> the Unordered flag in the FP status register if you try to do a 
> compare with an "uncomparable value" like NaN.  I don't think the 
> SortInteractions() function checks for that, and the Perp() method 
> probably does a square root which might make a NaN under some 
> circumstances.  I suspect that it is something related, but only have 
> a few minutes here and there to look at it.  Have to go teach, more 
> later today...
>
> -Richard J.
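
For reference, the failure mode Richard describes can be sketched as follows. Every ordered comparison involving a NaN is false, so a plain "a.r < b.r" comparator stops being a strict weak ordering, and std::sort is then allowed to walk past the end of the range. This is a minimal sketch and not the actual SortInteractions() code: the Interaction struct, radius_less() and sort_by_radius() are hypothetical names, and the r member only stands in for whatever Perp() returns.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the objects being sorted; only the radius used
// as the sort key is modelled here (assumption, not the real class).
struct Interaction {
    double r;  // transverse radius, e.g. the result of a Perp()-style call
};

// NaN-aware "less than": every non-NaN radius orders before every NaN, and
// two NaNs compare as equivalent. That keeps the relation a strict weak
// ordering, which std::sort requires of its comparator.
inline bool radius_less(const Interaction& a, const Interaction& b) {
    const bool a_nan = std::isnan(a.r);
    const bool b_nan = std::isnan(b.r);
    if (a_nan || b_nan) return !a_nan && b_nan;
    return a.r < b.r;
}

// Sort by radius and return how many entries carried a NaN key, so the
// caller can log or reject the event.
inline std::size_t sort_by_radius(std::vector<Interaction>& v) {
    std::sort(v.begin(), v.end(), radius_less);
    assert(std::is_sorted(v.begin(), v.end(), radius_less));
    return static_cast<std::size_t>(std::count_if(
        v.begin(), v.end(),
        [](const Interaction& i) { return std::isnan(i.r); }));
}
```

A nonzero return value means the event contained at least one NaN radius, which can be recorded without printing the values inside the comparator.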
>
>
> On 3/16/2011 8:21 AM, David Lawrence wrote:
>> Hi Richard (and offliners),
>>
>>      Just a quick update that I have been able to reproduce the
>> problem and it does appear to be the same problem reported earlier. I
>> have added a note to the issue in Mantis
>> (https://halldnew.jlab.org/mantisbt/view.php?id=38) indicating this.
>> I now have an event which seems to reliably cause the problem, though
>> the exact symptoms seem to vary a little. This is likely due to the
>> sort-gone-crazy problem corrupting memory in a way that causes 
>> different behavior depending on details of how things are laid out 
>> when the corruption occurs.
>>
>>      No real clues yet on the underlying cause. If it is a bug in the 
>> optimizer as Matt suspects, then we'll have to treat it 
>> symptomatically. I'm going to try to get a little more proof that
>> that is the case before resorting to that, however.
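
One way to treat the problem symptomatically can be sketched as below. It rests on an assumption that has not been confirmed in this thread: that the sort misbehaves because the comparator does not see a consistent value for the same element on every call. Under that assumption, each sort key is computed once, stored, and only the stored keys are compared. Hit, perp() and order_by_perp() are hypothetical names and not the DTrackCandidate_factory_CDC code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an object with a transverse position.
struct Hit {
    double x, y;
    double perp() const { return std::sqrt(x * x + y * y); }
};

// Compute each key exactly once and store it, then sort (key, index) pairs.
// The comparison only ever reads the stored doubles, so an element's key is
// the same in every comparison (this is the assumed fix, not a proven one).
inline std::vector<std::size_t> order_by_perp(const std::vector<Hit>& hits) {
    std::vector<std::pair<double, std::size_t>> keyed;
    keyed.reserve(hits.size());
    for (std::size_t i = 0; i < hits.size(); ++i) {
        const double r = hits[i].perp();
        assert(!std::isnan(r));  // a NaN key would still break the ordering
        keyed.emplace_back(r, i);
    }
    // std::pair's operator< compares the key first, then the index.
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::size_t> order;
    order.reserve(keyed.size());
    for (const auto& k : keyed) order.push_back(k.second);
    return order;
}
```

The returned indices give the original positions in increasing order of radius, so the caller can reorder its own container without ever calling perp() inside a comparator.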
>>
>>      We'll have to discuss the backtrace issue some more at the 
>> offline meeting. I may not be understanding all of the details. I 
>> will say for this particular problem, I was able to launch gdb and 
>> attach it to an existing process while it was hung and view the stack 
>> trace for all threads.
>>
>> Regards,
>> -Dave
>>
>> On 3/14/11 8:45 AM, Richard Jones wrote:
>> Dave,
>>
>> Ok, I thought it might be related.  I think the following represents 
>> progress on this front:
>>
>>   1.  It is not a "hang" but a segfault in child thread #2
>>   2.  Segfault in a child thread causes the code to hang.  This seems 
>> to be because we are going through the ROOT signal handling
>> mechanism, which is badly broken in the JANA context.
>>
>> I suggest that the top priority is to fix item #2, by writing your 
>> own signal recovery and backtrace mechanism for the JANA framework.  
>> This seems like a first-order requirement for our analysis framework, 
>> to have a signal recovery and backtrace mechanism with an appropriate 
>> behavior.  Once that is done, tracing other problems, such as item 
>> #1, will be more feasible.
>>
>> -Richard J.
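
A minimal sketch of the kind of handler item #2 asks for, assuming Linux with glibc (backtrace() and backtrace_symbols_fd() are glibc extensions declared in execinfo.h). The names crash_handler() and install_crash_handler() are hypothetical and are not part of JANA or ROOT. A synchronous signal such as SIGSEGV is delivered to the thread that faulted, so the trace printed is that thread's stack.

```cpp
#include <execinfo.h>  // backtrace(), backtrace_symbols_fd(): glibc extensions
#include <signal.h>
#include <unistd.h>

// Hypothetical handler: print the faulting thread's stack to stderr, then
// let the default action terminate the process instead of hanging.
extern "C" void crash_handler(int sig) {
    static const char msg[] = "\n*** fatal signal, backtrace follows:\n";
    const ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;

    void* frames[64];
    const int n = backtrace(frames, 64);
    // Writes straight to the file descriptor without calling malloc().
    backtrace_symbols_fd(frames, n, STDERR_FILENO);

    // SA_RESETHAND has already restored the default disposition, and
    // SA_NODEFER leaves the signal unblocked, so re-raising ends the
    // process with the original signal.
    raise(sig);
}

// Hypothetical installer; returns true if every handler was registered.
bool install_crash_handler() {
    // Call backtrace() once up front so any lazy loading it needs happens
    // here rather than inside the signal handler.
    void* warmup[1];
    backtrace(warmup, 1);

    struct sigaction sa = {};
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND | SA_NODEFER;

    const int sigs[] = {SIGSEGV, SIGBUS, SIGFPE, SIGILL, SIGABRT};
    bool ok = true;
    for (int s : sigs) ok = (sigaction(s, &sa, nullptr) == 0) && ok;
    return ok;
}
```

The handler would have to be installed after anything else that registers handlers for the same signals, otherwise it is simply replaced. Function names only appear in the output for symbols that are visible to the dynamic linker (linking with -rdynamic helps); otherwise raw addresses are printed.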
>>
>>
>>
>>
>> On 3/14/2011 8:37 AM, David Lawrence wrote:
>>
>>
>> Hi Richard,
>>
>>      The DTrackCandidate_factory_CDC::FindThetaZRegression() has come 
>> up before when Kei and Jake reported problems with DANA programs 
>> hanging as far back as December. This led to the "fix" currently used 
>> (though not incorporated into the build system) where optimization is 
>> turned off when compiling DTrackCandidate_factory_CDC.cc. As of yet,
>> we have not been able to identify the bug exactly, because the behavior
>> is not deterministic.
>>
>>      I will take another look at this today to see if I can make some 
>> more headway on the problem.
>>
>> Regards,
>> -Dave
>>
>>
>>
>
>