<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    <br>

    Hi Richard (and offliners),<br>

    <br>

        Just a quick update: I have been able to reproduce the

    problem, and it does appear to be the same one reported earlier.

    I have added a note to the issue in Mantis

    (<a class="moz-txt-link-freetext" href="https://halldnew.jlab.org/mantisbt/view.php?id=38">https://halldnew.jlab.org/mantisbt/view.php?id=38</a>) indicating this.

    I now have an event that seems to trigger the problem reliably,

    though the exact symptoms vary a little. This is likely because the

    sort-gone-crazy problem corrupts memory, so the resulting

    behavior depends on the details of how things are laid out

    when the corruption occurs.<br>

    <br>

        No real clues yet on the underlying cause. If it is a bug in the

    optimizer, as Matt suspects, then we'll have to treat it

    symptomatically. I'm going to try to get a little more evidence

    that this is the case before resorting to that, however.<br>

    <br>

        We'll have to discuss the backtrace issue some more at the

    offline meeting. I may not be understanding all of the details. I

    will say that, for this particular problem, I was able to launch gdb,

    attach it to the existing process while it was hung, and view the

    stack trace for all threads.<br>
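    <br>

        For reference at the meeting, a standalone signal recovery and

    backtrace mechanism of the kind being proposed could look roughly

    like the sketch below. It is only a sketch under stated assumptions:

    Linux with glibc (for backtrace() and backtrace_symbols_fd()), and

    the names dump_backtrace, crash_handler and install_crash_handler

    are hypothetical, not part of JANA or ROOT.<br>

    <br>

```cpp
#include <execinfo.h>  // backtrace(), backtrace_symbols_fd(): glibc extensions
#include <signal.h>    // sigaction(), raise()
#include <string.h>    // memset()
#include <unistd.h>    // write(), STDERR_FILENO

// All names below are hypothetical; none of this is JANA or ROOT API.

// Write the calling thread's stack frames to fd, one per line.
// Returns the number of frames captured.
int dump_backtrace(int fd)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, fd);  // writes directly to fd
    return n;
}

// Handler for fatal signals. It runs on the thread that received the signal.
extern "C" void crash_handler(int sig)
{
    static const char msg[] =
        "\n*** fatal signal, backtrace of faulting thread:\n";
    ssize_t unused = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)unused;
    dump_backtrace(STDERR_FILENO);

    // SA_RESETHAND (set below) has already restored the default action,
    // so re-raising ends the process instead of leaving it hung.
    raise(sig);
}

// Install crash_handler for the usual fatal signals.
// Returns 0 on success, -1 if any sigaction() call fails.
int install_crash_handler()
{
    // The backtrace(3) man page notes that libgcc is loaded on first use,
    // which may allocate; calling it once here keeps that out of the handler.
    void *warmup[1];
    backtrace(warmup, 1);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // one-shot: default action after first hit

    const int sigs[] = { SIGSEGV, SIGBUS, SIGFPE, SIGILL, SIGABRT };
    for (unsigned i = 0; i < sizeof(sigs) / sizeof(sigs[0]); ++i) {
        if (sigaction(sigs[i], &sa, NULL) != 0) return -1;
    }
    return 0;
}
```

    <br>

    The idea would be to call install_crash_handler() once at startup,

    before any worker threads are launched. Signal dispositions are

    process-wide, so a segfault in a child thread would print that

    thread's stack and then terminate the program. Whether this can

    coexist with the ROOT handlers is something we would still need to

    check.<br>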

    <br>

    Regards,<br>

    -Dave<br>

    <br>

    On 3/14/11 8:45 AM, Richard Jones wrote:

    <blockquote cite="mid:4D7E0DE8.20908@uconn.edu" type="cite"> Dave,<br>

      <br>

      Ok, I thought it might be related.  I think the following

      represents progress on this front:<br>

      <ol>

        <li>It is not a "hang" but a segfault in child thread #2</li>

        <li>A segfault in a child thread causes the code to hang.  This

          seems to be because we are going through the ROOT signal

          handling mechanism, which is badly broken in the JANA context.</li>

      </ol>

      I suggest that the top priority is to fix item #2, by writing your

      own signal recovery and backtrace mechanism for the JANA

      framework.  This seems like a first-order requirement for our

      analysis framework, to have a signal recovery and backtrace

      mechanism with an appropriate behavior.  Once that is done,

      tracing other problems, such as item #1, will be more feasible.<br>

      <br>

      -Richard J.<br>

      <br>


      On 3/14/2011 8:37 AM, David Lawrence wrote:<br>

      <br>

      <blockquote type="cite" cite="mid:4D7E0BEC.8080901@jlab.org">

        <pre wrap="">Hi Richard,

    DTrackCandidate_factory_CDC::FindThetaZRegression() has come up before: Kei and Jake reported problems with DANA programs hanging as far back as December. This led to the "fix" currently used (though not incorporated into the build system), where optimization is turned off when compiling DTrackCandidate_factory_CDC.cc. So far, we have not been able to identify the bug exactly because the behavior is not deterministic.

    I will take another look at this today to see if I can make some more headway on the problem.

Regards,

-Dave

</pre>

      </blockquote>

      <br>

    </blockquote>

  </body>

</html>