<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
Hi All,<br>
<br>
There is a way to treat this symptomatically, though it will not
tell us why this is happening.<br>
<br>
The configuration parameter JANA:MAX_RELAUNCH_THREADS can be used to
set the "Max. number of times to relaunch a thread due to it timing
out before forcing program to quit."<br>
<br>
(You can get a list of defined configuration parameters for a job by
adding the --dumpconfig argument to most jana/dana programs).<br>
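<br>
For example, one might set it on the command line when starting the
job. This is only a sketch: the -Pname=value syntax is the standard
way to pass JANA configuration parameters, but the value of 10 and
the input file name here are placeholders, not a recommendation.<br>
<pre wrap="">hd_ana -PJANA:MAX_RELAUNCH_THREADS=10 my_events.hddm</pre>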
<br>
Regards,<br>
-David<br>
<br>
<br>
<div class="moz-cite-prefix">On 12/9/12 8:58 PM, Richard Jones
wrote:<br>
</div>
<blockquote cite="mid:50C541AA.2010008@uconn.edu" type="cite"> Hello
dc1.1 watchers,<br>
<br>
I have run into a problem that I think we will need to solve
somehow. I have been seeing good cpu allocation on the grid,
stabilizing at around 5000 cpus over the weekend. However, I
noticed on Saturday that the rate at which jobs were finishing was
as if only 4000 cpus were running, and that rate has continued to drop
over the weekend. I investigated and found that a certain
fraction of the jobs are getting stuck in the middle of the dana
processing step, and just hanging in mid-air. There are a few
jobs that blow the memory budget and crash out, but not these
ones. These jobs do not consume memory or even continue to spin
cpu, but neither do they exit. They just go to sleep in some
funny way, and remain in that state until someone comes along and
kills them. This is what I see in the logs just prior to sending
the kill signal:<br>
<br>
<b id="internal-source-marker_0.4436299267690629"><b>stderr:</b></b><br>
<b id="internal-source-marker_0.4436299267690629">JANA
ERROR>> Thread 0 hasn't responded in 300 seconds.
(run:event=9000:37847) Cancelling ...<br>
JANA ERROR>>Caught HUP signal for thread 0x411d9940 thread
exiting...<br>
JANA ERROR>> Launching new thread ...<br>
<br>
<b>stdout:<br>
</b></b>... the normal old ticker display, frozen at the point
where the problem occurred...<b
id="internal-source-marker_0.4436299267690629"><b><br>
<br>
</b></b>The grid batch system does not kill the job until its
cpu allocation expires, normally 24 hours on most sites. So my
jobs that normally take 7 hours and change are hanging around for
24 hours and then getting evicted to start again. I have not
watched long enough to see if the same job hangs in the same place
the second time around. But when I checked this evening I found
that more than 1000 of my running jobs were in this state. <br>
<br>
Here is the question: do we need to launch an auditor shell script
to watch the hd_ana process and kill it if it hangs? I can do
this, but I am not sure if this is the recommended solution. It
seems like there is already a watcher thread inside JANA that is
printing the error listed above to stderr, so would this be a
watcher-watcher? Then what if the watcher-watcher hangs? I
suppose one could have a watch-watcher-watcher that watchers the
watchering watch. In case you were wondering about that
reference, Mark, that is Dr Seuss, a step up from Seinfeld.<br>
<br>
-Richard J.<br>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Halld-offline mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Halld-offline@jlab.org">Halld-offline@jlab.org</a>
<a class="moz-txt-link-freetext" href="https://mailman.jlab.org/mailman/listinfo/halld-offline">https://mailman.jlab.org/mailman/listinfo/halld-offline</a></pre>
</blockquote>
<br>
</body>
</html>