<html>
<head>
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hello dc1.1 watchers,<br>
<br>
I have run into a problem that I think we will need to solve
somehow. I have been seeing good cpu allocation on the grid,
stabilizing at around 5000 cpus over the weekend. However, I noticed
on Saturday that jobs were finishing at a rate you would expect from
only about 4000 cpus, and that effective rate has continued to drop
over the weekend. I investigated, and found that a certain fraction of the
jobs are getting stuck in the middle of the dana processing step,
and just hanging in mid-air. There are a few jobs that blow the
memory budget and crash out, but not these ones. These jobs do not
consume memory or even continue to spin cpu, but neither do they
exit. They just go to sleep in some funny way, and remain in that
state until someone comes along and kills them. This is what I see
in the logs just prior to sending the kill signal:<br>
<br>
<b id="internal-source-marker_0.4436299267690629"><b>stderr:</b></b><br>
<b id="internal-source-marker_0.4436299267690629">JANA
ERROR>> Thread 0 hasn't responded in 300 seconds.
(run:event=9000:37847) Cancelling ...<br>
JANA
ERROR>>Caught HUP signal for thread 0x411d9940 thread
exiting...<br>
JANA
ERROR>> Launching new thread ...<br>
<br>
<b>stdout:<br>
</b></b>... the normal old ticker display, frozen at the point
where the problem occurred...<b id="internal-source-marker_0.4436299267690629"><b><br>
<br>
</b></b>The grid batch system does not kill the job until its cpu
allocation expires, normally 24 hours on most sites. So my jobs
that normally take 7 hours and change are hanging around for 24
hours and then getting evicted to start again. I have not watched
long enough to see if the same job hangs in the same place the
second time around. But when I checked this evening I found that
more than 1000 of my running jobs were in this state. <br>
<br>
Here is the question: do we need to launch an auditor shell script
to watch the hd_ana process and kill it if it hangs? I can do that,
but I am not sure whether it is the recommended solution. It seems
like there is already a watcher thread inside JANA that is printing
the error listed above to stderr, so would this be a
watcher-watcher? Then what if the watcher-watcher hangs? I suppose
one could have a watch-watcher-watcher that watchers the watchering
watch. In case you were wondering about that reference, Mark, that
is Dr Seuss, a step up from Seinfeld.<br>
<br>
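To make the question concrete, here is roughly the kind of auditor I
have in mind, sketched in Python: a small wrapper that launches
hd_ana, polls its cpu time through /proc, and kills it if the cpu
time stops advancing for some stall limit. The limits below and the
idea of using cpu-time progress as the hang test are just my
assumptions, not anything JANA or the batch system provides.<br>
<pre>
#!/usr/bin/env python
# Sketch of an auditor wrapper for hd_ana (Linux only).
# Assumption: a job whose cpu time stops advancing for STALL_LIMIT
# seconds is hung; the limits here are made-up numbers.
import os, signal, subprocess, sys, time

STALL_LIMIT = 600   # seconds with no cpu-time progress before we kill
POLL        = 60    # how often to check, in seconds

def cpu_seconds(pid):
    # utime + stime from /proc/PID/stat, converted from clock ticks
    with open("/proc/%d/stat" % pid) as f:
        fields = f.read().split()
    return (int(fields[13]) + int(fields[14])) / float(os.sysconf("SC_CLK_TCK"))

proc = subprocess.Popen(["hd_ana"] + sys.argv[1:])
last_cpu, last_change = 0.0, time.time()

while proc.poll() is None:
    time.sleep(POLL)
    try:
        cpu = cpu_seconds(proc.pid)
    except (IOError, OSError):
        break                              # process already exited
    if cpu > last_cpu:
        last_cpu, last_change = cpu, time.time()
    elif time.time() - last_change > STALL_LIMIT:
        proc.send_signal(signal.SIGKILL)   # hung: no cpu progress
        break

sys.exit(proc.wait())
</pre>
I picked cpu-time progress rather than watching the ticker output
because these jobs stop spinning cpu entirely when they hang, but
that choice is also just a guess.<br>
<br>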
-Richard J.<br>
</body>
</html>