<html>

  <head>

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Hello dc1.1 watchers,<br>

    <br>

    I have run into a problem that I think we will need to solve

    somehow.&nbsp; I have been seeing good cpu allocation on the grid,

    stabilizing at around 5000 cpus over the weekend.&nbsp; However, I noticed

    on Saturday that the rate at which jobs were finishing was as if only

    4000 cpus were running, and this has continued to drop over the

    weekend.&nbsp; I investigated, and found that a certain fraction of the

    jobs are getting stuck in the middle of the dana processing step,

    and just hanging in mid-air.&nbsp; There are a few jobs that blow the

    memory budget and crash out, but not these ones.&nbsp; These jobs do not

    consume memory or even continue to spin cpu, but neither do they

    exit.&nbsp; They just go to sleep in some funny way, and remain in that

    state until someone comes along and kills them.&nbsp; This is what I see

    in the logs just prior to sending the kill signal:<br>

    <br>

    <b id="internal-source-marker_0.4436299267690629"><b>stderr:</b></b><br>

    <b id="internal-source-marker_0.4436299267690629">JANA

        ERROR&gt;&gt; Thread 0 hasn't responded in 300 seconds.

        (run:event=9000:37847) Cancelling ...<br>

      JANA

        ERROR&gt;&gt;Caught HUP signal for thread 0x411d9940 thread

        exiting...<br>

      JANA

        ERROR&gt;&gt; Launching new thread ...<br>

      <br>

      <b>stdout:<br>

      </b></b>... the normal old ticker display, frozen at the point

    where the problem occurred...<br>

    <br>

    The grid batch system does not kill the job until its cpu

    allocation expires, normally 24 hours on most sites.&nbsp; So my jobs

    that normally take 7 hours and change are hanging around for 24

    hours and then getting evicted to start again.&nbsp; I have not watched

    long enough to see if the same job hangs in the same place the

    second time around.&nbsp; But when I checked this evening I found that

    more than 1000 of my running jobs were in this state.&nbsp; <br>

    <br>

    Here is the question: do we need to launch an auditor shell script

    to watch the hd_ana process and kill it if it hangs?&nbsp; I can do this,

    but I am not sure if this is the recommended solution.&nbsp; It seems

    like there is already a watcher thread inside JANA that is printing

    the error listed above to stderr, so would this be a

    watcher-watcher?&nbsp; Then what if the watcher-watcher hangs?&nbsp; I suppose

    one could have a watch-watcher-watcher that watchers the watchering

    watch.&nbsp; In case you were wondering about that reference, Mark, that

    is Dr. Seuss, a step up from Seinfeld.<br>
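    <br>

    A minimal sketch of such an auditor, in Python rather than shell, under some loudly stated assumptions: a healthy hd_ana keeps writing its ticker to stdout, so a long silence is taken as the sign of a hang. The names run_with_watchdog, stall_seconds, poll_seconds and grace_seconds are hypothetical, and nothing here is a JANA feature. It only illustrates the outer watcher idea and is not a recommended solution.<br>

    <pre>
```python
import os
import subprocess
import sys
import threading
import time


def run_with_watchdog(cmd, stall_seconds, poll_seconds=1.0, grace_seconds=5.0):
    """Run cmd and stop it if it produces no output for stall_seconds.

    Returns (exit_code, stalled). All names here are illustrative only.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    fd = proc.stdout.fileno()
    last_output = [time.monotonic()]  # shared with the reader thread

    def pump():
        # Read raw chunks so a ticker that redraws with carriage returns
        # (no newline) still counts as activity.
        while True:
            chunk = os.read(fd, 4096)
            if chunk == b"":
                break  # the child closed its end of the pipe
            last_output[0] = time.monotonic()
            sys.stdout.write(chunk.decode(errors="replace"))
            sys.stdout.flush()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    stalled = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] >= stall_seconds:
            # Assumption: silence means the job is hung, so end it early
            # instead of waiting for the batch system to evict it.
            stalled = True
            proc.terminate()
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if the polite signal is ignored
            break
        time.sleep(poll_seconds)

    proc.wait()
    reader.join(timeout=grace_seconds)
    return proc.returncode, stalled
```
    </pre>

    A job wrapper could call run_with_watchdog with the usual hd_ana command line and a stall_seconds comfortably longer than the 300 seconds that the JANA watcher thread waits, then flag or resubmit the job when stalled comes back true. This watcher has no watcher of its own: if it hangs too, the batch system's 24-hour limit is still the last line of defense.<br>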

    <br>

    -Richard J.<br>

  </body>

</html>