[Halld-offline] frozen jobs - is a watchdog monitor required for jana apps?

Richard Jones richard.t.jones at uconn.edu
Sun Dec 9 20:58:02 EST 2012


Hello dc1.1 watchers,

I have run into a problem that I think we will need to solve somehow.  I have been seeing good cpu allocation on the grid, stabilizing at around 5000 cpus over the weekend.  However I noticed on Saturday that the rate at which jobs are finishing was as if only 4000 cpus were running, and this has continued to drop over the weekend.  I investigated, and found that a certain fraction of the jobs are getting stuck in the middle of the dana processing step, and just hanging in mid-air.  There are a few jobs that blow the memory budget and crash out, but these frozen jobs are a different case.  These jobs do not consume memory or even continue to spin cpu, but neither do they exit.  They just go to sleep in some funny way, and remain in that state until someone comes along and kills them.  This is what I see in the logs just prior to sending the kill signal:

stderr:

JANA ERROR>> Thread 0 hasn't responded in 300 seconds. (run:event=9000:37847) Cancelling ...
JANA ERROR>>Caught HUP signal for thread 0x411d9940 thread exiting...
JANA ERROR>> Launching new thread ...

stdout:

... the normal old ticker display, frozen at the point where the problem occurred ...

The grid batch system does not kill the job until its cpu allocation expires, normally 24 hours on most sites.  So my jobs that normally take 7 hours and change are hanging around for 24 hours and then getting evicted to start again.  I have not watched long enough to see if the same job hangs in the same place the second time around.  But when I checked this evening I found that more than 1000 of my running jobs were in this state.

Here is the question: do we need to launch an auditor shell script to watch the hd_ana process and kill it if it hangs?  I can do this, but I am not sure if this is the recommended solution.  It seems like there is already a watcher thread inside JANA that is printing the error listed above to stderr, so would this be a watcher-watcher?  Then what if the watcher-watcher hangs?  I suppose one could have a watch-watcher-watcher that watchers the watchering watch.  In case you were wondering about that reference, Mark, that is Dr Seuss, a step up from Seinfeld.
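For concreteness, the kind of auditor I have in mind is a thin wrapper that launches hd_ana, polls its cumulative cpu time from /proc, and kills it if that clock stops advancing for some number of minutes.  Below is a rough sketch; the command line, poll interval, and stall threshold are placeholders, not values I have settled on.

#!/usr/bin/env python
# Sketch of an auditor wrapper: kill the dana step if its cpu clock stalls.
# The command line and the timeouts are placeholders for illustration only.
import os, signal, subprocess, time

CMD = ["hd_ana", "input.evio"]     # placeholder command line
POLL_SECONDS = 60                  # how often to sample cpu time
STALL_SECONDS = 900                # declare a hang after this much cpu inactivity

def cpu_seconds(pid):
    # cumulative user+system cpu time of pid, read from /proc/<pid>/stat (Linux)
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    fields = data[data.rindex(")") + 2:].split()   # skip the pid and (comm) fields
    ticks = int(fields[11]) + int(fields[12])      # utime + stime
    return ticks / float(os.sysconf("SC_CLK_TCK"))

proc = subprocess.Popen(CMD)
last_cpu, last_progress = 0.0, time.time()
while proc.poll() is None:
    time.sleep(POLL_SECONDS)
    try:
        cpu = cpu_seconds(proc.pid)
    except (IOError, OSError):
        break                                      # process exited between polls
    if cpu > last_cpu:
        last_cpu, last_progress = cpu, time.time() # still making progress
    elif time.time() - last_progress > STALL_SECONDS:
        proc.send_signal(signal.SIGTERM)           # asleep: no cpu, no exit
        time.sleep(30)
        if proc.poll() is None:
            proc.kill()                            # escalate if TERM was ignored
        break
status = proc.wait()                               # reap the child either way

Of course this just moves the question one level up, since the wrapper itself has to stay awake, which is the watcher-watcher problem I mention above.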

-Richard J.
