<html>
<head>
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hello dc1.1 watchers,<br>
<br>
I have run into a problem that I think we will need to solve
somehow. I have been seeing good cpu allocation on the grid,
stabilizing at around 5000 cpus over the weekend. However, I noticed
on Saturday that jobs were finishing at a rate you would expect from
only about 4000 cpus, and that effective rate has continued to drop
over the weekend. I investigated, and found that a certain fraction of the
jobs are getting stuck in the middle of the dana processing step,
and just hanging in mid-air. There are a few jobs that blow the
memory budget and crash out, but not these ones. These jobs do not
consume memory or even continue to spin cpu, but neither do they
exit. They just go to sleep in some funny way, and remain in that
state until someone comes along and kills them. This is what I see
in the logs just prior to sending the kill signal:<br>
<br>
<b id="internal-source-marker_0.4436299267690629"><b>stderr:</b></b><br>
<b id="internal-source-marker_0.4436299267690629">JANA
ERROR>> Thread 0 hasn't responded in 300 seconds.
(run:event=9000:37847) Cancelling ...<br>
JANA
ERROR>>Caught HUP signal for thread 0x411d9940 thread
exiting...<br>
JANA
ERROR>> Launching new thread ...<br>
<br>
<b>stdout:<br>
</b></b>... the normal old ticker display, frozen at the point
where the problem occurred...<b id="internal-source-marker_0.4436299267690629"><b><br>
<br>
</b></b>The grid batch system does not kill the job until its cpu
allocation expires, normally 24 hours on most sites. So my jobs
that normally take 7 hours and change are hanging around for 24
hours and then getting evicted to start again. I have not watched
long enough to see if the same job hangs in the same place the
second time around. But when I checked this evening I found that
more than 1000 of my running jobs were in this state. <br>
<br>
Here is the question: do we need to launch an auditor shell script
to watch the hd_ana process and kill it if it hangs? I can do that,
but I am not sure whether it is the recommended solution. It seems
like there is already a watcher thread inside JANA that is printing
the error listed above to stderr, so would this be a
watcher-watcher? Then what if the watcher-watcher hangs? I suppose
one could have a watch-watcher-watcher that watchers the watchering
watch. In case you were wondering about that reference, Mark, that
is Dr Seuss, a step up from Seinfeld.<br>
<br>
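To make the question concrete, here is roughly the kind of auditor I
have in mind, sketched in Python: a small wrapper that launches
hd_ana, polls its cpu time through /proc, and kills it if the cpu
time stops advancing for some stall limit. The limits below and the
idea of using cpu-time progress as the hang test are just my
assumptions, not anything JANA or the batch system provides.<br>
<pre>
#!/usr/bin/env python
# Sketch of an auditor wrapper for hd_ana (Linux only).
# Assumption: a job whose cpu time stops advancing for STALL_LIMIT
# seconds is hung; the limits here are made-up numbers.
import os, signal, subprocess, sys, time

STALL_LIMIT = 600   # seconds with no cpu-time progress before we kill
POLL        = 60    # how often to check, in seconds

def cpu_seconds(pid):
    # utime + stime from /proc/PID/stat, converted from clock ticks
    with open("/proc/%d/stat" % pid) as f:
        fields = f.read().split()
    return (int(fields[13]) + int(fields[14])) / float(os.sysconf("SC_CLK_TCK"))

proc = subprocess.Popen(["hd_ana"] + sys.argv[1:])
last_cpu, last_change = 0.0, time.time()

while proc.poll() is None:
    time.sleep(POLL)
    try:
        cpu = cpu_seconds(proc.pid)
    except (IOError, OSError):
        break                              # process already exited
    if cpu > last_cpu:
        last_cpu, last_change = cpu, time.time()
    elif time.time() - last_change > STALL_LIMIT:
        proc.send_signal(signal.SIGKILL)   # hung: no cpu progress
        break

sys.exit(proc.wait())
</pre>
I picked cpu-time progress rather than watching the ticker output
because these jobs stop spinning cpu entirely when they hang, but
that choice is also just a guess.<br>
<br>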
-Richard J.<br>
</body>
</html>