<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
Hi All,<br>
<br>
There is a way to treat this symptomatically, though it will not
tell us why this is happening.<br>
<br>
The configuration parameter JANA:MAX_RELAUNCH_THREADS can be used to
set the "Max. number of times to relaunch a thread due to it timing
out before forcing program to quit."<br>
<br>
(You can get a list of defined configuration parameters for a job by
adding the --dumpconfig argument to most jana/dana programs).<br>
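<br>
For example, one might set it on the command line when starting the
job. This is only a sketch: the -Pname=value syntax is the standard
way to pass JANA configuration parameters, but the value of 10 and
the input file name here are placeholders, not a recommendation.<br>
<pre wrap="">hd_ana -PJANA:MAX_RELAUNCH_THREADS=10 my_events.hddm</pre>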
<br>
Regards,<br>
-David<br>
<br>
<br>
<div class="moz-cite-prefix">On 12/9/12 8:58 PM, Richard Jones
wrote:<br>
</div>
<blockquote cite="mid:50C541AA.2010008@uconn.edu" type="cite"> Hello
dc1.1 watchers,<br>
<br>
I have run into a problem that I think we will need to solve
somehow. I have been seeing good cpu allocation on the grid,
stabilizing at around 5000 cpus over the weekend. However, I
noticed on Saturday that the rate at which jobs were finishing was
as if only 4000 cpus were running, and that rate has continued to drop
over the weekend. I investigated and found that a certain
fraction of the jobs are getting stuck in the middle of the dana
processing step, and just hanging in mid-air. There are a few
jobs that blow the memory budget and crash out, but not these
ones. These jobs do not consume memory or even continue to spin
cpu, but neither do they exit. They just go to sleep in some
funny way, and remain in that state until someone comes along and
kills them. This is what I see in the logs just prior to sending
the kill signal:<br>
<br>
<b id="internal-source-marker_0.4436299267690629"><b>stderr:</b></b><br>
<b id="internal-source-marker_0.4436299267690629">JANA
ERROR>> Thread 0 hasn't responded in 300 seconds.
(run:event=9000:37847) Cancelling ...<br>
JANA ERROR>>Caught HUP signal for thread 0x411d9940 thread
exiting...<br>
JANA ERROR>> Launching new thread ...<br>
<br>
<b>stdout:<br>
</b></b>... the normal old ticker display, frozen at the point
where the problem occurred...<b
id="internal-source-marker_0.4436299267690629"><b><br>
<br>
</b></b>The grid batch system does not kill the job until its
cpu allocation expires, normally 24 hours on most sites. So my
jobs that normally take 7 hours and change are hanging around for
24 hours and then getting evicted to start again. I have not
watched long enough to see if the same job hangs in the same place
the second time around. But when I checked this evening I found
that more than 1000 of my running jobs were in this state. <br>
<br>
Here is the question: do we need to launch an auditor shell script
to watch the hd_ana process and kill it if it hangs? I can do
this, but I am not sure if this is the recommended solution. It
seems like there is already a watcher thread inside JANA that is
printing the error listed above to stderr, so would this be a
watcher-watcher? Then what if the watcher-watcher hangs? I
suppose one could have a watch-watcher-watcher that watchers the
watchering watch. In case you were wondering about that
reference, Mark, that is Dr Seuss, a step up from Seinfeld.<br>
<br>
-Richard J.<br>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Halld-offline mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Halld-offline@jlab.org">Halld-offline@jlab.org</a>
<a class="moz-txt-link-freetext" href="https://mailman.jlab.org/mailman/listinfo/halld-offline">https://mailman.jlab.org/mailman/listinfo/halld-offline</a></pre>
</blockquote>
<br>
</body>
</html>