[Halld-offline] frozen jobs - is a watchdog monitor required for jana apps?
David Lawrence
davidl at jlab.org
Sun Dec 9 22:48:00 EST 2012
Hi All,
There is a way to treat this symptomatically, though it will not tell
us why this is happening.
The configuration parameter JANA:MAX_RELAUNCH_THREADS can be used to set
the "Max. number of times to relaunch a thread due to it timing out
before forcing program to quit."
(You can get a list of defined configuration parameters for a job by
adding the --dumpconfig argument to most jana/dana programs).
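For example, the parameter can be set on the command line with the usual -P
option (the value 5 below is only an illustration, not a tuned recommendation):

   hd_ana -PJANA:MAX_RELAUNCH_THREADS=5 <your usual arguments and input file>

Running once with --dumpconfig first will show what the parameter is
currently set to on your build.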
Regards,
-David
On 12/9/12 8:58 PM, Richard Jones wrote:
> Hello dc1.1 watchers,
>
> I have run into a problem that I think we will need to solve somehow.
> I have been seeing good cpu allocation on the grid, stabilizing at
> around 5000 cpus over the weekend. However I noticed on Saturday that
> the rate at which jobs are finishing was as if only 4000 cpus were
> running, and this has continued to drop over the weekend. I
> investigated, and found that a certain fraction of the jobs are
> getting stuck in the middle of the dana processing step, and just
> hanging in mid-air. There are a few jobs that blow the memory budget
> and crash out, but not these ones. These jobs do not consume memory
> or even continue to spin cpu, but neither do they exit. They just go
> to sleep in some funny way, and remain in that state until someone
> comes along and kills them. This is what I see in the logs just prior
> to sending the kill signal:
>
> stderr:
> JANA ERROR>> Thread 0 hasn't responded in 300 seconds.
> (run:event=9000:37847) Cancelling ...
> JANA ERROR>>Caught HUP signal for thread 0x411d9940 thread exiting...
> JANA ERROR>> Launching new thread ...
>
> stdout:
> ... the normal old ticker display, frozen at the point where the
> problem occurred...
>
> The grid batch system does not kill the job until its cpu allocation
> expires, normally 24 hours on most sites. So my jobs that normally
> take 7 hours and change are hanging around for 24 hours and then
> getting evicted to start again. I have not watched long enough to see
> if the same job hangs in the same place the second time around. But
> when I checked this evening I found that more than 1000 of my running
> jobs were in this state.
>
> Here is the question: do we need to launch an auditor shell script to
> watch the hd_ana process and kill it if it hangs? I can do this, but
> I am not sure if this is the recommended solution. It seems like
> there is already a watcher thread inside JANA that is printing the
> error listed above to stderr, so would this be a watcher-watcher?
> Then what if the watcher-watcher hangs? I suppose one could have a
> watch-watcher-watcher that watches the watchering watch. In case you
> were wondering about that reference, Mark, that is Dr Seuss, a step up
> from Seinfeld.
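>
> In case a concrete example helps the discussion: below is a minimal,
> untested sketch of such an auditor in python. It launches the payload
> command, polls the accumulated cpu time in /proc, and kills the process
> if the counter stops advancing. The 900 s and 60 s thresholds are
> placeholders, not tuned numbers.
>
> #!/usr/bin/env python
> # Auditor sketch (untested): launch the payload, poll its accumulated
> # CPU time via /proc, and kill it if the counter has not advanced for
> # STALL_LIMIT seconds.
> import signal, subprocess, sys, time
>
> STALL_LIMIT = 900   # seconds with no CPU progress before calling it hung
> POLL        = 60    # polling interval in seconds
>
> def cpu_ticks(pid):
>     # utime + stime (fields 14 and 15 of /proc/<pid>/stat)
>     with open("/proc/%d/stat" % pid) as f:
>         after_comm = f.read().rsplit(")", 1)[1].split()
>     return int(after_comm[11]) + int(after_comm[12])
>
> proc = subprocess.Popen(sys.argv[1:])   # usage: auditor.py hd_ana <args...>
> last_ticks  = cpu_ticks(proc.pid)
> last_change = time.time()
>
> while proc.poll() is None:
>     time.sleep(POLL)
>     try:
>         ticks = cpu_ticks(proc.pid)
>     except (IOError, OSError):
>         break                             # process exited between checks
>     if ticks != last_ticks:
>         last_ticks, last_change = ticks, time.time()
>     elif time.time() - last_change > STALL_LIMIT:
>         proc.send_signal(signal.SIGKILL)  # frozen: no cpu used, give up
>         break
>
> sys.exit(proc.wait())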
>
> -Richard J.
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline