[Halld-offline] frozen jobs - is a watchdog monitor required for jana apps?
David Lawrence
davidl at jlab.org
Sun Dec 9 22:48:00 EST 2012
Hi All,
There is a way to treat this symptomatically, though it will not tell
us why this is happening.
The configuration parameter JANA:MAX_RELAUNCH_THREADS can be used to set
the "Max. number of times to relaunch a thread due to it timing out
before forcing program to quit."
(You can get a list of defined configuration parameters for a job by
adding the --dumpconfig argument to most jana/dana programs).
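For example, the parameter can be set on the command line with the usual -P
option (the value 5 below is only an illustration, not a tuned recommendation):

   hd_ana -PJANA:MAX_RELAUNCH_THREADS=5 <your usual arguments and input file>

Running once with --dumpconfig first will show what the parameter is
currently set to on your build.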
Regards,
-David
On 12/9/12 8:58 PM, Richard Jones wrote:
> Hello dc1.1 watchers,
>
> I have run into a problem that I think we will need to solve somehow.
> I have been seeing good cpu allocation on the grid, stabilizing at
> around 5000 cpus over the weekend. However I noticed on Saturday that
> the rate at which jobs are finishing was as if only 4000 cpus were
> running, and this has continued to drop over the weekend. I
> investigated, and found that a certain fraction of the jobs are
> getting stuck in the middle of the dana processing step, and just
> hanging in mid-air. There are a few jobs that blow the memory budget
> and crash out, but not these ones. These jobs do not consume memory
> or even continue to spin cpu, but neither do they exit. They just go
> to sleep in some funny way, and remain in that state until someone
> comes along and kills them. This is what I see in the logs just prior
> to sending the kill signal:
>
> stderr:
> JANA ERROR>> Thread 0 hasn't responded in 300 seconds.
> (run:event=9000:37847) Cancelling ...
> JANA ERROR>>Caught HUP signal for thread 0x411d9940 thread exiting...
> JANA ERROR>> Launching new thread ...
>
> stdout:
> ... the normal old ticker display, frozen at the point where the
> problem occurred...
>
> The grid batch system does not kill the job until its cpu allocation
> expires, normally 24 hours on most sites. So my jobs that normally
> take 7 hours and change are hanging around for 24 hours and then
> getting evicted to start again. I have not watched long enough to see
> if the same job hangs in the same place the second time around. But
> when I checked this evening I found that more than 1000 of my running
> jobs were in this state.
>
> Here is the question: do we need to launch an auditor shell script to
> watch the hd_ana process and kill it if it hangs? I can do this, but
> I am not sure if this is the recommended solution. It seems like
> there is already a watcher thread inside JANA that is printing the
> error listed above to stderr, so would this be a watcher-watcher?
> Then what if the watcher-watcher hangs? I suppose one could have a
> watch-watcher-watcher that watches the watchering watch. In case you
> were wondering about that reference, Mark, that is Dr Seuss, a step up
> from Seinfeld.
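>
> In case a concrete example helps the discussion: below is a minimal,
> untested sketch of such an auditor in python. It launches the payload
> command, polls the accumulated cpu time in /proc, and kills the process
> if the counter stops advancing. The 900 s and 60 s thresholds are
> placeholders, not tuned numbers.
>
> #!/usr/bin/env python
> # Auditor sketch (untested): launch the payload, poll its accumulated
> # CPU time via /proc, and kill it if the counter has not advanced for
> # STALL_LIMIT seconds.
> import signal, subprocess, sys, time
>
> STALL_LIMIT = 900   # seconds with no CPU progress before calling it hung
> POLL        = 60    # polling interval in seconds
>
> def cpu_ticks(pid):
>     # utime + stime (fields 14 and 15 of /proc/<pid>/stat)
>     with open("/proc/%d/stat" % pid) as f:
>         after_comm = f.read().rsplit(")", 1)[1].split()
>     return int(after_comm[11]) + int(after_comm[12])
>
> proc = subprocess.Popen(sys.argv[1:])   # usage: auditor.py hd_ana <args...>
> last_ticks  = cpu_ticks(proc.pid)
> last_change = time.time()
>
> while proc.poll() is None:
>     time.sleep(POLL)
>     try:
>         ticks = cpu_ticks(proc.pid)
>     except (IOError, OSError):
>         break                             # process exited between checks
>     if ticks != last_ticks:
>         last_ticks, last_change = ticks, time.time()
>     elif time.time() - last_change > STALL_LIMIT:
>         proc.send_signal(signal.SIGKILL)  # frozen: no cpu used, give up
>         break
>
> sys.exit(proc.wait())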
>
> -Richard J.
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline