[Halld-offline] hangups in mcsmear
David Lawrence
davidl at jlab.org
Thu Feb 28 13:28:55 EST 2013
Hi Kei,
This sounds a lot like the Data Challenge problem. Have a look at slide
8 of my talk at the collaboration meeting:
http://argus.phys.uregina.ca/gluex/DocDB/0021/002173/001/20130222_TBD_lawrence.pdf
If this is caused by slow disk access, then this issue in mcsmear should
have been fixed in revision 10269 that went in on Jan. 17. In this case,
if you are using an older version of the code, there is a potential
work-around: Set the THREAD_TIMEOUT configuration parameter to something
larger than 30 seconds. This might give the network mounted disk enough
time to catch up and start writing again.
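For the work-around on an older build, JANA configuration parameters can
normally be passed on the command line with -P. A sketch of what that
would look like (the timeout value and input file name here are just
illustrations, not a tested recipe):

```shell
# Raise the JANA watchdog timeout from the 30 s default so a slow
# network-mounted disk has time to catch up before the thread is killed.
# (Assumption: standard JANA -Pname=value syntax; 300 s is an arbitrary
# generous choice, and hdgeant_smeared.hddm is a placeholder file name.)
mcsmear -PTHREAD_TIMEOUT=300 hdgeant.hddm
```

The parameter can also go in a JANA config file read at startup if you
prefer not to edit your job scripts.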
It is also possible that this is due to database hangs like you suggest.
(See bottom of slide 3 of Mark's talk here:
http://argus.phys.uregina.ca/gluex/DocDB/0021/002171/002/offline_collab_2013-02.pdf).
If this is the cause, then you should fall back to using either the
SQLite DB file with CCDB or the old flat-file system, which is still
available as part of JANA.
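For the SQLite fall-back, CCDB usually picks up its connection string
from the CCDB_CONNECTION environment variable. A sketch assuming that
convention (the file path is illustrative; use wherever your local copy
of the SQLite file actually lives):

```shell
# Point CCDB at a local SQLite snapshot instead of the central MySQL
# server, so many simultaneous jobs don't pile up on one database.
# (Assumption: sqlite:// connection-string form; path is a placeholder.)
export CCDB_CONNECTION=sqlite:////group/halld/ccdb.sqlite
mcsmear hdgeant.hddm
```

Since each job then reads its own local file, there is no shared server
to hang when 50 jobs start at once.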
-David
On 2/28/13 12:02 PM, Kei Moriya wrote:
> Dear offliners,
>
> I was running some jobs on our cluster just to see if I could
> mass-produce a large number of events, using hdgeant and mcsmear.
> I found that the hdgeant part worked for all 50 files that I
> submitted, but the mcsmear part tended to have a high failure
> rate of ~70-90%. The log files show that the programs were
> timing out after not being processed for more than 30 sec,
> and the annoying thing was that the cluster would report the
> jobs as running, whereas in reality the programs had already
> halted.
>
> The final part of the log files look like
> JANA ERROR>> Thread 0 hasn't responded in 30.5 seconds.
> (run:event=9000:2921) Cancelling ...
> JANA ERROR>>Caught HUP signal for thread 0x2b9a07b9e700 thread exiting...
> JANA ERROR>> Launching new thread ...
> JANA >>Merging thread 0 ...
> 2.9k events processed (2.9k events read) 2.0Hz (avg.: 36.3Hz)
>
> Does anybody know if this is because mcsmear is trying to access
> some outside database, and is hanging if there are too many jobs
> trying to access that file? And if so, is there a way out of this
> besides staggering the jobs and hoping that they don't start
> at the same time? Does this have anything to do with our failure
> rates during the data challenge?
>
> Thanks,
> Kei
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline