[Halld-offline] hangups in mcsmear

Kei Moriya kmoriya at indiana.edu
Thu Feb 28 12:02:07 EST 2013


Dear offliners,

I was running some jobs on our cluster to see whether I could
mass-produce a large number of events using hdgeant and mcsmear.
The hdgeant step worked for all 50 files that I submitted, but
the mcsmear step had a high failure rate of ~70-90%. The log
files show that the programs timed out after a thread had not
responded for more than 30 sec. The annoying part is that the
cluster kept reporting the jobs as running when in reality the
programs had already halted.

The final part of each log file looks like this:
JANA ERROR>> Thread 0 hasn't responded in 30.5 seconds. (run:event=9000:2921) Cancelling ...
JANA ERROR>>Caught HUP signal for thread 0x2b9a07b9e700 thread exiting...
JANA ERROR>> Launching new thread ...
JANA >>Merging thread 0 ...
   2.9k events processed  (2.9k events read)  2.0Hz  (avg.: 36.3Hz)
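
Since the cluster reports such jobs as still running, one way to spot
them is to scan the log files for the timeout message. The following
is only a minimal Python sketch: the pattern is taken from the log
excerpt above, the function names find_timeouts and job_hung are
made up for illustration, and it assumes the message wording is the
same in every log.

```python
import re

# Pattern modelled on the JANA timeout line quoted in the log excerpt.
# Assumption: the wording is identical across log files.
TIMEOUT_RE = re.compile(
    r"JANA ERROR>>\s*Thread (\d+) hasn't responded in ([\d.]+) seconds"
)


def find_timeouts(log_text):
    """Return a list of (thread_id, seconds) for each timeout message."""
    return [
        (int(m.group(1)), float(m.group(2)))
        for m in TIMEOUT_RE.finditer(log_text)
    ]


def job_hung(log_text):
    """True if the log contains at least one thread timeout message."""
    return bool(find_timeouts(log_text))
```

Calling job_hung on the text of each job's log would flag the jobs
that hit the timeout, regardless of what the batch system reports.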

Does anybody know whether this happens because mcsmear is trying
to access some outside database, and hangs when too many jobs try
to access it at the same time? If so, is there a way around this
besides staggering the jobs and hoping that they don't start
simultaneously? Could this be related to our failure rates during
the data challenge?
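
For reference, the staggering workaround mentioned above could be
sketched in Python roughly as follows. This is an illustration only:
the function names, the spacing and jitter values, and the idea of
deriving the delay from a job index are all assumptions, and it is
not known whether staggering actually avoids the hang.

```python
import random
import subprocess
import time


def stagger_delay(job_index, spacing_s=10.0, jitter_s=5.0, rng=random):
    """Delay in seconds for a job: fixed spacing per index plus jitter.

    spacing_s and jitter_s are illustrative values, not tuned numbers.
    """
    return job_index * spacing_s + rng.uniform(0.0, jitter_s)


def run_staggered(cmd, job_index, spacing_s=10.0, jitter_s=5.0):
    """Sleep for the staggered delay, then run cmd (a list of arguments).

    Returns the exit code of the command.
    """
    time.sleep(stagger_delay(job_index, spacing_s, jitter_s))
    return subprocess.run(cmd).returncode
```

A job wrapper could then call something like
run_staggered(["mcsmear", "hdgeant.hddm"], job_index=7), where the
input file name and the job index are hypothetical placeholders. The
random jitter keeps jobs that share an index spacing from starting in
lockstep, but this only reduces simultaneous starts and does not
address whatever mcsmear is waiting on.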

Thanks,
	Kei


