[Halld-offline] hangups in mcsmear

Thu Feb 28 13:41:06 EST 2013

Hi Dave,

Thanks for the clarification. I forgot to tell you that I was
working with the Jan 11 tag release, so my code missed your
fix by a few days. I could try working with the most recent tag
release and see if the problem persists. Maybe when running
a lot of mcsmear jobs I'll add in the THREAD_TIMEOUT flag to each
one. If nothing else works, I could try the other options.
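
For the record, here is a minimal sketch of how I might add the flag to each
job from a submission script. It assumes that mcsmear accepts JANA's
-PNAME=value syntax for configuration parameters, which I have not checked
for the Jan 11 tag. The helper name, the input file names, and the 120 s
value are placeholders of mine, not anything from the release.

```python
import shlex


def mcsmear_command(input_file, thread_timeout_s=120, executable="mcsmear"):
    """Build an mcsmear command line with THREAD_TIMEOUT raised.

    Assumption: JANA configuration parameters are passed on the command
    line as -PNAME=value and mcsmear forwards them to JANA.
    """
    # The suggested work-around only makes sense for values above the
    # 30 seconds at which the jobs were seen to time out.
    if thread_timeout_s <= 30:
        raise ValueError("THREAD_TIMEOUT should be larger than 30 seconds")
    return [executable, f"-PTHREAD_TIMEOUT={thread_timeout_s}", input_file]


if __name__ == "__main__":
    # Hypothetical input names; substitute the real hdgeant output files.
    for i in range(3):
        print(shlex.join(mcsmear_command(f"hdgeant_{i:03d}.hddm")))
```

The list returned by the helper can be handed directly to subprocess.run or
written into a batch script, one line per job.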

Thanks,
	Kei

On 2/28/13 1:28 PM, David Lawrence wrote:
> Hi Kei,
>
>     This sounds a lot like the Data Challenge problem. Have a look at slide
> 8 of my talk at the collaboration meeting:
>
> http://argus.phys.uregina.ca/gluex/DocDB/0021/002173/001/20130222_TBD_lawrence.pdf
>
> If this is caused by slow disk access, then this issue in mcsmear should
> have been fixed in revision 10269 that went in on Jan. 17. In this case,
> if you are using an older version of the code, there is a potential
> work-around: Set the THREAD_TIMEOUT configuration parameter to something
> larger than 30 seconds. This might give the network-mounted disk enough
> time to catch up and start writing again.
>
> It is also possible that this is due to database hangs like you suggest.
> (See bottom of slide 3 of Mark's talk here:
> http://argus.phys.uregina.ca/gluex/DocDB/0021/002171/002/offline_collab_2013-02.pdf).
> If this is the cause then you should fall back to using either the
> SQLite DB file with CCDB or the old flat file system which is still
> available as part of JANA.
>
> -David
>
> On 2/28/13 12:02 PM, Kei Moriya wrote:
>> Dear offliners,
>>
>> I was running some jobs on our cluster just to see if I could
>> mass-produce a large number of events, using hdgeant and mcsmear.
>> I found that the hdgeant part worked for all 50 files that I
>> submitted, but the mcsmear part tended to have a high failure
>> rate of ~70-90%. The log files show that the programs were
>> timing out after a thread had not responded for more than 30 sec,
>> and the annoying thing was that the cluster would report the
>> jobs as running, whereas in reality the programs had already
>> halted.
>>
>> The final part of the log files looks like
>> JANA ERROR>> Thread 0 hasn't responded in 30.5 seconds.
>> (run:event=9000:2921) Cancelling ...
>> JANA ERROR>>Caught HUP signal for thread 0x2b9a07b9e700 thread exiting...
>> JANA ERROR>> Launching new thread ...
>> JANA >>Merging thread 0 ...
>>      2.9k events processed  (2.9k events read)  2.0Hz  (avg.: 36.3Hz)
>>
>> Does anybody know if this is because mcsmear is trying to access
>> some outside database, and is hanging if there are too many jobs
>> trying to access that file? And if so, is there a way out of this
>> besides staggering the jobs and hoping that they don't start
>> at the same time? Does this have anything to do with our failure
>> rates during the data challenge?
>>
>> Thanks,
>> 	Kei
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline