[Halld-offline] hd_ana hanging after variable number of bggen events
David Lawrence
davidl at jlab.org
Tue Mar 15 09:21:52 EDT 2011
Hi Richard,
I am still working to reproduce this problem (again). JLab has
recently retired all of its 32-bit machines, and I believe this problem
has only appeared in 32-bit environments. At least, I was only able to
get it to occur in 32 bits and not in 64 bits. (If anyone has a different
experience, please let me know.)
As such, I'm resorting to working in a virtual machine, which has
got me thinking. I know there was a lot of talk at CHEP a few years ago
about how virtualization was going to be very useful on the GRID because
one could more or less bring the whole OS with you and therefore be
independent of the actual hardware. Do you know if this is currently
being implemented in GRID activities? It would sure make it a whole lot
easier for other people to help track bugs down if they could just grab
the VM. I'm just curious how feasible that really is.
Regards,
-David
On 3/14/11 7:50 AM, Richard Jones wrote:
> David,
>
> Most of the grid jobs that we submitted over the weekend were hung in
> mid-processing by hd_ana and had to be killed. The features of these
> hangs all seem to be the same. There are three threads, which I will
> call the main thread and two children. There is a deadlock between
> the main thread and the second child, both in
> "__lll_lock_wait_private()". I give the stack backtraces below. The
> hd_ana binaries I am running are default i686 builds (i87 math, no
> debug, default optimization) and so have no symbols, but by poking
> around on the stack I can reconstruct what happened. The error
> occurred in the second child thread, in a call to operator new(unsigned
> int), which triggered a segfault (signal 11). The code then entered
> the sighandler provided by root, which seems to be opening a child
> process -- very funny behavior for a severe crash recovery process,
> but anyway -- and deadlocks with the main thread waiting for some
> event that never happens.
>
> 1. We might want to rethink using the default root signal handling
> mechanism, or replace it with something more appropriate to the
> JANA framework. The root mechanism may not be thread-safe, or
> if it is, there seems to be some interference with the thread
> handling mechanism in JANA, which is causing this lockup.
> 2. Then there is the root cause of the segfault in
> DTrackCandidate_factory_CDC::FindThetaZRegression(). Have you
> run into this before?
>
> -Richard J.
>
>
> #0 0x40000402 in __kernel_vsyscall ()
> #1 0x008d9783 in __lll_lock_wait_private () from /lib/libc.so.6
> #2 0x00868a2a in _L_lock_43 () from /lib/libc.so.6
> #3 0x008618cb in ptmalloc_lock_all () from /lib/libc.so.6
> #4 0x0088cb2f in fork () from /lib/libc.so.6
> #5 0x00854bab in _IO_proc_open@@GLIBC_2.1 () from /lib/libc.so.6
> #6 0x00854e0a in popen@@GLIBC_2.1 () from
> /lib/libc.so.6
> #7 0x40220114 in TUnixSystem::OpenPipe(char const*, char const*) ()
> from /usr/local/root/lib/libCore.so
> #8 0x40227060 in TUnixSystem::StackTrace() () from
> /usr/local/root/lib/libCore.so
> #9 0x4022480e in TUnixSystem::DispatchSignals(ESignals) () from
> /usr/local/root/lib/libCore.so
> #10 0x402248dd in SigHandler(ESignals) () from
> /usr/local/root/lib/libCore.so
> #11 0x4021daa4 in sighandler(int) () from /usr/local/root/lib/libCore.so
> #12 <signal handler called>
> #13 0x00864e3f in _int_malloc () from /lib/libc.so.6
> #14 0x00866e97 in malloc () from /lib/libc.so.6
> #15 0x00d66ab7 in operator new(unsigned int) () from
> /usr/lib/libstdc++.so.6
> #16 0x081706be in
> DTrackCandidate_factory_CDC::FindThetaZRegression(DTrackCandidate_factory_CDC::DCDCSeed&)
> ()
> #17 0x08174c15 in DTrackCandidate_factory_CDC::evnt(jana::JEventLoop*,
> int) ()
> #18 0x08157bff in
> jana::JFactory<DTrackCandidate>::Get(std::vector<DTrackCandidate
> const*, std::allocator<DTrackCandidate const*> >&) ()
> #19 0x08159918 in jana::JFactory<DTrackCandidate>*
> jana::JEventLoop::GetFromFactory<DTrackCandidate>(std::vector<DTrackCandidate
> const*, std::allocator<DTrackCandidate const*> >&, char const*,
> jana::JEventLoop::data_source_t&) ()
> #20 0x0815e1e3 in jana::JFactory<DTrackCandidate>*
> jana::JEventLoop::Get<DTrackCandidate>(std::vector<DTrackCandidate
> const*, std::allocator<DTrackCandidate const*> >&, char const*) ()
> #21 0x08166f5d in DTrackCandidate_factory::evnt(jana::JEventLoop*, int) ()
>