[Halld-offline] hd_ana hanging after variable number of bggen events
Richard Jones
richard.t.jones at uconn.edu
Mon Mar 14 07:50:45 EDT 2011
David,
Most of the grid jobs that we submitted over the weekend hung in mid-processing inside hd_ana and had to be killed. The hangs all seem to have the same features. There are three threads, which I will call the main thread and two children. There is a deadlock between the main thread and the second child, both in "__lll_lock_wait_private()". I give the stack backtraces below.
The hd_ana binaries I am running are default i686 builds (i87 math, no debug, default optimization) and so have no symbols, but by poking around on the stack I can reconstruct what happened. The error occurred in the second child thread, in a call to operator new(unsigned int), which triggered a segfault (signal 11). The code then entered the signal handler provided by root, which seems to be opening a child process (very odd behavior for a severe crash recovery process, but anyway) and deadlocked with the main thread, waiting for some event that never happens.
1. We might want to rethink using the default root signal handling mechanism, or replace it with something more appropriate to the JANA framework. The root mechanism may not be thread-safe, or if it is, there seems to be some interference with the thread handling mechanism in JANA, which is causing this lockup.
2. Then there is the root cause of the segfault in DTrackCandidate_factory_CDC::FindThetaZRegression(). Have you run into this before?
-Richard J.
#0 0x40000402 in __kernel_vsyscall ()
#1 0x008d9783 in __lll_lock_wait_private () from /lib/libc.so.6
#2 0x00868a2a in _L_lock_43 () from /lib/libc.so.6
#3 0x008618cb in ptmalloc_lock_all () from /lib/libc.so.6
#4 0x0088cb2f in fork () from /lib/libc.so.6
#5 0x00854bab in _IO_proc_open@@GLIBC_2.1 () from /lib/libc.so.6
#6 0x00854e0a in popen@@GLIBC_2.1 () from /lib/libc.so.6
#7 0x40220114 in TUnixSystem::OpenPipe(char const*, char const*) () from /usr/local/root/lib/libCore.so
#8 0x40227060 in TUnixSystem::StackTrace() () from /usr/local/root/lib/libCore.so
#9 0x4022480e in TUnixSystem::DispatchSignals(ESignals) () from /usr/local/root/lib/libCore.so
#10 0x402248dd in SigHandler(ESignals) () from /usr/local/root/lib/libCore.so
#11 0x4021daa4 in sighandler(int) () from /usr/local/root/lib/libCore.so
#12 <signal handler called>
#13 0x00864e3f in _int_malloc () from /lib/libc.so.6
#14 0x00866e97 in malloc () from /lib/libc.so.6
#15 0x00d66ab7 in operator new(unsigned int) () from /usr/lib/libstdc++.so.6
#16 0x081706be in DTrackCandidate_factory_CDC::FindThetaZRegression(DTrackCandidate_factory_CDC::DCDCSeed&) ()
#17 0x08174c15 in DTrackCandidate_factory_CDC::evnt(jana::JEventLoop*, int) ()
#18 0x08157bff in jana::JFactory<DTrackCandidate>::Get(std::vector<DTrackCandidate const*, std::allocator<DTrackCandidate const*> >&) ()
#19 0x08159918 in jana::JFactory<DTrackCandidate>* jana::JEventLoop::GetFromFactory<DTrackCandidate>(std::vector<DTrackCandidate const*, std::allocator<DTrackCandidate const*> >&, char const*, jana::JEventLoop::data_source_t&) ()
#20 0x0815e1e3 in jana::JFactory<DTrackCandidate>* jana::JEventLoop::Get<DTrackCandidate>(std::vector<DTrackCandidate const*, std::allocator<DTrackCandidate const*> >&, char const*) ()
#21 0x08166f5d in DTrackCandidate_factory::evnt(jana::JEventLoop*, int) ()