<html>
<head>
</head>
<body text="#000000" bgcolor="#ffffff">
David,<br>
<br>
Most of the grid jobs that we submitted over the weekend hung
in mid-processing inside hd_ana and had to be killed. The features of
these hangs all seem to be the same. There are three threads, which
I will call the main thread and two children. There is a deadlock
between the main thread and the second child, both in
"__lll_lock_wait_private()". I give the stack backtraces below.
The hd_ana binaries I am running are default i686 builds (i87 math,
no debug, default optimization) and so have no symbols, but by
poking around on the stack I can reconstruct what happened. The
error occurred in the second child thread, in a call to operator
new(unsigned int), which triggered a segfault (signal 11). The code
then entered the signal handler provided by ROOT, which seems to be
opening a child process through popen() to produce a stack trace --
very funny behavior for a severe crash recovery process, but anyway --
and deadlocks with the main thread waiting for some event that never
happens. Note that the handler interrupted _int_malloc(), and the
fork() inside popen() is then stuck in ptmalloc_lock_all(), so the
malloc locks are part of the picture.<br>
<ol>
<li>We might want to rethink using the default ROOT signal
handling mechanism, or replace it with something more
appropriate to the JANA framework. The ROOT mechanism may not
be thread-safe, or if it is, there seems to be some interference
with the thread handling mechanism in JANA that is causing
this lockup.</li>
<li>Then there is the root cause of the segfault in
DTrackCandidate_factory_CDC::FindThetaZRegression(). Have you
run into this before?</li>
</ol>
-Richard J.<br>
<br>
<br>
#0 0x40000402 in __kernel_vsyscall ()<br>
#1 0x008d9783 in __lll_lock_wait_private () from /lib/libc.so.6<br>
#2 0x00868a2a in _L_lock_43 () from /lib/libc.so.6<br>
#3 0x008618cb in ptmalloc_lock_all () from /lib/libc.so.6<br>
#4 0x0088cb2f in fork () from /lib/libc.so.6<br>
#5 0x00854bab in _IO_proc_open@@GLIBC_2.1 () from /lib/libc.so.6<br>
#6 0x00854e0a in popen@@GLIBC_2.1 () from /lib/libc.so.6<br>
#7 0x40220114 in TUnixSystem::OpenPipe(char const*, char const*) ()
from /usr/local/root/lib/libCore.so<br>
#8 0x40227060 in TUnixSystem::StackTrace() () from
/usr/local/root/lib/libCore.so<br>
#9 0x4022480e in TUnixSystem::DispatchSignals(ESignals) () from
/usr/local/root/lib/libCore.so<br>
#10 0x402248dd in SigHandler(ESignals) () from
/usr/local/root/lib/libCore.so<br>
#11 0x4021daa4 in sighandler(int) () from
/usr/local/root/lib/libCore.so<br>
#12 &lt;signal handler called&gt;<br>
#13 0x00864e3f in _int_malloc () from /lib/libc.so.6<br>
#14 0x00866e97 in malloc () from /lib/libc.so.6<br>
#15 0x00d66ab7 in operator new(unsigned int) () from
/usr/lib/libstdc++.so.6<br>
#16 0x081706be in
DTrackCandidate_factory_CDC::FindThetaZRegression(DTrackCandidate_factory_CDC::DCDCSeed&)
()<br>
#17 0x08174c15 in
DTrackCandidate_factory_CDC::evnt(jana::JEventLoop*, int) ()<br>
#18 0x08157bff in
jana::JFactory&lt;DTrackCandidate&gt;::Get(std::vector&lt;DTrackCandidate
const*, std::allocator&lt;DTrackCandidate const*&gt; &gt;&) ()<br>
#19 0x08159918 in jana::JFactory&lt;DTrackCandidate&gt;*
jana::JEventLoop::GetFromFactory&lt;DTrackCandidate&gt;(std::vector&lt;DTrackCandidate
const*, std::allocator&lt;DTrackCandidate const*&gt; &gt;&, char
const*, jana::JEventLoop::data_source_t&) ()<br>
#20 0x0815e1e3 in jana::JFactory&lt;DTrackCandidate&gt;*
jana::JEventLoop::Get&lt;DTrackCandidate&gt;(std::vector&lt;DTrackCandidate
const*, std::allocator&lt;DTrackCandidate const*&gt; &gt;&, char
const*) ()<br>
#21 0x08166f5d in DTrackCandidate_factory::evnt(jana::JEventLoop*,
int) ()<br>
<br>
</body>
</html>