<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
Hi Richard,<br>
<br>
DTrackCandidate_factory_CDC::FindThetaZRegression() has come
up before: Kei and Jake reported problems with DANA programs
hanging as far back as December. This led to the "fix" currently
in use (though not yet incorporated into the build system), which
turns off optimization when compiling
DTrackCandidate_factory_CDC.cc. So far we have not been able to
pin down the bug itself because the behavior is not deterministic.<br>
<br>
I will take another look at this today to see if I can make some
more headway on the problem.<br>
<br>
Regards,<br>
-Dave<br>
<br>
On 3/14/11 7:50 AM, Richard Jones wrote:
<blockquote cite="mid:4D7E0115.1040904@uconn.edu" type="cite">
David,<br>
<br>
Most of the grid jobs that we submitted over the weekend were hung
in mid-processing by hd_ana and had to be killed. The features of
these hangs all seem to be the same. There are three threads,
which I will call the main thread and two children. There is a
deadlock between the main thread and the second child, both in
"__lll_lock_wait_private()". I give the stack backtraces below.
The hd_ana binaries I am running are default i686 builds (i387
math, no debug, default optimization) and so have no symbols, but
by poking around on the stack I can reconstruct what happened.
The error occurred in the second child thread, in a call to operator
new(unsigned int), which triggered a segfault (signal 11). The
code then entered the sighandler provided by root, which seems to
be opening a child process -- very funny behavior for a severe
crash recovery process, but anyway -- and deadlocks with the main
thread waiting for some event that never happens.<br>
<ol>
<li>We might want to rethink using the default root signal
handling mechanism, or replace it with something more
appropriate to the JANA framework. The root mechanism may not
be thread-safe, or if it is, there seems to be some
interference with the thread handling mechanism in JANA, which
is causing this lockup.</li>
<li>Then there is the root cause of the segfault in
DTrackCandidate_factory_CDC::FindThetaZRegression(). Have you
run into this before?</li>
</ol>
-Richard J.<br>
<br>
<br>
#0 0x40000402 in __kernel_vsyscall ()<br>
#1 0x008d9783 in __lll_lock_wait_private () from /lib/libc.so.6<br>
#2 0x00868a2a in _L_lock_43 () from /lib/libc.so.6<br>
#3 0x008618cb in ptmalloc_lock_all () from /lib/libc.so.6<br>
#4 0x0088cb2f in fork () from /lib/libc.so.6<br>
#5 0x00854bab in _IO_proc_open@@GLIBC_2.1 () from /lib/libc.so.6<br>
#6 0x00854e0a in popen@@GLIBC_2.1 () from /lib/libc.so.6<br>
#7 0x40220114 in TUnixSystem::OpenPipe(char const*, char const*)
() from /usr/local/root/lib/libCore.so<br>
#8 0x40227060 in TUnixSystem::StackTrace() () from
/usr/local/root/lib/libCore.so<br>
#9 0x4022480e in TUnixSystem::DispatchSignals(ESignals) () from
/usr/local/root/lib/libCore.so<br>
#10 0x402248dd in SigHandler(ESignals) () from
/usr/local/root/lib/libCore.so<br>
#11 0x4021daa4 in sighandler(int) () from
/usr/local/root/lib/libCore.so<br>
#12 <signal handler called><br>
#13 0x00864e3f in _int_malloc () from /lib/libc.so.6<br>
#14 0x00866e97 in malloc () from /lib/libc.so.6<br>
#15 0x00d66ab7 in operator new(unsigned int) () from
/usr/lib/libstdc++.so.6<br>
#16 0x081706be in
DTrackCandidate_factory_CDC::FindThetaZRegression(DTrackCandidate_factory_CDC::DCDCSeed&)
()<br>
#17 0x08174c15 in
DTrackCandidate_factory_CDC::evnt(jana::JEventLoop*, int) ()<br>
#18 0x08157bff in
jana::JFactory<DTrackCandidate>::Get(std::vector<DTrackCandidate
const*, std::allocator<DTrackCandidate const*> >&) ()<br>
#19 0x08159918 in jana::JFactory<DTrackCandidate>*
jana::JEventLoop::GetFromFactory<DTrackCandidate>(std::vector<DTrackCandidate
const*, std::allocator<DTrackCandidate const*> >&,
char const*, jana::JEventLoop::data_source_t&) ()<br>
#20 0x0815e1e3 in jana::JFactory<DTrackCandidate>*
jana::JEventLoop::Get<DTrackCandidate>(std::vector<DTrackCandidate
const*, std::allocator<DTrackCandidate const*> >&,
char const*) ()<br>
#21 0x08166f5d in DTrackCandidate_factory::evnt(jana::JEventLoop*,
int) ()<br>
<br>
</blockquote>
</body>
</html>