[Halld-offline] mcsmear hangs
David Lawrence
davidl at jlab.org
Wed Dec 19 22:36:52 EST 2012
Hi Richard et al.,
I have captured a mcsmear hang at the point of failure running on an
ifarm machine at JLab.
I've cut and pasted the relevant info into a file that is copied below.
As previously reported,
the cause of the hang is due to a 30 second timeout being exceeded in
processing a
single event. The processing thread is signaled to quit and it does so,
but dies while still
holding a mutex lock. The replacement thread tries locking this mutex,
but is blocked since
the now-dead-thread still holds it. This results in the deadlock
situation of the
process being hung with 0% CPU usage.
The mystery as to why an event sometimes needs more than 30 seconds
to process and write
out remains. This is not a reproducible behavior. Running mcsmear
repeatedly on the same
input file with the same random number seeds produces identical output,
but does not always
result in a hang. When it does, it does not ever seem to be on the same
event indicating it is
probably not due to just a very busy event taking a long time. My best
guess is that it may
be a result of a slowdown on the network used by the mounted filesystem
that the event is
being written to. This is not consistent with Mark's report of seeing
the same hangs on the
JLab farm jobs while they are writing to the local disk.
Regardless of the cause for the slow events, the deadlock situation
should be fixed since it will
always be a possibility with the current code. I am currently trying to
implement a possible fix
for that problem that requires replacing the HUP signal handler
installed by JANA with one that
tries to unlock the mutex. The mutex itself also has to be changed to
ERRORCHECKING type so
unlocking it if it is not locked by us will not result in a seg. fault.
Regards,
-David
These came from a hanging event captured under the
following circumstances:
1. Running on ifarm1101. The command was:
nice +10 mcsmear_d -PPLUGINS=janarate -PRATE:PERIOD_EVENTS=1
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm hdgeant.hddm
hdgeant.hddm hdgeant.hddm hdgeant.hddm
2. After event processing started, gdb was attached
to the process in a different window. It was done
this way because gdb seems to get a long stream of
SIGTOU signals when the program is started from
within it. Attaching to the PID externally seems to
bypass that
3. A break point was set on thread_HUP_sighandler
and the process allowed to continue
The last bit of output on the screen before it hung was:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - -
JANA >>
JANA >> --- Configuration Parameters --
JANA >> PLUGINS = janarate
JANA >> RATE:PERIOD_EVENTS = 1
JANA >> THREAD_TIMEOUT = 30 seconds
JANA >> -------------------------------
JANA ERROR>> didn't sleep full 0.5 seconds!d) 68.0Hz (avg.: 62.7Hz)
JANA >>Opening source "hdgeant.hddm" of type: HDDM0Hz (avg.: 61.7Hz)
JANA >>Opening source "hdgeant.hddm" of type: HDDM4.0Hz (avg.: 61.5Hz)
JANA ERROR>> didn't sleep full 0.5 seconds!ead) 0.0Hz (avg.: 61.7Hz)
JANA >>Opening source "hdgeant.hddm" of type: HDDM.0Hz (avg.: 61.8Hz)
JANA >>Opening source "hdgeant.hddm" of type: HDDM0.0Hz (avg.: 61.9Hz)
JANA ERROR>> didn't sleep full 0.5 seconds!ead) 70.0Hz (avg.: 61.9Hz)
JANA ERROR>> didn't sleep full 0.5 seconds!ead) 70.0Hz (avg.: 61.9Hz)
JANA >>Opening source "hdgeant.hddm" of type: HDDM2.0Hz (avg.: 61.8Hz)
JANA >>Opening source "hdgeant.hddm" of type: HDDM2.0Hz (avg.: 61.8Hz)
JANA >>Opening source "hdgeant.hddm" of type: HDDM2.0Hz (avg.: 61.9Hz)
JANA ERROR>> Thread 0 hasn't responded in 30 seconds.
(run:event=9000:39276) Cancelling ...
JANA ERROR>> Thread 0 hasn't responded in 30.5 seconds.
(run:event=9000:39276) Cancelling ...
JANA ERROR>> Thread 0 hasn't responded in 31 seconds.
(run:event=9000:39276) Cancelling ...
JANA ERROR>> Thread 0 hasn't responded in 31.5 seconds.
(run:event=9000:39276) Cancelling ...
JANA ERROR>> Thread 0 hasn't responded in 32 seconds.
(run:event=9000:39276) Cancelling ...
JANA ERROR>> Thread 0 hasn't responded in 32.5 seconds.
(run:event=9000:39276) Cancelling ...
JANA ERROR>> Thread 0 hasn't responded in 33 seconds.
(run:event=9000:39276) Cancelling ...
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - -
The following is info from gdb
--------------------------------
=============================================================
(gdb) info threads
3 Thread 0x41ed4940 (LWP 2551) 0x0000003fd520a899 in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 2 Thread 0x428d5940 (LWP 2552) 0x0000003fd46c55ab in write ()
from /lib64/libc.so.6
1 Thread 0x2b4e788bee80 (LWP 2509) 0x0000003fd520db71 in nanosleep ()
from /lib64/libpthread.so.0
=============================================================
(gdb) thread 1
[Switching to thread 1 (Thread 0x2b4e788bee80 (LWP 2509))]#0
0x0000003fd520db71 in nanosleep () from /lib64/libpthread.so.0
(gdb) bt
#0 0x0000003fd520db71 in nanosleep () from /lib64/libpthread.so.0
#1 0x0000000000923554 in jana::JApplication::Run (this=0x7fff3a5d9970,
proc=0x7fff3a5d9de0, Nthreads=1) at JApplication.cc:1233
#2 0x00000000008137c1 in main (narg=61, argv=0x7fff3a5da268) at
mcsmear.cc:345
=============================================================
(gdb) thread 2
[Switching to thread 2 (Thread 0x428d5940 (LWP 2552))]#0
0x0000003fd46c55ab in write () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003fd46c55ab in write () from /lib64/libc.so.6
#1 0x0000003fd466ba63 in _IO_new_file_write () from /lib64/libc.so.6
#2 0x0000003fd466b976 in _IO_new_do_write () from /lib64/libc.so.6
#3 0x0000003fd466d4f9 in _IO_switch_to_get_mode_internal ()
from /lib64/libc.so.6
#4 0x0000003fd466bc58 in _IO_new_file_seekoff () from /lib64/libc.so.6
#5 0x0000003fd466a2c5 in ftello64 () from /lib64/libc.so.6
#6 0x00000000009063fc in xdr_getpos64 (xdrs=0x113a92a0) at hddm_s.c:1561
#7 0x0000000000912a84 in pack_s_FdcCathodeHits (xdrs=0x113a92a0,
this1=0x2aaaac16ccd0) at hddm_s.c:6570
#8 0x0000000000912951 in pack_s_FdcCathodeStrips (xdrs=0x113a92a0,
this1=0x19e96090) at hddm_s.c:6535
#9 0x000000000091224a in pack_s_FdcChambers (xdrs=0x113a92a0,
this1=0x1a13f8e0) at hddm_s.c:6389
#10 0x0000000000912092 in pack_s_ForwardDC (xdrs=0x113a92a0,
this1=0x1a13f8c0)
at hddm_s.c:6343
#11 0x00000000009113b3 in pack_s_HitView (xdrs=0x113a92a0, this1=0x1a13e780)
at hddm_s.c:6015
#12 0x00000000009104e5 in pack_s_PhysicsEvents (xdrs=0x113a92a0,
this1=0x1a13e4c0) at hddm_s.c:5619
#13 0x000000000091022f in pack_s_HDDM (xdrs=0x113a92a0, this1=0x1a13e4a0)
at hddm_s.c:5547
#14 0x00000000009191f4 in flush_s_HDDM (this1=0x1a13e4a0, fp=0x145efd50)
at hddm_s.c:8767
#15 0x000000000081944b in MyProcessor::evnt (this=0x7fff3a5d9de0,
loop=0x12624340, eventnumber=39276) at MyProcessor.cc:70
#16 0x000000000095ad70 in jana::JEventLoop::OneEvent (this=0x12624340)
at JEventLoop.cc:537
#17 0x000000000095b2cb in jana::JEventLoop::Loop (this=0x12624340)
at JEventLoop.cc:411
#18 0x0000000000924c18 in LaunchThread (arg=0x7fff3a5d9970)
at JApplication.cc:1028
#19 0x0000003fd5206367 in start_thread () from /lib64/libpthread.so.0
#20 0x0000003fd46d2f7d in clone () from /lib64/libc.so.6
=============================================================
(gdb) thread 3
[Switching to thread 3 (Thread 0x41ed4940 (LWP 2551))]#0
0x0000003fd520a899 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
(gdb) bt
#0 0x0000003fd520a899 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
#1 0x00000000009260fc in jana::JApplication::EventBufferThread (
this=0x7fff3a5d9970) at JApplication.cc:528
#2 0x00000000009262bb in LaunchEventBufferThread (arg=0x7fff3a5d9970)
at JApplication.cc:492
#3 0x0000003fd5206367 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003fd46d2f7d in clone () from /lib64/libc.so.6
More information about the Halld-offline
mailing list