[Halld-offline] diagnosis of cause for segfaults in DTrackCandidate_factory_CDC::FindThetaZRegression()
Curtis A. Meyer
cmeyer at ernest.phys.cmu.edu
Thu Mar 17 08:34:10 EDT 2011
Hi Richard -
that was an amazing piece of detective work! The question now becomes
what can we do to our code to protect us from these sorts of issues?
Clearly a good point of discussion in an offline meeting in the
not-too-distant future. I guess a good starting point would be to ask
where in the code can we get hit with this sort of issue (and have it
matter)?
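
To make that question concrete, here is a hypothetical sketch of one
kind of place worth checking. It is not taken from our code, and the
names Hit, key and sort_by_radius are made up for illustration. The
assumption is that the sensitive spots are comparisons of doubles where
one side may still sit in an 80-bit x87 register while the other has
already been rounded to 64 bits in memory. A sort comparator that
recomputes its key on every call is a commonly cited example, because
std::sort needs a consistent ordering and has undefined behavior
without one. Computing the key once and storing it is one possible
guard:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical hit type, for illustration only (not a GlueX class).
struct Hit {
    double x;
    double y;
    double key;  // sort key, computed once and stored in memory
};

// Sort hits by transverse radius.
// The key is computed once per hit and written to a 64-bit double in
// memory, so the comparator only ever compares stored values. A comparator
// that recomputed sqrt(x*x + y*y) on each call could, under x87 math,
// compare an 80-bit register value with a value already rounded to 64 bits
// and so give inconsistent answers for the same pair of hits.
void sort_by_radius(std::vector<Hit>& hits) {
    for (Hit& h : hits) {
        h.key = std::sqrt(h.x * h.x + h.y * h.y);
    }
    std::sort(hits.begin(), hits.end(),
              [](const Hit& a, const Hit& b) { return a.key < b.key; });
}
```

Other mitigations that may be worth weighing are the g++ options
-ffloat-store, or -mfpmath=sse together with -msse2, for the -m32
builds. Whether any of these fits our case is something to settle in
the discussion.
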
curtis
On 3/16/11 11:02 PM, Richard Jones wrote:
> Dear colleagues,
>
> I have reproduced and diagnosed the segfaults that take place in the
> current GlueX reconstruction code when compiled for the i686
> platform. Note that they also occur on 64-bit hardware when running
> the 32-bit executable, so it is not just a 32-bit hardware issue. The
> explanation is a bit too long for email, so I have written it up in
> the form of a wiki page. Please see it at the following URL.
>
> http://www.jlab.org/Hall-D/software/wiki/index.php/Diagnosing_segmentation_faults_in_reconstruction_software
>
>
> In that wiki page, I also explain why this should not be considered to
> be a compiler optimization bug, but rather a bug in our user code, in
> the context of x87 math. This holds even though recompiling
> with -O0 seemed to solve it! In fact, turning off optimization is not
> a reliable solution, and the current bug probably will break out again
> in -O0 code in the near future, as g++ continues to evolve. What is
> more, in considering the impact of this bug, the segfault is really
> only the tip of the iceberg. I would expect this problem to be
> happening much more often in -m32 builds, but only showing up as
> segfaults in the (rare?) case that the memory between the valid data
> and the end of the valid data segment contains all zeros. In what
> might be the more normal occurrence of this bug, we could be getting
> bogus results from the tracking and not know it. In other words, the
> segfault is your friend.
>
> Besides this, there is the more serious issue of how robust the rest
> of the code is against what I might call the "x87 entropy problem"
> with randomly fluctuating least-significant bits in doubles. This
> probably warrants a broader discussion, beyond the resolution of this
> particular bug.
>
> -Richard J.
>
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline
--
Prof. Curtis A. Meyer Department of Physics
Phone: (412) 268-2745 Carnegie Mellon University
Fax: (412) 681-0648 Pittsburgh PA 15213-3890
cmeyer at ernest.phys.cmu.edu http://www.curtismeyer.com/