[Halld-offline] diagnosis of cause for segfaults in DTrackCandidate_factory_CDC::FindThetaZRegression()

David Lawrence davidl at jlab.org
Wed Mar 16 23:30:25 EDT 2011


Hi Richard,

     This is a nice piece of detective work. I believe there is a 
violation of the Wolin principle of least astonishment in there 
somewhere since one of the main advantages of working in a high level 
language is to NOT have to worry about the hardware details underneath. 
This does seem to indicate we can't remain so blissfully ignorant of 
those details.

     Thanks for tracking this down.

Regards,
-David

On 3/16/11 11:02 PM, Richard Jones wrote:
> Dear colleagues,
>
> I have reproduced and diagnosed the segfaults that take place in the 
> current GlueX reconstruction code, when compiled for the i686 
> platform.  Note that they also occur on 64-bit hardware when running 
> the 32-bit executable, so it is not just a 32-bit issue.  The 
> explanation is a bit too long for email, so I have written it up in 
> the form of a wiki page.  Please see it at the following URL.
>
> http://www.jlab.org/Hall-D/software/wiki/index.php/Diagnosing_segmentation_faults_in_reconstruction_software 
>
>
> In that wiki page, I also explain why this should not be considered to 
> be a compiler optimization bug, but rather a bug in our user code, in 
> the context of x87 math.  That, in spite of the fact that recompiling 
> with -O0 seemed to solve it!  In fact, turning off optimization is not 
> a reliable solution, and the current bug probably will break out again 
> in -O0 code in the near future, as g++ continues to evolve.  What is 
> more, in considering the impact of this bug, the segfault is really 
> only the tip of the iceberg.  I would expect this problem to be 
> happening much more often in -m32 builds, but only showing up as 
> segfaults in the (rare?) case that the memory between the valid data 
> and the end of the valid data segment contains all zeros.  In what 
> might be the more normal occurrence of this bug, we could be getting 
> bogus results from the tracking and not know it.  In other words, the 
> segfault is your friend.
>
> Besides this, there is the more serious issue of how robust the rest 
> of the code is against what I might call the "x87 entropy problem" 
> with randomly fluctuating least-significant bits in doubles.  This 
> probably warrants a broader discussion, beyond the resolution of this 
> particular bug.
>
> -Richard J.
>
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline
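
Editorial illustration (not part of either message): the email leaves the
actual mechanism to the wiki page, so the following is only a generic
sketch of one well-known way x87 excess precision can turn into a bug in
user code. If a sort comparator recomputes its key on every call, one side
of the comparison may still sit in an 80-bit x87 register while the other
has already been rounded to a 64-bit double in memory, and the ordering can
become inconsistent. Computing the key once, storing it, and comparing only
stored values avoids that. The Hit struct and all function names below are
hypothetical and are not taken from the GlueX code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical hit type for illustration only; not a GlueX class.
struct Hit {
    double x, y;
    double r;  // sort key, computed once and stored as a 64-bit double
};

// Potentially fragile in x87 builds: the key is recomputed on every call,
// so the two sides may be compared at different precisions (one still in
// an 80-bit register, the other rounded to 64 bits after a spill).
inline bool FragileLess(const Hit& a, const Hit& b) {
    return std::sqrt(a.x * a.x + a.y * a.y) < std::sqrt(b.x * b.x + b.y * b.y);
}

// More robust: compare only values that already live in memory as doubles,
// so every comparison of the same pair sees the same bits.
inline bool StoredKeyLess(const Hit& a, const Hit& b) {
    return a.r < b.r;
}

// Fill in the stored key for every hit, then sort on the stored key.
inline void SortByRadius(std::vector<Hit>& hits) {
    for (Hit& h : hits) {
        h.r = std::sqrt(h.x * h.x + h.y * h.y);
    }
    std::sort(hits.begin(), hits.end(), StoredKeyLess);
}
```

A caller would fill a std::vector<Hit> and call SortByRadius(hits). GCC
also offers -ffloat-store and -msse2 -mfpmath=sse, which change how doubles
are held in 32-bit builds; whether either is appropriate for this code is
not something the email addresses.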