[Halld-offline] non-reproducibility study

Matthew Shepherd mashephe at indiana.edu
Wed Mar 12 15:59:22 EDT 2014


Having just spent many frustrating hours hunting down my own, separate non-deterministic bug, I was motivated to run hd_dump -DTrackWireBased through valgrind.

The error below looks suspicious and could result in non-deterministic behaviour, although valgrind is known to report "errors" where there are none. I didn't have time to look at the code since I have to run to another meeting, but I thought I would pass it on.

Matt


==7443== Conditional jump or move depends on uninitialised value(s)    
==7443==    at 0x88F769: DTrackFitterKalmanSIMD::KalmanForwardCDC(double, DMatrix5x1&, DMatrix5x5&, double&, unsigned int&) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x8926C4: DTrackFitterKalmanSIMD::ForwardCDCFit(DMatrix5x1 const&, DMatrix5x5 const&) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x89762A: DTrackFitterKalmanSIMD::KalmanLoop() (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x898315: DTrackFitterKalmanSIMD::FitTrack() (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x84B8CB: DTrackFitter::FindHitsAndFitTrack(DKinematicData const&, DReferenceTrajectory const*, jana::JEventLoop*, double, int, double, DetectorSystem_t) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x8C25D7: DTrackWireBased_factory::DoFit(unsigned int, DTrackCandidate const*, DReferenceTrajectory*, jana::JEventLoop*, double) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x8C4803: DTrackWireBased_factory::evnt(jana::JEventLoop*, int) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x6C8F38: jana::JFactory<DTrackWireBased>::Get(std::vector<DTrackWireBased const*, std::allocator<DTrackWireBased const*> >&) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x6C97EC: jana::JFactory<DTrackWireBased>* jana::JEventLoop::GetFromFactory<DTrackWireBased>(std::vector<DTrackWireBased const*, std::allocator<DTrackWireBased const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x6C9A84: jana::JFactory<DTrackWireBased>* jana::JEventLoop::Get<DTrackWireBased>(std::vector<DTrackWireBased const*, std::allocator<DTrackWireBased const*> >&, char const*, bool) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x6CA153: jana::JFactory<DTrackWireBased>::GetNrows() (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
==7443==    by 0x571A39: MyProcessor::evnt(jana::JEventLoop*, int) (in /home/fs1/mashephe/gluex/my_src/bin/Linux_CentOS6-x86_64-gcc4.4.6/hd_dump)
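For context on the message above: valgrind reports "Conditional jump or move depends on uninitialised value(s)" when a branch (an if, a loop condition, a comparison) reads memory that was never written. The result of such a branch can change from run to run, which is one way non-determinism creeps in. We have not inspected KalmanForwardCDC, so the sketch below does not claim to reproduce the actual bug. It is a generic illustration of a minimum search, the kind of pattern that commonly triggers this warning when the running minimum is left uninitialized. The names Hit, MinDriftResult, FindMinDrift, minDriftTime and minDriftID are hypothetical and are not taken from DTrackFitterKalmanSIMD.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical hit record (not a real GlueX class).
struct Hit {
    unsigned int id;
    double driftTime;
};

// Hypothetical result of a minimum-drift-time search.
struct MinDriftResult {
    double minDriftTime;
    unsigned int minDriftID;
    bool found;
};

// If minDriftTime were declared without an initializer, the first
// comparison in the loop would read an indeterminate value. That is the
// pattern valgrind flags, and its outcome can differ between runs.
MinDriftResult FindMinDrift(const std::vector<Hit>& hits) {
    MinDriftResult r;
    r.minDriftTime = std::numeric_limits<double>::max(); // explicit start value
    r.minDriftID = 0;                                    // explicit default ID
    r.found = false;
    for (std::size_t i = 0; i < hits.size(); ++i) {
        // Strict '<' keeps the first hit on ties, so the result depends
        // only on the input order, never on leftover memory contents.
        if (hits[i].driftTime < r.minDriftTime) {
            r.minDriftTime = hits[i].driftTime;
            r.minDriftID = hits[i].id;
            r.found = true;
        }
    }
    return r;
}
```

For example, hits {7, 12.5}, {3, 4.0}, {9, 4.0} give minDriftID 3 and minDriftTime 4.0; an empty list leaves found false with the explicit defaults rather than garbage.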


---------------------------------------------------------------------
Matthew Shepherd, Associate Professor
Department of Physics, Indiana University, Swain West 265
727 East Third Street, Bloomington, IN 47405

Office Phone:  +1 812 856 5808

On Mar 12, 2014, at 1:46 PM, Paul Mattione <pmatt at jlab.org> wrote:

> I just checked DTrackWireBased::t0() (single-threaded, by saving to REST, then running hddm-xml and diff). The results are still sometimes inconsistent, and the affected event numbers vary from run to run.
> 
> Sometimes it's the difference between a t0 of 0 and -0, although sometimes it's not.  However, in every case the t0err is 10.0, which means it's coming from the DTrackFitterKalmanSIMD::mT0MinimumDriftTime variable.  Also, most (but not all) of the events with these discrepancies have different values for t0det (CDC or FDC).  This means that DTrackFitterKalmanSIMD::mMinDriftID is inconsistent as well.  
> 
> I've attached a log file showing the results of running the diff command on one of the file pairs.  Hopefully this helps narrow it down some more.  
> 
> - Paul
> 
> <log.txt>
> 
> 
> On Mar 12, 2014, at 11:56 AM, Paul Mattione wrote:
> 
>> It looks like calling TObject::SetObjectStat(kFALSE) (say, in the DApplication constructor) will avoid using the TObjectTable for all TObjects, but I don't know what the side effects of this will be.  
>> 
>> - Paul
>> 
>> On Mar 12, 2014, at 9:35 AM, Simon Taylor wrote:
>> 
>>> On 03/11/2014 05:32 PM, Paul Mattione wrote:
>>>> 3) It looks like everything through wire-based tracking is OK, except DTrackWireBased::t0() is non-deterministic (the rest of the tracking parameters are fine).  This was discovered by saving the DTrackWireBased results to REST instead of DTrackTimeBased in our local test builds, then running hddm-xml and diff on the output (I agree this is the best way to go).
>>> 
>>> I have checked in a change to DTrackFitterKalmanSIMD that I believe may address this issue. I slightly changed how and where this t0 is computed,
>>> although I do not understand why the previous method caused non-deterministic behavior. Please let me know if the new version helps.
>>> 
>>>> 4) It looks like the track-BCAL/FCAL/TOF/SC matching is now deterministic (it wasn't before). However, there are occasional differences when multithreading. Perhaps this is due to problems with 3)??
>>>> 
>>>> 5) It looks like time-based tracking still has issues, which are worse when running multithreaded.  Perhaps these are related to 3)??
>>>> 
>>> 
>>> I have been looking into the multi-threading issues with valgrind. There are some indications that our use of TVector2, TVector3, and TMatrix in various parts of the tracking code is not thread safe, because TObject (from which all of these classes inherit) apparently uses global memory through something called TObjectTable. I have developed my own DVector2 and DVector3 classes to avoid this TObject dependency, but
>>> it will take more effort to move away from TMatrix if this is what we want to do.
>>> 
>>> Simon
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline
> 

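On the t0 of 0 versus -0 that Paul mentions: IEEE 754 doubles have a signed zero. In C++ the two values compare equal but print differently, so a text-level diff of hddm-xml output flags them even though they are arithmetically the same number. The sketch below only illustrates that behavior. FormatT0, IsNegativeZero and NormalizeZero are hypothetical helpers, not part of the GlueX code, and normalizing the sign would hide the symptom rather than explain why the sign varies between runs.

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// Format a value the way a "%g"-style text dump might (hypothetical helper).
std::string FormatT0(double t0) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%g", t0);
    return std::string(buf);
}

// True only for -0.0: it compares equal to 0.0 but has its sign bit set.
bool IsNegativeZero(double x) {
    return x == 0.0 && std::signbit(x);
}

// Map -0.0 to +0.0 and leave every other value unchanged.
double NormalizeZero(double x) {
    return (x == 0.0) ? 0.0 : x;
}
```

With these, 0.0 == -0.0 is true, yet FormatT0(0.0) yields "0" while FormatT0(-0.0) yields "-0", which is exactly the kind of line a plain diff reports.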

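Simon's DVector2/DVector3 classes are not shown in the thread, so the following is only a sketch of the general idea under stated assumptions: a plain value type with no base class, no virtual functions and no static or global members, so that constructing or destroying one touches only its own data and never a process-wide registry such as ROOT's TObjectTable. The accessor names (X(), Y(), Mod(), Mod2(), Dot()) are assumptions modeled loosely on TVector2 and may not match the real DVector2.

```cpp
#include <cmath>
#include <type_traits>

// Hypothetical TObject-free 2-vector: two doubles and nothing else.
class DVector2 {
public:
    DVector2() : fX(0.0), fY(0.0) {}
    DVector2(double x, double y) : fX(x), fY(y) {}

    double X() const { return fX; }
    double Y() const { return fY; }

    double Mod2() const { return fX * fX + fY * fY; }
    double Mod() const { return std::sqrt(Mod2()); }
    double Dot(const DVector2& o) const { return fX * o.fX + fY * o.fY; }

    DVector2 operator+(const DVector2& o) const { return DVector2(fX + o.fX, fY + o.fY); }
    DVector2 operator-(const DVector2& o) const { return DVector2(fX - o.fX, fY - o.fY); }
    DVector2 operator*(double s) const { return DVector2(fX * s, fY * s); }

private:
    double fX;
    double fY;
};

// Compile-time checks that the type stays a plain value type:
// copies are bitwise, and there is no vtable (unlike TObject subclasses).
static_assert(std::is_trivially_copyable<DVector2>::value,
              "DVector2 should copy as plain data");
static_assert(!std::is_polymorphic<DVector2>::value,
              "DVector2 should carry no vtable");
```

Because each object owns only its two doubles, copies made in different threads share no hidden state; any remaining thread-safety questions then come only from data the tracking code itself shares.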
