[Halld-offline] Source of Inconsistency (This Case Fixed)
Paul Mattione
pmatt at jlab.org
Wed Mar 19 22:50:25 EDT 2014
I've found the source of the inconsistency we've been seeing: it's ultimately due to the fact that the associated-object map, JObject::associated, uses the object pointer as its key. This means that the objects are stored in the map in an effectively random order: they are ordered by the less-than operator, which for pointers compares memory addresses, and those addresses can differ from run to run. In the long run we may want to consider modifying this, but for now, whenever you get associated objects, you need to be very careful that your results do not depend on the order of the objects.
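To illustrate the ordering issue, here is a minimal sketch (not the JANA code). The Hit struct, its id field, and both helper functions are hypothetical stand-ins. It contrasts iterating a pointer-keyed std::map, which follows memory addresses, with iterating a map keyed on a stable identifier.

```cpp
#include <map>
#include <vector>

// Hypothetical stand-in for a JObject-derived hit. The `id` member is an
// assumed stable identifier, not a field of the real class.
struct Hit {
    int id;
    double energy;
};

// Returns the hits in the iteration order of a pointer-keyed map.
// std::map orders its keys with std::less<const Hit*>, i.e. by address.
std::vector<const Hit*> InAddressOrder(const std::vector<const Hit*>& hits) {
    std::map<const Hit*, int> associated;
    for (const Hit* h : hits) associated[h] = 0;
    std::vector<const Hit*> out;
    for (const auto& kv : associated) out.push_back(kv.first);
    return out;
}

// Returns the hits in the iteration order of a map keyed on the stable id,
// which does not depend on where the objects live in memory.
std::vector<const Hit*> InIdOrder(const std::vector<const Hit*>& hits) {
    std::map<int, const Hit*> byId;
    for (const Hit* h : hits) byId[h->id] = h;
    std::vector<const Hit*> out;
    for (const auto& kv : byId) out.push_back(kv.second);
    return out;
}
```

For objects allocated separately on the heap, the order returned by InAddressOrder follows whatever addresses the allocator handed out, so it can change between runs; the order returned by InIdOrder cannot.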
Fortunately, results normally do not depend on this order. However, in DTrackCandidate_factory::MatchMethod4(), when the DFDCPseudo* hits associated with the DTrackCandidate* object are grabbed, their random order is a problem. In this case, a sort routine, FDCHitSortByLayerincreasing, is applied to these hits, which normally removes the randomness. However, when hits are on the exact same layer and wire, they remain in a random order with respect to each other. There is a similar sort routine for the CDC, which has the same problem.
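The effect of such ties can be sketched as follows. This is an illustration under assumptions, not the actual FDCHitSortByLayerincreasing code: PseudoHit, its members, and both comparators are hypothetical. A comparator that looks only at layer and wire treats two hits on the same wire as equivalent, so their relative order after sorting is inherited from the input; adding a further field such as energy as a tie-breaker makes the result independent of the input order.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical stand-in for DFDCPseudo; the real class has different members.
struct PseudoHit {
    int layer;
    int wire;
    double energy;
};

// Orders by layer, then wire. Hits on the same layer and wire compare as
// equivalent, so their relative order is left to the input order.
bool ByLayerWire(const PseudoHit* a, const PseudoHit* b) {
    return std::tie(a->layer, a->wire) < std::tie(b->layer, b->wire);
}

// Same ordering, with energy as an additional tie-breaker.
bool ByLayerWireEnergy(const PseudoHit* a, const PseudoHit* b) {
    return std::tie(a->layer, a->wire, a->energy) <
           std::tie(b->layer, b->wire, b->energy);
}

// Returns a sorted copy. std::stable_sort keeps equivalent elements in
// their input order, which makes the dependence on input order explicit.
template <typename Cmp>
std::vector<const PseudoHit*> Sorted(std::vector<const PseudoHit*> hits, Cmp cmp) {
    std::stable_sort(hits.begin(), hits.end(), cmp);
    return hits;
}
```

Two hits with the same layer, wire, and energy would still tie under ByLayerWireEnergy, so a tie-breaker of this kind narrows the ambiguity rather than eliminating it in every conceivable case.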
These FDC hits are added along with CDC hits to a DHelicalFit object, which is used to try to link the hits together and get an improved estimate of the tracking parameters. Inside the fit function, DHelicalFit::FitLineRiemann(), there is another sort routine (DHFProjection_cmp) that sorts the hits by hit-z. This still doesn't fix the problem, though: FDC hits on the same wire have different x & y but the same z, so they still tie. Thus, when the linear-regression fit is performed and the distances between the xy-projections of successive hits are computed, the results depend on the order in which the tied hits happened to arrive.
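A small sketch of why equal-z hits matter (again an illustration, not the DHelicalFit code: Proj, the comparators, and PathLengthXY are hypothetical names, and the summed xy distance between successive hits stands in for the quantities used in the real fit). With a z-only comparator, the sum depends on the input order of the tied hits; breaking ties on the distance from the beamline removes that dependence.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical hit projection: position in x, y, z.
struct Proj {
    double x, y, z;
};

// Orders by z only; hits with equal z compare as equivalent.
bool ByZOnly(const Proj& a, const Proj& b) {
    return a.z < b.z;
}

// Orders by z, then by distance from the beamline (taken as the z axis).
bool ByZThenRadius(const Proj& a, const Proj& b) {
    if (a.z != b.z) return a.z < b.z;
    return std::hypot(a.x, a.y) < std::hypot(b.x, b.y);
}

// Sorts a copy of the hits and sums the xy distance between successive hits.
// std::stable_sort keeps equivalent hits in their input order.
template <typename Cmp>
double PathLengthXY(std::vector<Proj> hits, Cmp cmp) {
    std::stable_sort(hits.begin(), hits.end(), cmp);
    double sum = 0.0;
    for (std::size_t i = 1; i < hits.size(); ++i) {
        sum += std::hypot(hits[i].x - hits[i - 1].x, hits[i].y - hits[i - 1].y);
    }
    return sum;
}
```

For example, the hits (0,0,1), (1,0,2), (5,0,2), (2,0,3) give different sums under ByZOnly depending on which of the two z = 2 hits comes first in the input, while ByZThenRadius gives the same sum either way.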
I've checked in modifications to these sort routines so that ties are broken deterministically (tie-breakers: in DTrackCandidate_factory, sort CDC and FDC hits by energy, and in DHelicalFit, sort by the projection distance from the beamline). All 32 of my single-threaded jobs analyzing this 5k-event file now give identical results. This bug did not result from uninitialized values, memory errors, reading past the end of an array/vector, etc., and so was very difficult to track down. It hasn't been a fun few weeks, but hopefully most of the bugs have been stamped out now.
- Paul