[Halld-offline] Storage and Analysis of Reconstructed Events

Matthew Shepherd mashephe at indiana.edu
Sat Feb 11 11:13:48 EST 2012

Hi all,

We agreed at the offline meeting last week that there is a need to discuss storage schemes for data that has been reconstructed.  I think Paul suggested starting this discussion by email… so I'll give that a go.  Below are some salient features of the system we knew and loved at CLEO as they relate to some of the design decisions we need to make.  I know others (David and Richard in particular) have probably thought a lot about this, so I'm eager to hear their input.

We first have to decide what the starting point is for someone who wants to do an analysis, and I think the JANA framework should be the starting point for "end user" physics analysis.  This is not something we want to attempt with a separate script or standalone code.  Here are some reasons:

* JANA promotes modular code, which can greatly enhance analysis productivity.  For example, one can write a factory to provide reconstructed eta' candidates through various decay channels, and that factory can then be shared by anyone doing an eta' analysis.  (This feature provided huge productivity gains in CLEO.)
* It is likely that some "post-reconstruction" database information will be needed.  One can imagine wanting to know beam conditions, tweak calibrations, etc.  JANA is really set up for this.
* People will want to do kinematic fits that are particular to their analysis or topology or known backgrounds.  The JANA framework and objects are easily set up for that.
* Finally, JANA will be familiar and robust -- better to learn and debug one system rather than many individual analysis systems.

There may be other reasons, but in my opinion we should toss out any starting point that requires the user to begin with some list of four-vectors and error matrices in a file on disk.  We should deliver the data to the end user in the context of JANA, which has nice objects and relationships that are set up to do the analysis tasks the user wants to do.  This doesn't mean we can't also provide a canned analysis package that just dumps four-vectors to a ROOT file for quick and dirty plotting.  We should try to make it as easy as possible for the user to write a JANA processor or plugin or whatever to do their analysis job.  The results of this analysis job are then written to some ntuple or histogram for fitting, cut tuning, etc.

** File Content **

The above requirement means that we need a file format that contains all of the relevant information needed to reconstitute "high-level" analysis objects in JANA.  I imagine this means storing, for example, the showers, Kalman fitting output, etc. and then using that to rebuild objects in the PID library like DNeutralTrack, DTwoGammaFit (for pi^0s), DChargedTrack.  It is these high-level objects that can then be combined to form particles or kinematically fit by the user.

In CLEO we stored the bulk of data in some binary format that the collaboration developed.  The exact format is probably irrelevant but the system had a few nice features.  Any object in the analysis framework was storable if one wrote a "storage helper" that described how to pack and unpack it.  (The system did not require every object be storable -- in fact some objects we explicitly did not want the user to be able to store.)  When we did reconstruction (or if a user wanted to do some specialized skim) a simple command like:

output some_file.out { DNeutralTrack, DChargedTrack, …}

would store a whole list of objects to the file.  Of course there was a standard list for the initial reconstruction pass that did not include things like raw hits.  If one needed a specialized file for drift chamber calibration, for example, one just altered the list of output objects to include the raw hits.  There was an ASCII header at the top of each file describing the objects (like HDDM).  This is very useful!  A simple 'more' or 'head' tells you exactly what is in the file.

In practice, most analyzers just read in the standard compressed reconstruction data, but the functionality was there to make all the raw data available for "experts."  In fact, the system could read from and write to multiple files simultaneously with different lists of objects.  So in one pass one could write a "summary" file and a separate file that contained raw data for one of each 1000 events.  The former is the only one an analyzer needs.  But an expert could go back and load both files, synchronizing on the sparse raw-data file, and thereby read the raw data together with every 1000th event from the summary file.

** File Indexing **

I highly recommend reading the following article about EventStore, which, in principle, can be used directly with HDDM, ROOT, or evio:


Having random-access capability for the data is incredibly valuable.  Many of the physics channels in GlueX will be relatively sparse -- remember that our events will be dominated by diffractive vector meson production.  Something like EventStore would allow us to develop sets of sparse overlapping skims according to various criteria without actually replicating events.  To go with the eta' example above, one can imagine an eta' skim in which the EventStore is used to hop through all reconstructed events that have an eta' and access the data provided by some eta' reconstruction factory.  These skim choices are flexible and can evolve as our physics analysis goals evolve.  They are just collections of the same reconstructed events on disk.  If one wants to use the EventStore to write out a (new) skimmed file to export offsite for further analysis, that would also be trivial to do.

** Getting data from disk or from an algorithm **

The CLEO framework had the idea that data could either come from a factory or a source (EventStore for example).  If a factory was provided then the source was ignored.  The initial pass then enabled all the core reconstruction factories and wrote out the files.  The subsequent analysis pass (done by the user) did not load any of the core reconstruction factories and data was only provided from the files and high-level factories.  This scheme had the very useful feature that, in principle, some parts of the reconstruction could be redone at analysis time by simply loading the appropriate factories, provided that the data needed by these factories was also available.  In practice, this enabled analyses or studies "on the fringe" that would have otherwise been dismissed as impossible.  It wasn't the mainstream mode of operation (you don't want every analysis using specialized reconstruction routines), but it was a feature that was incredibly useful at times.

For example, suppose one is studying a channel for which the tracking algorithm was not quite optimized at the time of reconstruction.  Say it has multiple displaced vertices, low momentum tracks, etc.  If one can do a skim to select the fraction of events with these characteristics, then an improved tracking algorithm can be loaded, the raw data connected (via EventStore), and tracking redone easily on only those events without changing any of the subsequent analysis code.  In practice, this probably involves staging raw data, which is some work, but the alternative (reprocessing the whole data set) may be impossible.

I'm sure others will add to this, and we can discuss it more over (a $28) coffee at the collaboration meeting.  Features like the EventStore, modular high-level reconstruction factories, and intuitive design of analysis objects greatly enhanced productivity at CLEO.  I think you would have a hard time finding a person in CLEO who would tell you their analysis efforts were hindered by the software, which I consider a success.  (People expect software to work perfectly and only complain when it doesn't.)

