[Halld-offline] Storage and Analysis of Reconstructed Events
Matthew Shepherd
mashephe at indiana.edu
Sat Feb 11 19:22:34 EST 2012
Hi Eugene,
We did take data at different center-of-mass energies. Event sizes and data set sizes depend somewhat on the energy and backgrounds, so the numbers below are probably the right order of magnitude; more precise numbers are obtainable but would take some work.
For the three major center-of-mass energies in the CLEO-c program, the total number of raw events acquired was probably around 2E9. Raw event sizes vary (due to track multiplicity), but the average is about 10 kB per event, which means about 20 TB of raw data. Analyzed event size is in the neighborhood of 50 kB per event. The "skim" at the end, which is what someone doing analysis would use, is in the 2-5 kB per event range. Skimmed Monte Carlo is at least twice as big per event because it has to hold all of the generated information. The full data and some very large inclusive MC skims add up to tens of TB.
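As a quick back-of-envelope check using those rough per-event sizes (the values below are just the averages quoted above, not precise figures):

    #include <cstdio>

    int main() {
        // Rough CLEO-c totals quoted above; order-of-magnitude only.
        const double nEvents = 2e9;   // raw events across the major energies
        const double rawKB   = 10.0;  // average raw event size (kB)
        const double skimKB  = 3.5;   // middle of the 2-5 kB skim range (kB)
        const double kBperTB = 1e9;   // 1 TB = 1e12 bytes = 1e9 kB (decimal units)

        std::printf("raw data:  ~%.0f TB\n", nEvents * rawKB  / kBperTB); // ~20 TB
        std::printf("data skim: ~%.0f TB\n", nEvents * skimKB / kBperTB); // ~7 TB
        // Skimmed MC is at least 2x per event, so data plus the large
        // inclusive MC skims land in the tens-of-TB neighborhood quoted above.
        return 0;
    }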
It is only these skims that users would use for analysis. The raw-data skims were sorted into very general event types (mainly to weed out QED events that had no hadrons in the final state). The EventStore would then deliver each event in such a collection, at a specified center-of-mass energy, to the analysis job.
For the analysis you suggest, we had a "DTag" process that would make a pass through roughly 1E9 reconstructed and "skimmed" 5 kB events collected at a center-of-mass energy of 3770 MeV in order to reconstruct many possible D decay modes. (This is a good example of a high-level user analysis job written in the same framework that was used to do reconstruction.) This process (a "factory", if you want to call it that) would create an object (stored in a file) that contained only the D decay information -- almost all D analyses used this output in conjunction with the skim data to do subsequent analyses.
The D tag skim provided run and event numbers of events that contained a D, along with the supplemental D reconstruction information. The framework could simultaneously deliver, with each event, the remainder of the data that was in the high-level reconstruction skim. It's nice because you get a reduced number of events with enhanced information without the cost of duplicating the initial skim output. I don't have the exact number, but this subset of events that contained a D is probably around 1% of the large skim sample, maybe ten million events. The supplemental D decay info is at the level of hundreds of GB.
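To make the mechanics concrete, here is a rough stand-alone sketch of the kind of (run, event) synchronization the framework did for us automatically; the type names are made up for illustration and are not the actual CLEO or JANA classes:

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // Hypothetical supplemental D-tag record, keyed by (run, event).
    struct DTagInfo {
        int    decayMode;   // e.g. D0 -> K- pi+
        double mBC;         // beam-constrained mass
        double deltaE;      // E(D) - E(beam)
    };

    // Stand-in for one event's worth of high-level skim data.
    struct SkimEvent {
        std::uint32_t run;
        std::uint32_t event;
        // ... tracks, showers, etc. from the reconstruction skim
    };

    using RunEvent = std::pair<std::uint32_t, std::uint32_t>;

    // Deliver skim events together with their D-tag info, silently skipping
    // the ~99% of skim events that carry no tag.
    void processTaggedEvents(const std::vector<SkimEvent>& skim,
                             const std::map<RunEvent, DTagInfo>& tags) {
        for (const SkimEvent& evt : skim) {
            auto it = tags.find({evt.run, evt.event});
            if (it == tags.end()) continue;      // no D tag for this event
            const DTagInfo& tag = it->second;
            // ... user analysis sees both the full skim event and 'tag' here
            (void)tag;
        }
    }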
For the D Dalitz analysis, one would need to write an analysis job that ran over this skim, picked out the particular decay mode, and filled the Dalitz-plot variables into some ntuple for analysis. The great thing is that someone doing an entirely different analysis with D's could easily reuse the skim. (In many cases people were interested in the "other D" -- there are two per event -- and the tag is used to ensure that another D exists and to constrain its four-momentum.)
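For reference, the Dalitz-plot variables themselves are nothing exotic: given the daughter four-momenta from the tagged candidate, they are just two pairwise invariant masses squared. A minimal sketch with plain structs and made-up momenta (not the real analysis objects):

    #include <cstdio>

    // Minimal four-vector; a real job would use the framework's classes.
    struct P4 {
        double E, px, py, pz;
        P4 operator+(const P4& o) const { return {E + o.E, px + o.px, py + o.py, pz + o.pz}; }
        double m2() const { return E*E - px*px - py*py - pz*pz; }   // invariant mass squared
    };

    int main() {
        // Hypothetical daughters of a tagged D0 -> K- pi+ pi0 candidate (GeV).
        P4 kaon   = {0.731,  0.300,  0.200,  0.400};
        P4 piPlus = {0.463, -0.250,  0.350,  0.100};
        P4 piZero = {0.643, -0.050, -0.550, -0.300};

        // The two Dalitz-plot variables: invariant masses squared of daughter pairs.
        double m2_Kpi  = (kaon + piPlus).m2();
        double m2_pipi = (piPlus + piZero).m2();

        // In the real analysis job these get filled into an ntuple for fitting.
        std::printf("m2(K- pi+) = %.3f GeV^2, m2(pi+ pi0) = %.3f GeV^2\n", m2_Kpi, m2_pipi);
        return 0;
    }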
While event size and detector complexity are probably comparable to GlueX, my instinct is that the overall CLEO event counts are probably down by a couple of orders of magnitude. However, I don't see any obvious trouble scaling the CLEO-style approach up to GlueX. The centralized skim database makes it very easy for someone to come in and target the particular types of events they are interested in analyzing.
Hope this provides some of the info you were looking for,
Matt
On Feb 11, 2012, at 3:02 PM, Eugene Chudakov wrote:
> Hi Matt,
>
> A few questions concerning CLEO:
>
> I suppose there were several experiments at different energies. Let
> us assume you were interested in a Dalitz plot of a common decay of a
> D, and used all the data where this D could be identified. How many
> events of what average size were involved in the analysis at different
> stages: event reconstruction, filtering, final analysis?
>
> Thanks,
> Eugene
>
> On Sat, 11 Feb 2012, Matthew Shepherd wrote:
>
>>
>> Hi all,
>>
>> We agreed at the offline meeting last week that there is a need to discuss storage schemes for data that has been reconstructed. I think Paul suggested starting this discussion by email… so I'll give that a go. Below are some salient features of the system we knew and loved at CLEO as they relate to some of the design decisions we need to make. I know others (David and Richard in particular) have probably thought a lot about this, so I'm eager to hear their input.
>>
>> We first have to decide what the starting point is for someone who wants to do an analysis, and I think the JANA framework should be the starting point for "end user" physics analysis. This is not something we want to attempt with a separate script or stand-alone code. Here are some reasons:
>>
>> * JANA promotes modular code, which can greatly enhance analysis productivity. For example, one can write a factory to provide reconstructed eta' candidates through all of their various decay channels, and that factory can then be shared by anyone doing an eta' analysis. (This feature provided huge productivity gains in CLEO; see the sketch after this list.)
>> * It is likely that some "post-reconstruction" database information will be needed. One can imagine wanting to know beam conditions, tweak calibrations, etc. JANA is really set up for this.
>> * People will want to do kinematic fits that are particular to their analysis or topology or known backgrounds. The JANA framework and objects are easily set up for that.
>> * Finally, JANA will be familiar and robust -- better to learn and debug one system rather than many individual analysis systems.
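>>
>> To make that first point concrete, here is a rough, self-contained sketch of the factory pattern: one shared piece of code builds eta' candidates that any analysis can then request. This is plain C++ with made-up types to show the pattern, not the actual JANA API or GlueX class names.
>>
>>   #include <algorithm>
>>   #include <cmath>
>>   #include <cstddef>
>>   #include <vector>
>>
>>   // Illustrative only: the *pattern* of a shared reconstruction factory,
>>   // not real JANA or GlueX classes.
>>   struct P4 {
>>       double E, px, py, pz;
>>       P4 operator+(const P4& o) const { return {E + o.E, px + o.px, py + o.py, pz + o.pz}; }
>>       double m() const { return std::sqrt(std::max(0.0, E*E - px*px - py*py - pz*pz)); }
>>   };
>>
>>   struct EtaPrimeCandidate { P4 p4; };
>>
>>   // Any analysis needing eta' -> eta(gamma gamma) pi+ pi- candidates asks this
>>   // one factory for them instead of rewriting the combinatorics itself.
>>   class EtaPrimeFactory {
>>   public:
>>       std::vector<EtaPrimeCandidate>
>>       make(const std::vector<P4>& photons,
>>            const std::vector<P4>& piPlus,
>>            const std::vector<P4>& piMinus) const {
>>           std::vector<EtaPrimeCandidate> out;
>>           for (std::size_t i = 0; i < photons.size(); ++i)
>>               for (std::size_t j = i + 1; j < photons.size(); ++j) {
>>                   P4 eta = photons[i] + photons[j];
>>                   if (std::fabs(eta.m() - 0.548) > 0.030) continue;          // loose eta window
>>                   for (const P4& pip : piPlus)
>>                       for (const P4& pim : piMinus) {
>>                           P4 etap = eta + pip + pim;
>>                           if (std::fabs(etap.m() - 0.958) > 0.050) continue; // loose eta' window
>>                           out.push_back({etap});
>>                       }
>>               }
>>           return out;
>>       }
>>   };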
>>
>> There may be other reasons, but in my opinion we should toss out any starting point that requires the user to begin with some list of four-vectors and error matrices in a file on disk. We should deliver the data to the end user in the context of JANA, which has nice objects and relationships that are set up to do the analysis tasks the user wants to do. This doesn't mean we can't also provide a canned analysis package that just dumps four-vectors to a ROOT file for quick-and-dirty plotting. We should try to make it as easy as possible for the user to write a JANA processor or plugin or whatever to do their analysis job. The results of this analysis job are then written to some ntuple or histogram for fitting, cut tuning, etc.
>>
>> ** File Content **
>>
>> The above requirement means that we need a file that is formatted to contain all of the relevant information needed to reconstitute "high-level" analysis objects in JANA. I imagine this means storing, for example, the showers, Kalman fitting output, etc., and then using those to rebuild objects in the PID library like DNeutralTrack, DTwoGammaFit (for pi^0s), and DChargedTrack. It is these high-level objects that can then be combined to form particles or kinematically fit by the user.
>>
>> In CLEO we stored the bulk of the data in a binary format that the collaboration developed. The exact format is probably irrelevant, but the system had a few nice features. Any object in the analysis framework was storable if one wrote a "storage helper" that described how to pack and unpack it. (The system did not require every object to be storable -- in fact, some objects we explicitly did not want the user to be able to store.) When we did reconstruction (or if a user wanted to do some specialized skim), a simple command like:
>>
>> output some_file.out { DNeutralTrack, DChargedTrack, …}
>>
>> would store a whole list of objects to the file. Of course, there was a standard list for the initial reconstruction pass that did not include things like raw hits. If one needed a specialized file for drift chamber calibration, for example, one just altered the list of output objects to include the raw hits. There was an ASCII header at the top of each file describing the objects (like HDDM). This is very useful! A simple 'more' or 'head' tells you exactly what is in the file.
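>>
>> Roughly, a "storage helper" boils down to a pack/unpack contract per type, something like the sketch below. The types and the naive byte copy are made up just to show the shape of the interface; this is not the real CLEO serialization code.
>>
>>   #include <cstring>
>>   #include <vector>
>>
>>   // Made-up example type; in practice these were the framework's analysis objects.
>>   struct ChargedTrack { double px, py, pz; int charge; };
>>
>>   // Writing a specialization of this helper for a type is what made that
>>   // type "storable"; types without a helper simply could not be written.
>>   template <typename T>
>>   struct StorageHelper;   // no generic implementation on purpose
>>
>>   template <>
>>   struct StorageHelper<ChargedTrack> {
>>       static void pack(const ChargedTrack& t, std::vector<char>& buf) {
>>           const char* p = reinterpret_cast<const char*>(&t);
>>           buf.insert(buf.end(), p, p + sizeof(ChargedTrack));
>>       }
>>       static ChargedTrack unpack(const char* data) {
>>           ChargedTrack t;
>>           std::memcpy(&t, data, sizeof(ChargedTrack));
>>           return t;
>>       }
>>   };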
>>
>> In practice, most analyzers just read in the standard compressed reconstruction data, but the functionality was there to make all the raw data available for "experts." In fact, the system could read from and write to multiple files simultaneously with different lists of objects. So in one pass one could write a "summary" file and a separate file that contained raw data for 1 of every 1000 events. The former is the only one an analyzer needs. But an expert could go back, load both files, synchronize on the sparse file that contained the raw data, and read in the raw data along with every 1000th event from the summary data file.
>>
>> ** File Indexing **
>>
>> I highly recommend reading the following article about the EventStore, which, in principle, can be used directly with HDDM, ROOT, or evio:
>>
>> http://www.lepp.cornell.edu/~cdj/publications/conferences/CHEP04/EventStore.pdf
>>
>> Having random-access capability with data is incredibly valuable. Many of the physics channels in GlueX will be relatively sparse -- remember that our events will be dominated by diffractive vector meson production. Something like the EventStore would allow us to develop sets of sparse, overlapping skims according to various criteria without actually replicating events. To go with the eta' example above, one can imagine an eta' skim in which the EventStore is used to hop through all reconstructed events that have an eta' and access the data provided by some eta' reconstruction factory. These skim choices are flexible and can evolve as our physics analysis goals evolve. They are just collections of the same reconstructed events on disk. If one wants to use the EventStore to write out a (new) skimmed file to export offsite for further analysis, that would also be trivial to do.
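>>
>> The key idea is just an index: a named skim is a list of (run, event) keys plus enough location information to seek straight to each event, so overlapping skims cost almost nothing. Here is a toy version of the lookup, not the actual EventStore schema:
>>
>>   #include <cstdint>
>>   #include <map>
>>   #include <string>
>>   #include <utility>
>>   #include <vector>
>>
>>   // Toy index: each named skim lists the events it contains, and each event
>>   // key points at the single shared copy of that event on disk.
>>   struct EventLocation {
>>       std::string   file;     // which reconstruction output file
>>       std::uint64_t offset;   // byte offset (or record number) within it
>>   };
>>
>>   using RunEvent = std::pair<std::uint32_t, std::uint32_t>;
>>
>>   struct ToyEventStore {
>>       std::map<std::string, std::vector<RunEvent>> skims;      // "etaprime", "dtag", ...
>>       std::map<RunEvent, EventLocation>            locations;  // one entry per stored event
>>
>>       // Visit every event in a skim without duplicating any event data:
>>       // the same EventLocation may be reachable from many skims.
>>       template <typename Callback>
>>       void forEach(const std::string& skimName, Callback cb) const {
>>           auto s = skims.find(skimName);
>>           if (s == skims.end()) return;
>>           for (const RunEvent& key : s->second) {
>>               auto loc = locations.find(key);
>>               if (loc != locations.end()) cb(key, loc->second);   // seek + read happens here
>>           }
>>       }
>>   };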
>>
>> ** Getting data from disk or from an algorithm **
>>
>> The CLEO framework had the idea that data could come either from a factory or from a source (the EventStore, for example). If a factory was provided, then the source was ignored. The initial pass enabled all the core reconstruction factories and wrote out the files. The subsequent analysis pass (done by the user) did not load any of the core reconstruction factories, so data was provided only from the files and the high-level factories. This scheme had the very useful feature that, in principle, some parts of the reconstruction could be redone at analysis time by simply loading the appropriate factories, provided that the data needed by those factories was also available. In practice, this enabled analyses or studies "on the fringe" that would otherwise have been dismissed as impossible. It wasn't the mainstream mode of operation (you don't want every analysis using specialized reconstruction routines), but it was a feature that was incredibly useful at times.
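>>
>> The resolution rule is simple to state: when an analysis asks for a type of data, a loaded factory for that type wins; otherwise the framework falls back to whatever the file (source) provides. Schematically, in made-up code rather than the real framework:
>>
>>   #include <functional>
>>   #include <map>
>>   #include <string>
>>   #include <utility>
>>   #include <vector>
>>
>>   // Schematic answer to "where does this data come from?": a loaded factory
>>   // overrides the file; otherwise the stored reconstruction output is used.
>>   struct DataBlock { std::vector<char> bytes; };
>>
>>   class EventRecord {
>>   public:
>>       using Factory = std::function<DataBlock()>;
>>
>>       void loadFactory(const std::string& type, Factory f)   { factories_[type] = std::move(f); }
>>       void addFromFile(const std::string& type, DataBlock d) { fromFile_[type]  = std::move(d); }
>>
>>       // Analysis code calls this and never needs to know whether the objects
>>       // were re-made on the fly or read back from the reconstruction pass.
>>       DataBlock get(const std::string& type) const {
>>           auto f = factories_.find(type);
>>           if (f != factories_.end()) return f->second();   // factory wins over the file
>>           return fromFile_.at(type);                       // otherwise use stored data
>>       }
>>
>>   private:
>>       std::map<std::string, Factory>   factories_;
>>       std::map<std::string, DataBlock> fromFile_;
>>   };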
>>
>> For example, suppose one is studying a channel where the tracking algorithm was not quite optimized for some feature of the channel at the time of reconstruction. Say it has multiple displaced vertices, low-momentum tracks, etc. If one can do a skim to select the fraction of events with these characteristics, then an improved tracking algorithm can be loaded, the raw data connected (via the EventStore), and tracking redone easily on only those events without changing any of the subsequent analysis code. In practice, this probably involves staging raw data, which is some work, but the alternative (reprocessing the whole data set) may be impossible.
>>
>>
>> I'm sure others will add to this, and we can discuss it more over (a $28) coffee at the collaboration meeting. Features like the EventStore, modular high-level reconstruction factories, and the intuitive design of analysis objects greatly enhanced productivity at CLEO. I think you would have a hard time finding a person in CLEO who will tell you their analysis efforts were hindered by the software, which I consider to be a success. (People expect software to work perfectly and only complain when it doesn't.)
>>
>> Matt
>>
>>
>>