[Halld-offline] Data Challenge Meeting Minutes, July 16, 2012
David Lawrence
davidl at jlab.org
Wed Jul 18 15:34:04 EDT 2012
Hi Richard,
I think storing the config info as a string, either in the first event
or elsewhere in the header of the file using the existing mechanisms
already built into HDDM, is still a good idea, and not just for
redundancy with the database. One could imagine having a file
and wanting to extract the configuration used to make it in order to
view or use it. I see this as being very analogous to how the HDDM
schema is stored in the front of the HDDM file and there is a tool to
pull it out if needed. So to answer your devil's advocate question with
another: Why not put the hddm schema in a database and not keep it in
the file?
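As a conceptual sketch of this point (plain Python, and emphatically NOT the real HDDM binary layout; all names here are invented for illustration): the configuration string travels at the front of the file, and a small tool can pull it back out without ever reading an event payload, just as the HDDM schema can be extracted today.

```python
# Illustrative only -- not HDDM's actual format.  A length-prefixed config
# string sits at the front of the file, followed by the events.
import io
import struct

def write_file(buf, config, events):
    """Write a length-prefixed config string, then the event payloads."""
    hdr = config.encode()
    buf.write(struct.pack(">I", len(hdr)))
    buf.write(hdr)
    for ev in events:
        data = ev.encode()
        buf.write(struct.pack(">I", len(data)))
        buf.write(data)

def extract_config(buf):
    """Read back only the header; the event payloads are never touched."""
    buf.seek(0)
    (n,) = struct.unpack(">I", buf.read(4))
    return buf.read(n).decode()

buf = io.BytesIO()
write_file(buf, "hdgeant: BFIELD=solenoid RNDM=12345", ["ev1", "ev2"])
print(extract_config(buf))  # prints: hdgeant: BFIELD=solenoid RNDM=12345
```

The point is only that "config at the front of the file" and "tool to pull it out" cost nothing at analysis time.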
Regards,
-David
On 7/18/12 3:09 PM, Richard Jones wrote:
> Hello,
>
> I have no objection to storing a string tag for each object,
> representing the GetTag() string from jana. That can be done either
> on an event-by-event basis or globally. Event-by-event should only be
> adopted if the analysis can handle the situation where tags switch
> dynamically within a job, or we want to store more than one tag (say
> both default and "KLOE" bcal clusters) and let the user decide which
> to use. That would require changes to the current
> DEventSourceREST.cc, but would be easy to do. If tags are stored
> globally, then the hddm system will ensure automatically that only
> streams with the same tag strings get merged together as a result of a
> skim or by hddm-cat. It would also provide a better way for the
> danarest plugin to decide which tag to use for each output object,
> instead of the provisional way I am handling it right now for
> DBCALShower objects, which David points out is incorrect in some cases.
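As a toy model of the globally-stored-tag rule described above (plain Python, not jana or hddm-cat code; the dict layout is invented for illustration), a merge would simply refuse streams whose tag metadata disagrees:

```python
# Toy model: each "stream" carries one global tag string; merging (the
# hddm-cat analogue) succeeds only when all tags agree.
def merge_streams(streams):
    """Concatenate events from streams that share a single tag string."""
    tags = {s["tag"] for s in streams}
    if len(tags) != 1:
        raise ValueError("refusing to merge streams with tags %s" % sorted(tags))
    return {"tag": tags.pop(),
            "events": [ev for s in streams for ev in s["events"]]}

a = {"tag": "KLOE", "events": [1, 2]}
b = {"tag": "KLOE", "events": [3]}
print(merge_streams([a, b])["events"])  # [1, 2, 3]
```

In the real system this check would live in the stream header handling, so user code never sees an inconsistently tagged file.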
>
> As to the idea of flooding the REST file header with analysis
> qualifiers, that is not something that hddm can do right now. I could
> add the capability, but I question why. The only function of the hddm
> header, as currently conceived, is to document to the hddm toolkit how
> to unpack the event data and what their meaning and relationships are.
> That is all it does. It is not a place to record random comments like
> the name of the application that wrote the file, or the command line
> switches. User code does not normally even access the header, it is
> just handled by the hddm library. So at present, storing
> runconfig-type information would require adding special events to the
> stream, AND the huge change of making hddm streams stateful....
>
> Just like root trees, hddm streams are designed to be stateless. This is
> an important design feature that I am not eager to concede. Think
> about trying to stick config-type information into a root tree, and
> then analyze it with a TSelector on PROOF. You are going to have to
> do major gymnastics to get that information to every analysis session
> that gets started to run your job. Building single-threaded concepts
> like this into the analysis sounds like we are still working like we
> did 20 years ago.
>
> It was not my original intent to embed metadata about the conditions
> of the production inside the file, because I want later to be able to
> string these events together and create skims. In general I want to
> avoid "stateful" streams in hddm, relying instead on the global keys
> like runnumber,eventnumber to reference database records for this
> information, similar to how root trees work. By keeping the streams
> stateless I avoid all kinds of ordering and synchronization issues. A
> related issue is the "skip to event NNN" action, which is very fast in
> hddm because you don't have to read in every event. Imagine a sparse
> skim, which in the limit would consist of one state record for every
> event. Do I stop and check every time I hit a state record, do a
> bit-by-bit comparison with the current state, and throw an exception
> on incompatible changes? It would be much better to check that
> compatibility ahead of time, with a database lookup, wouldn't it?
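A hedged sketch of the stateless-skip argument (plain Python with an invented length-prefixed layout, not HDDM's real encoding): when every event is self-contained, "skip to event NNN" only hops over length words and never decodes a payload, whereas interleaved state records would force a compatibility check at every hop.

```python
# Illustrative only: self-contained, length-prefixed events allow fast
# skipping because only the 4-byte length words are ever read.
import io
import struct

def write_events(buf, payloads):
    """Write each payload with a 4-byte big-endian length prefix."""
    for p in payloads:
        data = p.encode()
        buf.write(struct.pack(">I", len(data)))
        buf.write(data)

def skip_to(buf, n):
    """Position the stream at event n, reading only the length words."""
    buf.seek(0)
    for _ in range(n):
        (size,) = struct.unpack(">I", buf.read(4))
        buf.seek(size, 1)  # hop over the payload without decoding it

def read_event(buf):
    (size,) = struct.unpack(">I", buf.read(4))
    return buf.read(size).decode()

buf = io.BytesIO()
write_events(buf, ["event-%d" % i for i in range(1000)])
skip_to(buf, 750)
print(read_event(buf))  # prints: event-750
```

With state records in the stream, `skip_to` could no longer blindly seek; each record would have to be read and compared against the current state, which is exactly the cost the stateless design avoids.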
>
> If we want to make sure we don't lose the metadata, why not store
> it in two separate databases, or promote the database to a higher
> data-security level? We are not going to be able to analyze a REST
> file without access to a database (required to re-swim reference
> trajectories). Playing devil's advocate, why are we not storing the
> magnetic field map in the event file? Without the magnetic field we
> cannot get back the original DTrackTimeBased objects.
>
> -Richard J.
>
>
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline