[Halld-offline] Data Challenge Meeting Minutes, July 16, 2012

Wed Jul 18 18:26:59 EDT 2012

Richard and David,

Now that Richard has convinced me that statelessness is to be respected, 
I will make the following point: if we put the config info into an 
opaque string, then there will be the temptation to pull back the 
curtain and use the information to influence the way we process the 
subsequent data. Just sayin'...

   -- Mark
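
One way to keep the curtain closed, as a minimal sketch only: wrap the string in a type that enforces the size cap and offers printing as its sole operation, so analysis code has nothing to parse. The class name OpaqueConfigNote and everything in it are hypothetical and not part of HDDM; the 30 kB cap is the figure Richard floats in the message quoted below.

```cpp
#include <cstddef>
#include <ostream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical wrapper, NOT part of HDDM: carries a config string as an
// opaque decoration with a hard size cap.
class OpaqueConfigNote {
public:
    // Assumed cap, taken from the "30kB or so" suggested in this thread.
    static constexpr std::size_t kMaxBytes = 30 * 1024;

    explicit OpaqueConfigNote(std::string text) : text_(std::move(text)) {
        if (text_.size() > kMaxBytes) {
            throw std::length_error("config note exceeds the 30 kB cap");
        }
    }

    // The only supported use: copy the note verbatim into a log.
    void print(std::ostream& out) const { out << text_ << '\n'; }

    // Size in bytes, e.g. for bookkeeping when writing the file.
    std::size_t size() const { return text_.size(); }

private:
    std::string text_;  // deliberately no accessor and no parser
};
```

An application would construct the note once when opening a file and call print() on its log stream. Nothing here stops a determined user from capturing the printed text and parsing it, so this only raises the bar; it does not remove the temptation.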

On 07/18/2012 04:16 PM, Richard Jones wrote:
> Hello list,
>
>>   I think storing the config info as a string either in the first 
>> event or somewhere else in the header of the file using the existing 
>> mechanisms already built into HDDM is still a good idea and not just 
>> because of redundancy with the database. 
>
> Ok, can do.  How about just an opaque string?  Would that do?  I would 
> want to limit its length to something, say 30kB or so. Would that make 
> sense?  The format and meaning of the string would be opaque to the 
> analysis framework, and one would not need to make the hddm stream 
> stateful.  It would be like a decoration that an application could 
> print out in its log, and a human reader could browse for reference, 
> but the individual analysis threads would not be expected to have 
> deterministic access to it, right?
>
> -Richard J.
>
>
> On 7/18/2012 3:34 PM, David Lawrence wrote:
>>
>> Hi Richard,
>>
>>   I think storing the config info as a string either in the first 
>> event or somewhere else in the header of the file using the existing 
>> mechanisms already built into HDDM is still a good idea and not just 
>> because of redundancy with the database. One could imagine having a 
>> file and wanting to extract the configuration used to make it in 
>> order to view or use it. I see this as being very analogous to how 
>> the HDDM schema is stored in the front of the HDDM file and there is 
>> a tool to pull it out if needed. So to answer your devil's advocate 
>> question with another: Why not put the hddm schema in a database and 
>> not keep it in the file?
>>
>> Regards,
>> -David
>>
>> On 7/18/12 3:09 PM, Richard Jones wrote:
>>> Hello,
>>>
>>> I have no objection to storing a string tag for each object, 
>>> representing the GetTag() string from jana.  That can be done either 
>>> on an event-by-event basis or globally.  Event-by-event should only 
>>> be adopted if the analysis can handle the situation where tags 
>>> switch dynamically within a job, or we want to store more than one 
>>> tag (say both default and "KLOE" bcal clusters) and let the user 
>>> decide which to use.  That would require changes to the current 
>>> DEventSourceREST.cc, but would be easy to do.  If tags are stored 
>>> globally, then the hddm system will ensure automatically that only 
>>> streams with the same tag strings get merged together as a result of 
>>> a skim or by hddm-cat.  It would also provide a better way for the 
>>> danarest plugin to decide which tag to use for each output object, 
>>> instead of the provisional way I am handling it right now for 
>>> DBCALShower objects, which David points out is incorrect in some cases.
>>>
>>> As to the idea of flooding the REST file header with analysis 
>>> qualifiers, that is not something that hddm can do right now. I 
>>> could add the capability, but I question why.  The only function of 
>>> the hddm header, as currently conceived, is to document to the hddm 
>>> toolkit how to unpack the event data and what their meaning and 
>>> relationships are. That is all it does.  It is not a place to record 
>>> random comments like the name of the application that wrote the 
>>> file, or the command line switches.  User code does not normally 
>>> even access the header, it is just handled by the hddm library.  So 
>>> at present, storing runconfig-type information would require adding 
>>> special events to the stream, AND the huge change of making hddm 
>>> streams stateful....
>>>
>>> Just like root trees, hddm streams are designed to be stateless. This is 
>>> an important design feature that I am not eager to concede. Think 
>>> about trying to stick config-type information into a root tree, and 
>>> then analyze it with a TSelector on PROOF.  You are going to have to 
>>> do major gymnastics to get that information to every analysis 
>>> session that gets started to run your job.  Building single-threaded 
>>> concepts like this into the analysis sounds like we are still 
>>> working like we did 20 years ago.
>>>
>>> It was not my original intent to embed metadata about the conditions 
>>> of the production inside the file, because I want later to be able 
>>> to string these events together and create skims.  In general I want 
>>> to avoid "stateful" streams in hddm, relying instead on the global 
>>> keys like runnumber,eventnumber to reference database records for 
>>> this information, similar to how root trees work.  By keeping the 
>>> streams stateless I avoid all kinds of ordering and synchronization 
>>> issues.  A related issue is the "skip to event NNN" action, which is 
>>> very fast in hddm because you don't have to read in every event. 
>>> Imagine a sparse skim, which in the limit would consist of one state 
>>> record for every event.  Do I stop and check every time I hit a 
>>> state record, do a bit-by-bit comparison with the current state, and 
>>> throw an exception on incompatible changes?  Much better to check 
>>> that compatibility ahead of time, from a database lookup, wouldn't it?
>>>
>>> If we want to make sure you don't lose the metadata, why not store 
>>> them in two separate databases, or promote the database to a higher 
>>> data security level?  We are not going to be able to analyze a REST 
>>> file without access to a database (required to re-swim reference 
>>> trajectories).  Playing devil's advocate, why are we not storing the 
>>> magnetic field map in the event file?  Without the magnetic field we 
>>> cannot get back the original DTrackTimeBased objects.
>>>
>>> -Richard J.
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Halld-offline mailing list
>>> Halld-offline at jlab.org
>>> https://mailman.jlab.org/mailman/listinfo/halld-offline
>
