[Halld-offline] Data Challenge Meeting Minutes, July 16, 2012
Richard Jones
richard.t.jones at uconn.edu
Wed Jul 18 08:59:17 EDT 2012
Hello,
From the minutes of the last meeting:
> Action Items
>
> 1. Make a list of milestones.
> 2. Do a micro-data challenge with jproj.pl. -> Mark
> 3. Check in the final REST format. -> Richard
The meeting was cut off before I could tell the whole story. The final REST format was completed, tested, and checked in before Monday's meeting, but with caveats:
1. The final event size is 3.5 kB (2.4 kB on disk after bzip2 compression).
2. The bzip2 decompression overhead is negligible, but reference_trajectory reconstruction costs ~2 ms per track (3.5 GHz i7), with 15 tracks on average per 5pi,p event at 9 GeV.
3. When storing DBCALShower objects, there is an ambiguity: whether to use KLOE or default bcal clustering. Right now I query the DNeutralShower object for its USE_KLOE parameter, which is switchable via a command-line parameter. In case you are not familiar with the difference: there is just one DBCALShower class in dana, but jana objects carry "tags" that tell something about where they came from. Right now there is a factory that produces DBCALShower objects with the tag "KLOE", and another that produces THE SAME objects using a different algorithm, with the default tag "".
4. When reading back DBCALShower objects, I will unpack them (agnostic as to how they were made) either as KLOE or default bcal clusters, depending on what the user asks for. This has the "feature" that the person who wrote the REST file might write KLOE clusters while the person who reads it fetches them as default clusters. This might be the correct behavior. If not, there is another way to go: create separate lists in the REST format for each foreseen type of each object, and have the writer populate only one of them. The overhead for doing this is negligible, but I think it is ugly. IMO there is just one conceptual object known as a DBCALShower, but many ways to make it. Within the KLOE or default factories there are many internal settings that one might switch around. Are we going to try to save all of that information in the output file on an event-by-event basis? Why not put it in the database Mark was talking about, where a single entry can cover many events, even multiple files within a production run.
-Richard J.
On 7/17/2012 5:03 PM, Mark M. Ito wrote:
> Folks,
>
> Find the meeting minutes below and at
>
>
> https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_July_16,_2012#Minutes
>
> -- Mark
> _____
>
> GlueX Data Challenge Meeting, July 16, 2012
> Minutes
>
> Present:
> * CMU: Paul Mattione, Curtis Meyer
> * IU: Matt Shepherd
> * JLab: Eugene Chudakov, Mark Ito, David Lawrence
> * UConn: Richard Jones
>
> Announcements
>
> Mark reported that the Computer Center was given a heads-up that we
> will start working on the data challenge. Batch jobs will start to
> appear on the JLab farm. We will have a tape volume set that can be
> recycled.
>
> Scope of this Challenge
>
> * We agreed to continue on the course of using Curtis's document as
> a repository of our ideas about the data challenge (DC). Curtis
> agreed to convert document 2031 into a Wiki page. The page is now
> available as Planning for The Next GlueX Data Challenge.
> * Curtis reminded us that we need a robust storage resource manager
> (SRM) for the DC.
> * Grid and JLab: Mark asked about whether we want to pursue
> large-scale production on the Grid, at JLab, or both. We decided
> to pursue both.
> * Matt thought that a huge sample of Pythia data would be sufficient
> to address main goals of the DC. Physics signals are of a smaller
> scale and can be generated in a more ad hoc manner.
> * We talked about December or January as a tentative time frame for
> doing the first DC. In the future we will have to set up more
> formal milestones.
>
> Mini-Data Challenges
>
> It looks like the work management system packages that were
> recommended for us may not be appropriate for tracking jobs at JLab.
> They are oriented toward a grid-based environment.
>
> Mark described a perl script, jproj.pl, he wrote to manage jobs on the
> JLab farm for the PrimEx experiment. It uses a MySQL database to keep
> track of jobs and output files and handles multiple job submission. It
> is driven off a list of input files to be processed. Multiple,
> simultaneous projects are supported. Some assumptions about filenames
> and processing conventions are made to simplify the script. He will
> use a modified version to get started processing multiple jobs at
> JLab.
>
> Richard reminded us that his gridmake system offers similar
> functionality. Mark agreed to look at it as a possible, more
> sophisticated replacement for jproj.pl.
>
> At the last offline meeting Mark described the idea of doing multiple,
> scheduled, medium-scale data challenges as a development environment
> for the tools needed to do a large-scale DC. The idea is to expand scope as we
> go from mini-DC to mini-DC, testing ideas as we go. There was a
> consensus around following this approach, at least initially.
>
> Analysis System Design Plan
>
> Paul presented some classes for automating the selection of particle
> combinations and doing kinematic fits on those combinations by
> specifying the reaction to be studied when constructing the class. See
> his wiki page for details. He has other related ideas which he will be
> developing in the near future.
>
> Matt proposed that we start doing analysis with simple, individually
> developed scripts and incrementally develop system(s) based on that
> experience.
>
> Finalization of REST Format: record size and performance numbers
>
> Richard discussed the changes he made to REST format based on comments
> from the last offline meeting and subsequent conversations with Paul.
> He made the switch from using DChargedTrackHypothesis to
> DTrackTimeBased. He sees a performance hit when the change is made due
> to the need to swim a trajectory. The reconstitution rate is 30 Hz for
> the events he studied. This compares with a reconstruction rate (from
> raw hits) of 1.8 Hz. We agreed that the flexibility in analysis with
> the new scheme was worth the extra processing time.
>
> Action Items
>
> 1. Make a list of milestones.
> 2. Do a micro-data challenge with jproj.pl. -> Mark
> 3. Check in the final REST format. -> Richard
>
> Retrieved from
>
> "https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_July_16,_2012"
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline