[Halld-offline] Data Challenge Meeting Minutes, July 16, 2012
David Lawrence
davidl at jlab.org
Wed Jul 18 10:05:02 EDT 2012
Hi Richard,
Thanks for this nice summary. I have a few comments:
1. It should be noted that the DNeutralShower_factory class currently
implements a duplicate mechanism for letting the user specify that the
KLOE algorithm be used as opposed to the default. The code as it stands
implements a configuration parameter BCALRECON:USE_KLOE, which defaults
to "1". Since it is a JANA configuration parameter, the user can change
it via the command line using:
-PBCALRECON:USE_KLOE=0
However, JANA has a built-in mechanism to do exactly this sort of thing
via the "DEFTAG:*" configuration parameter. Specifically, the user could
specify the following:
-PDEFTAG:DBCALShower=KLOE
Then *all* calls like:
loop->Get(locBCALShowers)
will use the KLOE-algorithm-built objects. This is important because if
another factory or event processor wants to use the DBCALShower objects,
it does not have to check a configuration parameter itself in order to
get the same objects used everywhere else in the code.
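For concreteness, the consumer side might look like the sketch below
(assuming the JANA event-loop API as used in sim-recon; the variable
name is hypothetical):

```cpp
// Consumer code stays agnostic about which algorithm made the showers.
// Run with -PDEFTAG:DBCALShower=KLOE and this same call returns the
// KLOE-tagged objects; with no override it returns the default-tag
// objects. No explicit BCALRECON:USE_KLOE check is needed here.
vector<const DBCALShower*> locBCALShowers;
loop->Get(locBCALShowers);
```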
2. Every data object inherits from JObject, which provides a GetTag()
method. Calling this for just the first object returned in a call to
loop->Get(...) will tell you the factory used to make all of the objects
in the list. (There is one usage scenario where this wouldn't be
strictly true and you'd need to query every object for its maker's tag,
but that would be highly unusual.) In other words, you should never have
to query a factory object directly to figure out how an object was made.
3. I have always envisioned that whenever a processed data file is
written out, a complete set of configuration parameters would be written
out with it. That is required to reproduce the results of a given job.
In fact, there is a command-line option available in all JANA programs,
--dumpconfig, that creates a text file with all configuration parameters
and their settings at the end of processing. The file is in a format
that can be read back in using --config=filename. This file could be put
in the header (or even better, in the first event). It would include all
of the DEFTAG:* settings.
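The round trip could look like the following (the program and file
names here are hypothetical placeholders; --dumpconfig and
--config=filename are the JANA options described above):

```shell
# First pass: dump all configuration parameters (including DEFTAG:*)
# to a text file at the end of processing.
hd_ana --dumpconfig input.hddm

# Later pass: re-run with identical settings read from that file.
hd_ana --config=jana.config input.hddm
```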
I agree with Mark too that having this info in both the file AND a
database is good. One is often
much more convenient than the other depending on what you're doing.
Regards,
-David
On 7/18/12 8:59 AM, Richard Jones wrote:
> Hello,
>
> From the minutes of the last meeting:
>
>> Action Items
>>
>> 1. Make a list of milestones.
>> 2. Do a micro-data challenge with jproj.pl. -> Mark
>> 3. Check in the final REST format. -> Richard
> The meeting was cut off before I could tell the whole story. It was
> completed, tested, and checked in before Monday's meeting. But with
> caveats.
>
> 1. final event size is 3.5kB (2.4kB on disk because of bzip2 compression)
> 2. bzip2 decomp overhead is negligible, but reference_trajectory
> reconstruction costs ~ 2ms (3.5 GHz i7) per track, 15 tracks
> average per 5pi,p event at 9 GeV.
> 3. when storing DBCALShower objects, there is the ambiguity of
> whether to use KLOE or default bcal clustering. Right now I query
> the DNeutralShower object for its USE_KLOE parameter, which is
> switchable using a command-line parameter. In case you are not
> familiar with the difference, there is just one DBCALShower object
> in dana, but jana objects carry "tags" which tell something about
> where they came from. Right now there is a factory that produces
> DBCALShower objects with the tag "KLOE", and another one that
> produces THE SAME objects using a different algorithm, with the
> default tag "".
> 4. when reading back DBCALShower objects, I will unpack them
> (agnostic as to how they were made) either as KLOE or default bcal
> clusters, depending on what the user asks for. This has the
> "feature" that the person who wrote the REST file might write KLOE
> clusters and the person who reads it might fetch them as default
> clusters. This might be the correct behavior. If not, there is
> another way to go: create separate lists in the REST format for
> each type of each object that one foresees, and have the writer
> only populate one of them. The overhead for doing this is
> negligible, but I think it is ugly. IMO, there is just one
> conceptual object known as a DBCALShower, but many ways to make
> it. Within the KLOE or default factories there are many internal
> settings that one might switch around. Are we going to try to
> save all of that information in the output file on an
> event-by-event basis? Why not put it in the database Mark was
> talking about, where a single entry can cover many events, even
> multiple files within a production run.
>
>
> -Richard J.
>
>
>
> On 7/17/2012 5:03 PM, Mark M. Ito wrote:
>> Folks,
>>
>> Find the meeting minutes below and at
>>
>>
>> https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_July_16,_2012#Minutes
>>
>> -- Mark
>> _____
>>
>> GlueX Data Challenge Meeting, July 16, 2012
>> Minutes
>>
>> Present:
>> * CMU: Paul Mattione, Curtis Meyer
>> * IU: Matt Shepherd
>> * JLab: Eugene Chudakov, Mark Ito, David Lawrence
>> * UConn: Richard Jones
>>
>> Announcements
>>
>> Mark reported that the Computer Center was given a heads-up that we
>> will start working on the data challenge. Batch jobs will start to
>> appear on the JLab farm. We will have a tape volume set that can be
>> recycled.
>>
>> Scope of this Challenge
>>
>> * We agreed to continue on the course of using Curtis's document as
>> a repository of our ideas about the data challenge (DC). Curtis
>> agreed to convert document 2031 into a Wiki page. The page is now
>> available as Planning for The Next GlueX Data Challenge.
>> * Curtis reminded us that we need a robust storage resource manager
>> (SRM) for the DC.
>> * Grid and JLab: Mark asked about whether we want to pursue
>> large-scale production on the Grid, at JLab, or both. We decided
>> to pursue both.
>> * Matt thought that a huge sample of Pythia data would be sufficient
>> to address main goals of the DC. Physics signals are of a smaller
>> scale and can be generated in a more ad hoc manner.
>> * We talked about December or January as a tentative time frame for
>> doing the first DC. In the future we will have to set up more
>> formal milestones.
>>
>> Mini-Data Challenges
>>
>> It looks like the work management system packages that were
>> recommended for us may not be appropriate for tracking jobs at JLab.
>> They are oriented toward a grid-based environment.
>>
>> Mark described a perl script, jproj.pl, he wrote to manage jobs on the
>> JLab farm for the PrimEx experiment. It uses a MySQL database to keep
>> track of jobs and output files and handles multiple job submission. It
>> is driven off a list of input files to be processed. Multiple,
>> simultaneous projects are supported. Some assumptions about filenames
>> and processing conventions are made to simplify the script. He will
>> use a modified version to get started processing multiple jobs at
>> JLab.
>>
>> Richard reminded us that his gridmake system offers similar
>> functionality. Mark agreed to look at it as a possible, more
>> sophisticated replacement for jproj.pl.
>>
>> At the last offline meeting Mark described the idea of doing multiple,
>> scheduled, medium-scale data challenges as a development environment
>> for the tools to do a large-scale DC. The idea is to expand scope as we
>> go from mini-DC to mini-DC, testing ideas as we go. There was a
>> consensus around following this approach, at least initially.
>>
>> Analysis System Design Plan
>>
>> Paul presented some classes for automating the selection of particle
>> combinations and doing kinematic fits on those combinations by
>> specifying the reaction to be studied when constructing the class. See
>> his wiki page for details. He has other related ideas which he will be
>> developing in the near future.
>>
>> Matt proposed that we start doing analysis with simple, individually
>> developed scripts and incrementally develop system(s) based on that
>> experience.
>>
>> Finalization of REST Format: record size and performance numbers
>>
>> Richard discussed the changes he made to REST format based on comments
>> from the last offline meeting and subsequent conversations with Paul.
>> He made the switch from using DChargedTrackHypothesis to
>> DTrackTimeBased. He sees a performance hit when the change is made due
>> to the need to swim a trajectory. The reconstitution rate is 30 Hz for
>> the events he studied. This compares with a reconstruction rate (from
>> raw hits) of 1.8 Hz. We agreed that the flexibility in analysis with
>> the new scheme was worth the extra processing time.
>>
>> Action Items
>>
>> 1. Make a list of milestones.
>> 2. Do a micro-data challenge with jproj.pl. -> Mark
>> 3. Check in the final REST format. -> Richard
>>
>> Retrieved from
>>
>> https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_July_16,_2012
>>
>>
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline
>
>
>