[Halld-offline] [Halld-physics] Next Data Challenge...
David Lawrence
davidl at jlab.org
Thu Jul 5 16:40:28 EDT 2012
Hi Mark et al.,
I haven't heard anything on this since the Physics meeting, but
wanted to give a response.
I think we agree that there are several things that could be addressed
in the data challenge, and that everyone seems to have a different idea
of what's important and what is either low risk or otherwise deferrable.
Having said that, I do have to disagree with your point 2) below. In the
actual experiment we will have two distinct activities: Simulation and
Reconstruction. The Simulation activity will also include reconstruction
of the simulated data in the same farm job. The Reconstruction activity
will use only real data that is obtained by a different mechanism. I
elaborate by listing a few differences between the two activities below.
Simulation
---------------
a. Does not require coupling to Jasmine (tape silo) for input
b. Jobs are identical (the whole set can be submitted with a single job
script specifying MULTI_JOBS; see the sketch below)
c. Identical jobs mean failed jobs can be replaced by submitting a new
batch (i.e. no need to resubmit a job targeted at a specific set of files)
d. Single-threaded simulation encourages use of single-threaded
reconstruction (CPU=1)
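
For concreteness, here is roughly what the single simulation job script
could look like. This is only a sketch: the project/track/command names
and resource values are placeholders, and the keyword list should be
checked against the current Auger documentation. Each instance of the
command script would use the per-job identifier Auger provides to give
its output files unique names before they are written out.

PROJECT: gluex
TRACK: simulation
JOBNAME: dc1_sim
COMMAND: run_sim_and_recon.csh
CPU: 1
MEMORY: 2 GB
MULTI_JOBS: 1000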
Reconstruction
---------------
a. Requires coupling to Jasmine for input
b. Jobs are not identical (each one targets a specific set of input
files, so a separate job script must be generated and submitted for
each; see the sketch below)
c. Failed jobs must be resubmitted individually (requires a detailed
logging system)
d. Reconstruction can be run multi-threaded (CPU=32)
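
As an illustration of point b, a small script could generate one job
script per set of input files. This is a hypothetical sketch only: the
run list, directory layout, and command names are made up, and I'm
assuming the usual convention that listing /mss paths as input is what
triggers staging from Jasmine.

#!/usr/bin/env python
# Hypothetical sketch: write one Auger job script per set of raw-data
# files. Runs, paths, and commands are invented for illustration.

RUNS = {
    9001: ["/mss/halld/rawdata/run9001/raw9001_000.evio",
           "/mss/halld/rawdata/run9001/raw9001_001.evio"],
    9002: ["/mss/halld/rawdata/run9002/raw9002_000.evio"],
}

TEMPLATE = """PROJECT: gluex
TRACK: reconstruction
JOBNAME: dc1_recon_{run}
COMMAND: run_recon.csh {run}
CPU: 32
MEMORY: 4 GB
{inputs}
"""

for run, files in sorted(RUNS.items()):
    # /mss paths are staged from tape (Jasmine) before the job starts.
    inputs = "\n".join("INPUT_FILES: " + f for f in files)
    name = "recon_%06d.jsub" % run
    with open(name, "w") as out:
        out.write(TEMPLATE.format(run=run, inputs=inputs))
    print("wrote %s (%d input files)" % (name, len(files)))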
I think the Data Challenge should really pick one of these activities
and focus on doing it. I understand we can't test everything exactly as
it will be in the final system. However, I'm very concerned that doing a
single data challenge and claiming it applies to both is just not the
best use of our time. A single challenge is going to test only bits and
pieces of any complete system we would devise for either of the actual
activities we'll be doing. I agree with Mark that we need to focus on a
SYSTEM, but I think it needs to be one that can be grown into the final
system rather than a throw-away that just does some proofs-of-principle.
Here is what I think the steps of a Reconstruction Data Challenge should
be. These overlap closely with the goals Curtis put in the note.
1. Generate a large dataset (equivalent to 1 week of low-luminosity data)
- Data should include Pythia+bggen events with some appropriate amount of
(5pi) and (3pi) events mixed in
- The data should be in EVIO format with values specified by
Crate/Slot/Channel. We don't need the final translation table; a made-up
one would be fine (see the stand-in sketch below)
- This step doesn't test anything, just sets the stage for the actual
challenge. It does, of course, entail a considerable amount of work
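
To make the made-up translation table concrete: something as simple as
the following would do for the challenge. Every entry here is invented
for illustration; the real mapping will eventually come from the DAQ
side.

# Hypothetical stand-in translation table: maps a DAQ address
# (crate, slot, channel) in the EVIO stream to a detector system and
# element. All values are invented.

FAKE_TRANSLATION_TABLE = {
    (1, 3, 0):  ("CDC",  "ring01_straw001"),
    (1, 3, 1):  ("CDC",  "ring01_straw002"),
    (12, 5, 7): ("FCAL", "row10_col22"),
    (20, 2, 3): ("BCAL", "module01_layer2_sector3"),
}

def lookup(crate, slot, channel):
    """Return (system, element) for a DAQ address, or None if unmapped."""
    return FAKE_TRANSLATION_TABLE.get((crate, slot, channel))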
2. Process the data on the farm and record it in DST files on the silo
- A mechanism to generate scripts for the jobs must be developed
- Bookkeeping of failed/succeeded jobs must be done so they can be
resubmitted if necessary. A database should be developed (perhaps
rudimentary, but with the intent to grow it into the final version; see
the sketch after this list)
- Run filters on the reconstructed dataset (same farm job) to pull out
all (2pi+ 2pi-) inclusive events and write them to a file. Also pull out
some prescaled (pi+ pi- pi0) events to a second file
- Put all DST files on the tape silo
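
For the bookkeeping piece, even a single SQLite file would get us
started and could be grown into (or migrated to) the final version
later. A minimal sketch, with a table layout I have simply made up:

# Minimal sketch of a job-bookkeeping database, assuming SQLite as the
# rudimentary backend. The schema is invented for illustration.

import sqlite3

conn = sqlite3.connect("dc_jobs.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        job_id      INTEGER PRIMARY KEY,
        run         INTEGER NOT NULL,
        input_files TEXT    NOT NULL,   -- comma-separated /mss paths
        status      TEXT    NOT NULL,   -- submitted / done / failed
        attempts    INTEGER DEFAULT 1,
        log_path    TEXT
    )""")
conn.commit()

def mark(job_id, status):
    """Record a job's new status, e.g. while parsing farm logs."""
    conn.execute("UPDATE jobs SET status=? WHERE job_id=?", (status, job_id))
    conn.commit()

def failed_jobs():
    """Jobs to resubmit individually, each with its original file set."""
    return conn.execute("SELECT job_id, run, input_files FROM jobs "
                        "WHERE status='failed'").fetchall()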
3. Transfer the data to offsite analysis sites
- Two sites should transfer the DSTs. Each should focus on just one of
the two filtered data sets
- Analyze the DSTs in two stages: a) kinematic fitting, b) amplitude
analysis using AmpTools
I think it will take 4-6 months to get to the point where step 1 is
complete, and possibly another 1-2 months for step 2, including database
development and the development and/or implementation of monitoring
tools. In the end, we will have a lot more infrastructure in place to
handle the real-data reconstruction chain.
Regards,
-David
On 7/2/12 9:36 AM, Mark M. Ito wrote:
> David et al.,
>
> I did not mean to imply that there should be two data challenges.
>
> 1) one challenge is enough to start with
> 2) the real-data challenge involves simulation and reconstruction, and
> the simulated-data challenge involves simulation and reconstruction
> (since we do not have real data), so I'm not sure that I see the
> essential difference
>
> To me the "challenge" part of the data challenge is principally
> creating a _system_ to do large scale things. We know we can do each
> of the individual parts: simulate, smear, reconstruct, make
> histograms, do fits, etc. at some level. And we can even do it on a
> large-ish scale if we really had to. To get a day's worth of
> statistics we just repeat what Jake did 8 times. And that is the
> point, we do not want to have Jake do it 8 times, notwithstanding the
> fine young man that he is. We want a system. And then we want to judge
> its performance. Therein lies the challenge.
>
> My other meta-point is that we start with something modest and add
> capability and correctness. To me that is a system for simulation and
> a system for reconstruction. Other things are relatively simple from a
> system point of view. Those were the two bullets in my "initial scope"
> proposal. Another way of saying this is that I think the main planning
> job now is to eliminate tasks and features that we know full well we
> will need eventually, but whose addition (and discussion!) would slow
> us down initially. Let's not try to keep too many balls in the air.
>
> -- Mark
>
> On 07/01/2012 11:30 PM, David Lawrence wrote:
>>
>> Hi All,
>>
>>
>> I think this is a great conversation and a great start to gathering
>> some specifics regarding the Data Challenge. Here are a couple of
>> comments:
>>
>> 1.) Mark may have alluded to this in his blog, but I think we should
>> make two distinct Data Challenges. One that will address how we deal
>> with simulated data and the other on how we deal with real data. Our
>> plan up to now, as I understand it, is to produce and analyze
>> simulated data and store only the DST (or REST format as described at
>> the last software meeting). I think that should be done as a separate
>> exercise from the real data Data Challenge exercise.
>>
>> 2.) For simulated data, there are two mechanisms we expect to use:
>> a.) the JLab Scientific Computing farm and b.) the GRID. There will
>> be some parts that the two systems will share, but some aspects will
>> be different. One decision we need to make now is whether to test
>> both of these in the 2013 Simulated Data Challenge, or focus only on
>> the JLab farm.
>>
>> 3.) For the real data Data Challenge, I think we should focus on
>> generating the data on the JLab farm. Our baseline plan as
>> presented at the Software Review did not require large amounts of
>> data to be imported to the JLab silo from offsite producers. Adding
>> that mechanism to the 2013 Data Challenge(s) might make things more
>> complicated.
>>
>> 4.) For the real data Data Challenge, we should double-check how much
>> we save in CPU by storing the output of hdgeant rather than mcsmear
>> (section 4.1 of the document). Ideally, I'd like to push to get evio
>> data written out that is close to what the raw data format will be.
>> This will force us to exercise additional database systems that could
>> be trouble spots in offline processing.
>>
>> That's my 2 cents for now.
>>
>> Regards,
>> -David
>>
>> ----- Original Message -----
>> From: "Mark M. Ito" <marki at jlab.org>
>> To: cmeyer at ernest.phys.cmu.edu
>> Cc: halld-offline at jlab.org
>> Sent: Saturday, June 30, 2012 6:00:49 PM
>> Subject: Re: [Halld-offline] [Halld-physics] Next Data Challenge...
>>
>> Curtis,
>>
>> I have outlined some thoughts at this URL:
>>
>> http://markito3.wordpress.com/2012/06/30/ideas-for-a-data-challenge/
>>
>> I think that they are somewhat complementary to the ideas you put into
>> the document.
>>
>> -- Mark
>>
>> On 06/28/2012 02:47 PM, Curtis A. Meyer wrote:
>>> Dear Colleagues -
>>>
>>> as discussed in the offline meeting yesterday, we want to start
>>> planning for the next data challenge. People agreed to email me their
>>> ideas, and I would then summarize them in a document to be discussed
>>> at the next meeting. In order to get this process going, I started to
>>> put this together as GlueX-doc-2031
>>>
>>> http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2031
>>>
>>>
>>> Please feel free to use this as a starting point, as it will
>>> probably help us converge sooner.
>>>
>>> thanks - Curtis
>>>
>>
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline
>