[Halld-offline] [Halld-physics] Next Data Challenge...

Matthew Shepherd mashephe at indiana.edu
Mon Jul 2 09:50:49 EDT 2012


I see it as multiple steps to the same data challenge; however, I might argue about what the steps really are.  In my opinion the critical steps are reconstruction and analysis.  Of course, to do this we also have to simulate.  However, I think testing how multiple users utilize a single large reconstructed sample of data is a critical aspect of this challenge, along with how we manage processing the raw data into REST format and storing it.  It is something that we have not done much of, but it will be essential for any production analysis work.

The reason I'm so keen on the analysis step being part of the challenge is that it is essential for validating the reconstruction step.  We do not have the CPU for each GlueXer to run their own reconstruction, and we are planning for analysis jobs to run an order of magnitude faster than reconstruction jobs.  It is really important to push on the system to develop a common reconstruction that can feed multiple user-level analysis streams.  These analysis streams need to be more than generating a histogram -- they must involve things like kinematic fitting in order to demonstrate that sophisticated analysis algorithms can be run with just the REST format data.
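
A rough sketch of the CPU argument above, with illustrative numbers only: the user count and the CPU-hours per reconstruction pass are assumptions made up for the example, not GlueX estimates, and the function name is hypothetical. The factor of 10 stands in for the planned order-of-magnitude speedup of analysis over reconstruction.

```python
def total_cpu_hours(n_users, reco_hours, analysis_speedup=10.0, shared_reco=True):
    """CPU-hours needed for n_users to each run one analysis pass.

    reco_hours       -- assumed cost of one reconstruction pass over the sample
    analysis_speedup -- assumed ratio of reconstruction time to analysis time
    shared_reco      -- True: reconstruct once for everyone; False: once per user
    """
    analysis_hours = reco_hours / analysis_speedup
    reco_passes = 1 if shared_reco else n_users
    return reco_passes * reco_hours + n_users * analysis_hours


# Hypothetical inputs: 20 users, 1000 CPU-hours per reconstruction pass.
shared = total_cpu_hours(20, 1000.0, shared_reco=True)
per_user = total_cpu_hours(20, 1000.0, shared_reco=False)
print(shared, per_user)
```

With these made-up inputs the common reconstruction costs 3000 CPU-hours in total, against 22000 if every user reconstructs privately; the gap grows with the number of users.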

Matt


On Jul 2, 2012, at 9:36 AM, Mark M. Ito wrote:

> David et al.,
> 
> I did not mean to imply that there should be two data challenges.
> 
> 1) one challenge is enough to start with
> 2) the real-data challenge involves simulation and 
> reconstruction, and the simulated-data challenge involves simulation and 
> reconstruction (since we do not have real data), so I'm not sure that I see 
> the essential difference
> 
> To me the "challenge" part of the data challenge is principally creating 
> a _system_ to do large scale things. We know we can do each of the 
> individual parts: simulate, smear, reconstruct, make histograms, do 
> fits, etc. at some level. And we can even do it on a large-ish scale if 
> we really had to. To get a day's worth of statistics we just repeat what 
> Jake did 8 times. And that is the point, we do not want to have Jake do 
> it 8 times, notwithstanding the fine young man that he is. We want a 
> system. And then we want to judge its performance. Therein lies the 
> challenge.
> 
> My other meta-point is that we start with something modest and add 
> capability and correctness. To me that is a system for simulation and a 
> system for reconstruction. Other things are relatively simple from a 
> system point of view. Those were the two bullets in my "initial scope" 
> proposal. Another way of saying this is that I think the main planning 
> job now is to eliminate tasks and features that we fully know we will 
> need eventually, but whose addition (and discussion!) would slow us down 
> initially. Let's not try to keep too many balls in the air.
> 
>   -- Mark
> 
> On 07/01/2012 11:30 PM, David Lawrence wrote:
>> 
>> Hi All,
>> 
>> 
>>   I think this is a great conversation and a great start to gathering some specifics with regard to the Data Challenge. Here are a couple of comments:
>> 
>> 1.) Mark may have alluded to this in his blog, but I think we should make two distinct Data Challenges. One that will address how we deal with simulated data and the other on how we deal with real data. Our plan up to now, as I understand it, is to produce and analyze simulated data and store only the DST (or REST format as described at the last software meeting). I think that should be done as a separate exercise from the real data Data Challenge exercise.
>> 
>> 2.) For simulated data, there are two mechanisms we expect to use: a.) the JLab Scientific Computing farm and b.) the GRID. There will be some parts that the two systems will share, but some aspects will be different. One decision we need to make now is whether to test both of these in the 2013 Simulated Data Challenge, or focus only on the JLab farm.
>> 
>> 3.) For the real data Data Challenge, I think we should focus on generating the data on the JLab farm.   Our baseline plan as presented at the Software Review did not require large amounts of data to be imported to the JLab silo from offsite producers. Adding that mechanism to the 2013 Data Challenge(s) might make things more complicated.
>> 
>> 4.) For the real data Data Challenge, we should double check how much we save in CPU by storing the output of hdgeant rather than mcsmear (section 4.1 of document). Ideally, I'd like to push to get evio data written out that is close to what the raw data format will be. This will force us to check additional database systems that could be potential issues in offline processing.
>> 
>> That's my 2 cents for now.
>> 
>> Regards,
>> -David
>> 
>> ----- Original Message -----
>> From: "Mark M. Ito" <marki at jlab.org>
>> To: cmeyer at ernest.phys.cmu.edu
>> Cc: halld-offline at jlab.org
>> Sent: Saturday, June 30, 2012 6:00:49 PM
>> Subject: Re: [Halld-offline] [Halld-physics] Next Data Challenge...
>> 
>> Curtis,
>> 
>> I have outlined some thoughts at this URL:
>> 
>> http://markito3.wordpress.com/2012/06/30/ideas-for-a-data-challenge/
>> 
>> I think that they are somewhat complementary to the ideas you put into
>> the document.
>> 
>>    -- Mark
>> 
>> On 06/28/2012 02:47 PM, Curtis A. Meyer wrote:
>>> Dear Colleagues -
>>> 
>>>     as discussed in the offline meeting yesterday, we want to start
>>> planning for the next data
>>> challenge. People agreed to email me their ideas and I would then
>>> summarize them in a
>>> document that could then be discussed at the next meeting. In order to
>>> get this process going,
>>> I started to put this together as GlueX-doc-2031
>>> 
>>> http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2031
>>> 
>>> Please feel free to use this as a starting point, as it will probably help us converge sooner.
>>> 
>>>      thanks - Curtis
>>> 
>> 
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline
> 
> -- 
> Mark M. Ito
> Jefferson Lab (www.jlab.org)
> (757)269-5295
> 
> 
> 




