[Halld-offline] Data Challenge Meeting Minutes, February 28, 2014

David Lawrence davidl at jlab.org
Mon Mar 3 08:16:14 EST 2014


Hi Justin,

  Can you clarify a bit? It sounds like you’re saying that all 700 jobs failed to process all events and that all 700 had small REST files, in other words a 100% failure rate. Is this correct?

-David

On Mar 3, 2014, at 6:52 AM, Justin Stevens <jrsteven at mit.edu> wrote:

> Hi Data Challengers,
> 
> Over the weekend I was running test jobs with branches/sim-recon-dc-2 and branches/hdds-dc-2 (checked out as of the time of Friday's meeting), and in hd_root (i.e., when creating the REST file) I switched to using the option -PHDDM:USE_COMPRESSION=0.  With 700 jobs finished so far, I haven't seen any jobs where the monitoring histograms showed that the full 25K events were processed but the REST file was small and didn't contain all 25K events I'm generating.  When I was running with compression turned on earlier, the fraction of "small" REST files was something like 2-3%.  
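A quick way to spot the "small" REST files Justin describes, once a batch of jobs has finished, is a simple size cut over the output directory. This is only a sketch: the rest_output directory name, the dana_rest_*.hddm file-name pattern, and the 1 MiB cutoff are assumptions for illustration, not part of Justin's actual job setup.

```python
# Sketch (assumed layout): flag REST files under a size cutoff in the job
# output directory so they can be cross-checked against the monitoring
# histograms. Tune cutoff to the expected size of a full 25K-event file.
from pathlib import Path

def small_rest_files(outdir="rest_output", cutoff=1024 * 1024):
    # Return REST-file paths whose on-disk size is below the cutoff, sorted.
    return sorted(p for p in Path(outdir).glob("dana_rest_*.hddm")
                  if p.stat().st_size < cutoff)
```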
> 
> These jobs were all using the 10^7 EM background with a gate of +/-800 ns, and I had one job that crashed in hdgeant, similar to what Paul has already reported:  
> 
> !!!!! ZFATAL called from MZPUSH
> 
> !!!!! ZFATAL reached from MZPUSH    for Case=  3
> 
> FYI,
> Justin
> 
> On Mar 2, 2014, at 10:02 PM, Mark Ito wrote:
> 
>> Colleagues,
>> 
>> Please find the minutes below and at
>> 
>> https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014#Minutes
>> 
>> -- Mark
>> _________________________________________________________
>> 
>> GlueX Data Challenge Meeting, February 28, 2014
>> Minutes
>> 
>> Present:
>> * CMU: Paul Mattione, Curtis Meyer
>> * FSU: Volker Crede, Priyashree Roy, Aristeidis Tsaris
>> * IU: Kei Moriya
>> * JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Simon Taylor
>> * NU: Sean Dobbs
>> * UConn: Richard Jones
>> 
>> Announcements
>> 
>> * Mark announced an [21]update of the branch. Changes include:
>> 1. A fix from Simon for single-ended TOF counters.
>> 2. Improvements from Paul for cutting off processing for
>> multi-lap curling tracks.
>> 3. A change from David Lawrence to [22]allow compression to be
>> turned off in producing REST format data.
>> o David noticed that all three of the programs hdgeant,
>> mcsmear, and DANA produce HDDM-like output, but only
>> DANA has compression turned on (REST data in this case).
>> This feature will allow us to test whether compression has
>> anything to do with short REST files. On a side note, David
>> reported that the short-REST-file problem was not
>> reproducible. Mark produced some example
>> hdgeant_smeared.hddm files that had given short output,
>> for David to test with.
>> 
>> Running Jobs at JLab
>> 
>> Mark has submitted some test jobs against the new branch. [Added in
>> press: 1,000 50k-event jobs have been submitted.]
>> 
>> Status of Preparations
>> 
>> Random number seeds procedure
>> 
>> Paul spoke to David about this. It seems that mcsmear is currently
>> generating its own random number seed. We still have details to fill in
>> on this story.
>> 
>> Running Jobs at FSU
>> 
>> FSU has started running data challenge test jobs on their cluster.
>> Aristeidis has started with 50 jobs, but an early look shows problems
>> with some of them in hd_root. Also there was the GlueX-software-induced
>> crash of the FSU cluster[?].
>> 
>> Running jobs at CMU
>> 
>> Paul is seeing ZFATAL errors from hdgeant. He will send a bug report to
>> Richard who will look into a fix beyond merely increasing ZEBRA memory.
>> 
>> Richard asked about an issue where JANA takes a long time to identify a
>> CODA file as not an HDDM file. Richard would like to fix the HDDM
>> parser such that this is not the case. Mark D. will send Richard an
>> example.
>> 
>> Running Jobs at NU
>> 
>> Sean regaled us with tales of site-specific problems.
>> 
>> Lots of jobs crashed at REST generation. Site configuration changes
>> helped. But there were still a lot of jobs hanging, usually with new
>> nodes. Reducing the number of submit slots fixed most of the problems.
>> Many of the remaining symptoms were jobs hung on the first event when
>> accessing the magnetic field. Jobs are single-threaded. [23]Some
>> statistics on the results were presented as well.
>> 
>> Richard remarked that on the OSG, jobs will start much faster if
>> declared as single-threaded.
>> 
>> Richard proposed the following standards:
>> 
>> * BGRATE 1.1 (EM background rate, equivalent to 10^7)
>> * BGGATE -800 800 (time gate for EM background addition, in ns)
>> 
>> We agreed on these as standard settings.
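For reference, the agreed values would appear as control cards in the hdgeant input file along these lines. This is a hedged sketch: the minutes give only the card names and values, and the surrounding file layout is an assumption.

```
BGRATE 1.1
BGGATE -800 800
```

BGRATE sets the EM background rate (1.1 being equivalent to 10^7), and BGGATE the time gate, in ns, over which background is added.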
>> 
>> Mark proposed the following split of running:
>> 
>> * 15% with no EM background
>> * 70% with EM background corresponding to 10^7
>> * 15% with EM background corresponding to 5×10^7
>> 
>> There was general agreement; adjustment may happen in the future.
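As an arithmetic illustration of the proposed split, here is a short sketch (the function name, dictionary keys, and rounding choice are mine, not part of the minutes) that partitions a total job count into the three background conditions:

```python
# Sketch: divide n_jobs among the three EM-background conditions using the
# agreed 15% / 70% / 15% split. Rounding remainders are folded into the
# nominal 10^7 bucket so the counts always sum to n_jobs exactly.
def split_jobs(n_jobs):
    no_bg = round(0.15 * n_jobs)          # no EM background
    high_bg = round(0.15 * n_jobs)        # 5x10^7 EM background
    nominal = n_jobs - no_bg - high_bg    # 10^7 EM background gets the rest
    return {"none": no_bg, "1e7": nominal, "5e7": high_bg}
```

For 1,000 jobs this gives 150 with no background, 700 at 10^7, and 150 at 5×10^7.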
>> 
>> Running Jobs at MIT
>> 
>> Justin has been running with the dc-2.2 tag. The OpenStack cluster at
>> MIT has about 180 cores and he has been running jobs for a couple of
>> days with good success. BGGATE was set at -200 to 200.
>> 
>> Electromagnetic Background
>> 
>> Kei gave us an [24]update on his studies of EM background with hdds-2.0
>> and sim-recon-dc-2.1. Slides covered:
>> * Memory Usage
>> * CPU time
>> * mcsmear File Sizes
>> * REST File Sizes
>> * Another Bad File
>> * Sum of parentid=0
>> * Correlation of CDC hits
>> * Correlation of FDC hits
>> * pπ^+π^- Events
>> 
>> Proposed Schedule
>> 
>> The schedule has slipped. The new schedule is as follows:
>> 1. Launch of Data Challenge Thursday March 6, 2014 (est.).
>> 2. Test jobs going successfully by Tuesday March 4.
>> 3. Distribution ready by Monday March 3.
>> 
>> Justin pointed out that the short REST file problem might be something
>> that we could live with for this data challenge.
>> 
>> Richard asked that Mark assign run numbers and run conditions for the
>> various sites.
>> 
>> Action Items
>> 
>> 1. Understand random number seed system.
>> 2. Solve ZFATAL crashes.
>> 3. Make a table of conditions vs. sites where the entries are assigned
>> file numbers.
>> 4. Report the broken Polycom in L207.
>> 
>> Retrieved from
>> "https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014"
>> 
>> References
>> 
>> 21. 
>> https://mailman.jlab.org/pipermail/halld-offline/2014-February/001511.html
>> 22. 
>> https://mailman.jlab.org/pipermail/halld-offline/2014-February/001512.html
>> 23. 
>> https://halldweb1.jlab.org/wiki/images/f/f8/DC2-Meeting-sdobbs-20140228.pdf
>> 24. https://halldweb1.jlab.org/wiki/images/e/ea/2014-02-28-DC2.pdf
>> 
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline
> 
> 



