[Halld-offline] Data Challenge Meeting Minutes, February 28, 2014
Richard Jones
richard.t.jones at uconn.edu
Mon Mar 3 10:05:41 EST 2014
Justin,
Good catch. I am diagnosing it now and expect to issue a fix by the end of
tomorrow at the latest.
-Richard J.
On Mon, Mar 3, 2014 at 8:28 AM, David Lawrence <davidl at jlab.org> wrote:
>
> Phew! That’s good. You had me worried for a minute! Now that I re-read
> your e-mail, I see what you were saying the first time around. Thanks for
> the clarification.
>
>
> -David
>
> On Mar 3, 2014, at 8:24 AM, Justin Stevens <jrsteven at mit.edu> wrote:
>
> > Hi David,
> >
> > Sorry, what I was trying to say is that the failure rate I'm seeing is 0%.
> > All 700 jobs processed all of the events, and they all had a correct-size
> > REST file containing all the events. So with compression turned off I don't
> > see the problem with small REST files anymore. Sorry for the poor wording
> > early in the morning... it was before caffeine.
> >
> > -Justin
> >
> > On Mar 3, 2014, at 8:16 AM, David Lawrence wrote:
> >
> >>
> >> Hi Justin,
> >>
> >> Can you clarify a bit? It sounds like you're saying that all 700 jobs failed
> >> to process all events and that all 700 had small REST files, in other words a
> >> 100% failure rate. Is that correct?
> >>
> >> -David
> >>
> >> On Mar 3, 2014, at 6:52 AM, Justin Stevens <jrsteven at mit.edu> wrote:
> >>
> >>> Hi Data Challengers,
> >>>
> >>> Over the weekend I was running test jobs with branches/sim-recon-dc-2
> >>> and branches/hdds-dc-2 (checked out as of the time of Friday's meeting),
> >>> and in hd_root (i.e., creating the REST file) I switched to using the option
> >>> -PHDDM:USE_COMPRESSION=0. With 700 jobs finished so far, I haven't seen
> >>> any jobs where the monitoring histograms showed the full 25K events were
> >>> processed but the REST file was small and didn't contain all 25K events
> >>> I'm generating. The fraction of "small" REST files was something like 2-3%
> >>> when I was running with compression turned on earlier.
> >>>
> >>> These jobs were all using the 10^7 EM background with a gate of +/-800
> >>> ns, and I had one job that crashed in hdgeant, similar to what Paul has
> >>> already reported:
> >>>
> >>> !!!!! ZFATAL called from MZPUSH
> >>>
> >>> !!!!! ZFATAL reached from MZPUSH for Case= 3
> >>>
> >>> FYI,
> >>> Justin
> >>>
> >>> On Mar 2, 2014, at 10:02 PM, Mark Ito wrote:
> >>>
> >>>> Colleagues,
> >>>>
> >>>> Please find the minutes below and at
> >>>>
> >>>>
> https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014#Minutes
> >>>>
> >>>> -- Mark
> >>>> _________________________________________________________
> >>>>
> >>>> GlueX Data Challenge Meeting, February 28, 2014
> >>>> Minutes
> >>>>
> >>>> Present:
> >>>> * CMU: Paul Mattione, Curtis Meyer
> >>>> * FSU: Volker Crede, Priyashree Roy, Aristeidis Tsaris
> >>>> * IU: Kei Moriya
> >>>> * JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Simon Taylor
> >>>> * NU: Sean Dobbs
> >>>> * UConn: Richard Jones
> >>>>
> >>>> Announcements
> >>>>
> >>>> * Mark announced an [21]update of the branch. Changes include:
> >>>> 1. A fix from Simon for single-ended TOF counters.
> >>>> 2. Improvements from Paul for cutting off processing for
> >>>> multi-lap curling tracks.
> >>>> 3. A change from David Lawrence to [22]allow compression to be
> >>>> turned off in producing REST format data.
> >>>> o David noticed that all three of the programs hdgeant,
> >>>> mcsmear, and DANA produce HDDM-like output, but only
> >>>> DANA has compression turned on (for REST data in this case).
> >>>> This option will let us test whether compression has anything
> >>>> to do with the short REST files. On a side note, David
> >>>> reported that the short-REST-file problem was not reproducible
> >>>> for him. Mark supplied some example hdgeant_smeared.hddm files
> >>>> that had produced short output for him to test with (see the
> >>>> example command after this list).
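> >>>>
> >>>> As a concrete illustration, the new option would be passed to hd_root
> >>>> on the command line in the usual JANA -P form, something like
> >>>>
> >>>>   hd_root -PPLUGINS=danarest -PHDDM:USE_COMPRESSION=0 hdgeant_smeared.hddm
> >>>>
> >>>> where only the HDDM:USE_COMPRESSION parameter name is taken from this
> >>>> thread; the danarest plugin name and the input file name are assumptions
> >>>> made for the sake of the example.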
> >>>>
> >>>> Running Jobs at JLab
> >>>>
> >>>> Mark has submitted some test jobs against the new branch. [Added in
> >>>> press: 1,000 jobs of 50 k events each have been submitted.]
> >>>>
> >>>> Status of Preparations
> >>>>
> >>>> Random number seeds procedure
> >>>>
> >>>> Paul spoke to David about this. It seems that mcsmear is currently
> >>>> generating its own random number seed. We still have details to fill in
> >>>> on this story.
> >>>>
> >>>> Running Jobs at FSU
> >>>>
> >>>> FSU has started running data challenge test jobs on their cluster.
> >>>> Aristeidis has started with 50 jobs, but an early look shows problems
> >>>> with some of them in hd_root. Also there was the GlueX-software-induced
> >>>> crash of the FSU cluster[?].
> >>>>
> >>>> Running jobs at CMU
> >>>>
> >>>> Paul is seeing ZFATAL errors from hdgeant. He will send a bug report to
> >>>> Richard, who will look into a fix beyond merely increasing the ZEBRA memory.
> >>>>
> >>>> Richard asked about an issue where JANA takes a long time to identify a
> >>>> CODA file as not being an HDDM file. Richard would like to fix the HDDM
> >>>> parser so that this is not the case. Mark D. will send Richard an
> >>>> example.
> >>>>
> >>>> Running Jobs at NU
> >>>>
> >>>> Sean regaled us with tales of site-specific problems.
> >>>>
> >>>> Lots of jobs crashed at REST generation. Site configuration changes
> >>>> helped. But there were still a lot of jobs hanging, usually with new
> >>>> nodes. Reducing the number of submit slots fixed most of the problems.
> >>>> Many of the remaining symptoms were jobs hung on the first event when
> >>>> accessing the magnetic field. Jobs are single-threaded. [23]Some
> >>>> statistics on the results were presented as well.
> >>>>
> >>>> Richard remarked that on the OSG, jobs will start much faster if
> >>>> declared as single-threaded.
> >>>>
> >>>> Richard proposed the following standards:
> >>>>
> >>>> BGRATE 1.1 (equivalent to 10^7)
> >>>> BGGATE -800 800 (in ns, time gate for EM background addition)
> >>>>
> >>>> We agreed on these as standard settings.
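> >>>>
> >>>> In practice these would presumably be set as cards in the hdgeant
> >>>> control.in file, roughly along the lines of the sketch below. The BGRATE
> >>>> and BGGATE lines carry the agreed values; the INFILE and TRIG lines
> >>>> (input file and number of events per job) are assumed context only.
> >>>>
> >>>>   INFILE 'bggen.hddm'
> >>>>   TRIG 50000
> >>>>   BGRATE 1.1
> >>>>   BGGATE -800 800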
> >>>>
> >>>> Mark proposed the following split of running:
> >>>>
> >>>> 15% with no EM background
> >>>> 70% with EM background corresponding to 10^7
> >>>> 15% with EM background corresponding to 5×10^7
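> >>>>
> >>>> As a rough illustration, a site running 1,000 jobs under this split would
> >>>> submit about 150 jobs with no EM background, 700 at 10^7, and 150 at
> >>>> 5×10^7.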
> >>>>
> >>>> There was general agreement; adjustment may happen in the future.
> >>>>
> >>>> Running Jobs at MIT
> >>>>
> >>>> Justin has been running with the dc-2.2 tag. The OpenStack cluster at
> >>>> MIT has about 180 cores and he has been running jobs for a couple of
> >>>> days with good success. BGGATE was set at -200 to 200.
> >>>>
> >>>> Electromagnetic Background
> >>>>
> >>>> Kei gave us an [24]update on his studies of EM background with hdds-2.0
> >>>> and sim-recon-dc-2.1. Slides covered:
> >>>> * Memory Usage
> >>>> * CPU time
> >>>> * mcsmear File Sizes
> >>>> * REST File Sizes
> >>>> * Another Bad File
> >>>> * Sum of parentid=0
> >>>> * Correlation of CDC hits
> >>>> * Correlation of FDC hits
> >>>> * pπ^+π^- Events
> >>>>
> >>>> Proposed Schedule
> >>>>
> >>>> The schedule has slipped. The new schedule is as follows:
> >>>> 1. Launch of Data Challenge Thursday March 6, 2014 (est.).
> >>>> 2. Test jobs going successfully by Tuesday March 4.
> >>>> 3. Distribution ready by Monday March 3.
> >>>>
> >>>> Justin pointed out that the short REST file problem might be something
> >>>> that we could live with for this data challenge.
> >>>>
> >>>> Richard asked that Mark assign run numbers and run conditions for the
> >>>> various sites.
> >>>>
> >>>> Action Items
> >>>>
> >>>> 1. Understand random number seed system.
> >>>> 2. Solve ZFATAL crashes.
> >>>> 3. Make a table of conditions vs. sites where the entries are assigned
> >>>> file numbers.
> >>>> 4. Report the broken Polycom in L207.
> >>>>
> >>>> Retrieved from
> >>>> "https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014"
> >>>>
> >>>> References
> >>>>
> >>>> 21. https://mailman.jlab.org/pipermail/halld-offline/2014-February/001511.html
> >>>> 22. https://mailman.jlab.org/pipermail/halld-offline/2014-February/001512.html
> >>>> 23. https://halldweb1.jlab.org/wiki/images/f/f8/DC2-Meeting-sdobbs-20140228.pdf
> >>>> 24. https://halldweb1.jlab.org/wiki/images/e/ea/2014-02-28-DC2.pdf
> >>>>
> >>>
> >>>
> >>
> >
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline
>