[Halld-offline] Data Challenge Meeting Minutes, February 28, 2014
Sean Dobbs
s-dobbs at northwestern.edu
Wed Mar 5 13:29:55 EST 2014
From NU: I ran 500 jobs of 20k events each with EM backgrounds, and got 4
small REST files (with compression enabled). The job time distributions
have essentially the same shape as the ones I showed on Monday, just with
the x-axis multiplied by two.
I think that Paul's explanation of the small REST file distribution being
related to these big events sounds plausible. Over the weekend I ran a
bunch of batch jobs with hd_ana running under gdb, and the only jobs I
saw that had any problems were ones that took too long in track
reconstruction.
On Wed, Mar 5, 2014 at 12:27 PM, Sean Dobbs <seandobbs at gmail.com> wrote:
>
> On Wed, Mar 5, 2014 at 9:14 AM, Paul Mattione <pmatt at jlab.org> wrote:
>
>> Another update from CMU:
>>
>> Ran 63 jobs of 50k events each, with no EM background and 2 GB of memory
>> assigned per job. No crashes, but I have 5 small REST files, so
>> apparently it's still a problem. Maybe I just got lucky on the last batch
>> (also, note that there were half as many events last time).
>>
>> - Paul
>>
>> On Mar 4, 2014, at 10:42 AM, Paul Mattione wrote:
>>
>> Update from CMU:
>>
>> Ran 100 jobs of 25k events each at BGRATE 1.1, BGGATE -800 800, with 2
>> GB of memory assigned per job. All jobs took 23-24 hours to run. No
>> crashes, all stages 100% OK (no small REST or hdgeant files). REST file
>> compression was enabled.
>>
>> Could the small REST file problem have been related to the memory
>> spikes that were happening in the CDC? Has anyone seen small REST files
>> with the latest version of the branch code?
>>
>> The hdgeant crashes are a much bigger problem at BGRATE 5.5; I'm
>> going to run some test jobs at that rate again and make sure I can
>> reproduce them.
>>
>> - Paul
>>
>> On Mar 3, 2014, at 12:11 PM, Sean Dobbs wrote:
>>
>>
>> Hi all,
>>
>> An update from my end: A few jobs are still making their way through
>> the queue, but I've attached a PDF of some interesting job time
>> distributions. I ran 500 jobs simulating 10k events each with the EM
>> background parameters agreed to at the meeting (BGRATE 1.1, BGGATE -800
>> 800). The running times for mcsmear and hd_ana are stable, but I'm
>> seeing a broad distribution of running times for hdgeant. The tails toward
>> longer running times (>7 hours) come from the newer nodes with hyperthreading
>> turned on, so I assume the generation of the EM background is CPU-bound.
>>
>> I'm planning to benchmark some jobs with 20k events, but am trying to
>> reproduce some of these other problems first.
>>
>>
>> Regarding some of the issues I mentioned at the meeting: I had gotten
>> the job failure rate down to ~5%, but eventually figured out that the
>> remaining jobs were failing because I had forgotten that the thread
>> timeout parameters for the first event and for all subsequent events are
>> different. These jobs would sometimes not exit properly and hang around
>> in the queue because they were waiting on a lock in the destructor for
>> DGeometry - I imagine the main thread was in the middle of a tracking
>> calculation using the material maps when it was killed.
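>>
>> For reference, here is a minimal sketch (Python) of how a site job script
>> might set the two timeouts separately when invoking hd_ana. The parameter
>> names THREAD_TIMEOUT and THREAD_TIMEOUT_FIRST_EVENT are my assumption about
>> the JANA configuration keys involved, so check them against your build:
>>
>>     # Hypothetical wrapper: run hd_ana with distinct first-event and
>>     # per-event thread timeouts (parameter names are assumptions).
>>     import subprocess
>>
>>     def run_hd_ana(input_file, per_event_timeout=30, first_event_timeout=300):
>>         cmd = ["hd_ana",
>>                "-PTHREAD_TIMEOUT=%d" % per_event_timeout,
>>                "-PTHREAD_TIMEOUT_FIRST_EVENT=%d" % first_event_timeout,
>>                input_file]
>>         return subprocess.call(cmd)
>>
>>     run_hd_ana("hdgeant_smeared.hddm")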
>>
>> Also, I can report that hdgeant with EM simulation turned on runs fine on
>> machines with ~500 MB of physical memory per core, but runs very slowly on
>> machines with ~250 MB per core (we have a few of these), which isn't
>> terribly surprising, I imagine.
>>
>>
>>
>> On Sun, Mar 2, 2014 at 9:02 PM, Mark Ito <marki at jlab.org> wrote:
>>
>>> Colleagues,
>>>
>>> Please find the minutes below and at
>>>
>>>
>>> https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014#Minutes
>>>
>>> -- Mark
>>> _________________________________________________________
>>>
>>> GlueX Data Challenge Meeting, February 28, 2014
>>> Minutes
>>>
>>> Present:
>>> * CMU: Paul Mattione, Curtis Meyer
>>> * FSU: Volker Crede, Priyashree Roy, Aristeidis Tsaris
>>> * IU: Kei Moriya
>>> * JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Simon Taylor
>>> * NU: Sean Dobbs
>>> * UConn: Richard Jones
>>>
>>> Announcements
>>>
>>> * Mark announced an [21]update of the branch. Changes include:
>>> 1. A fix from Simon for single-ended TOF counters.
>>> 2. Improvements from Paul for cutting off processing for
>>> multi-lap curling tracks.
>>> 3. A change from David Lawrence to [22]allow compression to be
>>> turned off in producing REST format data.
>>> o David noticed that all three of the programs hdgeant,
>>> mcsmear, and DANA produce HDDM-like output, but only
>>> DANA has compression turned on (REST data in this case).
>>> This feature will allow us to test whether compression has
>>> anything to do with short REST files. On a side note, David
>>> reported that the short-REST-file problem was not reproducible.
>>> Mark produced some example hdgeant_smeared.hddm files
>>> that gave short output, for him to test with.
>>>
>>> Running Jobs at JLab
>>>
>>> Mark has submitted some test jobs against the new branch. [Added in
>>> press: 1,000 50k-event jobs have been submitted.]
>>>
>>> Status of Preparations
>>>
>>> Random number seeds procedure
>>>
>>> Paul spoke to David about this. It seems that mcsmear is currently
>>> generating its own random number seed. We still have details to fill in
>>> on this story.
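>>>
>>> [For illustration only: one way sites could derive reproducible,
>>> non-overlapping per-job seeds, assuming the procedure ends up taking an
>>> explicit seed per job. This is a hedged sketch, not the agreed scheme;
>>> the helper name and formula are hypothetical.]
>>>
>>>     # Hypothetical sketch (Python): deterministic per-job seed from the
>>>     # run and file numbers, so reruns of the same job reproduce its output.
>>>     def job_seed(run_number, file_number, max_files_per_run=10000):
>>>         # Distinct (run, file) pairs map to distinct seeds.
>>>         return run_number * max_files_per_run + file_number + 1
>>>
>>>     print(job_seed(9001, 42))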
>>>
>>> Running Jobs at FSU
>>>
>>> FSU has started running data challenge test jobs on their cluster.
>>> Aristeidis has started with 50 jobs, but an early look shows problems
>>> with some of them in hd_root. Also there was the GlueX-software-induced
>>> crash of the FSU cluster[?].
>>>
>>> Running jobs at CMU
>>>
>>> Paul is seeing ZFATAL errors from hdgeant. He will send a bug report to
>>> Richard, who will look into a fix beyond merely increasing ZEBRA memory.
>>>
>>> Richard asked about an issue where JANA takes a long time to identify a
>>> CODA file as not an HDDM file. Richard would like to fix the HDDM
>>> parser such that this is not the case. Mark D. will send Richard an
>>> example.
>>>
>>> Running Jobs at NU
>>>
>>> Sean regaled us with tales of site-specific problems.
>>>
>>> Lots of jobs crashed at REST generation. Site configuration changes
>>> helped, but there were still a lot of jobs hanging, usually on the newer
>>> nodes. Reducing the number of submit slots fixed most of the problems.
>>> Many of the remaining symptoms were jobs hung on the first event when
>>> accessing the magnetic field. Jobs are single-threaded. [23]Some
>>> statistics on the results were presented as well.
>>>
>>> Richard remarked that on the OSG, jobs will start much faster if
>>> declared as single-threaded.
>>>
>>> Richard proposed the following standards:
>>>
>>> * BGRATE 1.1 (equivalent to 10^7)
>>> * BGGATE -800 800 (in ns, time gate for EM background addition)
>>>
>>> We agreed on these as standard settings.
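>>>
>>> [A minimal sketch (Python) of applying these settings per job, assuming
>>> the EM background cards live in an hdgeant control.in file with one card
>>> per line; the file layout and helper are illustrative, not the official
>>> procedure.]
>>>
>>>     # Hypothetical helper: rewrite the BGRATE/BGGATE cards in a copy of
>>>     # control.in so every job uses the agreed standard settings.
>>>     import re
>>>
>>>     def set_background_cards(template, output, bgrate=1.1, bggate=(-800.0, 800.0)):
>>>         with open(template) as f:
>>>             text = f.read()
>>>         text = re.sub(r"(?m)^BGRATE\s+.*$", "BGRATE %.1f" % bgrate, text)
>>>         text = re.sub(r"(?m)^BGGATE\s+.*$", "BGGATE %.1f %.1f" % bggate, text)
>>>         with open(output, "w") as f:
>>>             f.write(text)
>>>
>>>     set_background_cards("control.in_template", "control.in")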
>>>
>>> Mark proposed the following split of running:
>>>
>>> * 15% with no EM background
>>> * 70% with EM background corresponding to 10^7
>>> * 15% with EM background corresponding to 5×10^7
>>>
>>> There was general agreement; adjustment may happen in the future.
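>>>
>>> [Purely for illustration, a small Python sketch of dividing a site's job
>>> list according to that split; the condition labels are made up here and
>>> not an agreed naming scheme.]
>>>
>>>     # Hypothetical allocation of N jobs to the agreed background mix.
>>>     def split_jobs(n_jobs):
>>>         n_none = round(0.15 * n_jobs)       # no EM background
>>>         n_low  = round(0.70 * n_jobs)       # EM background at 10^7
>>>         n_high = n_jobs - n_none - n_low    # EM background at 5x10^7
>>>         return {"none": n_none, "1e7": n_low, "5e7": n_high}
>>>
>>>     print(split_jobs(1000))   # {'none': 150, '1e7': 700, '5e7': 150}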
>>>
>>> Running Jobs at MIT
>>>
>>> Justin has been running with the dc-2.2 tag. The OpenStack cluster at
>>> MIT has about 180 cores and he has been running jobs for a couple of
>>> days with good success. BGGATE was set at -200 to 200.
>>>
>>> Electromagnetic Background
>>>
>>> Kei gave us an [24]update on his studies of EM background with hdds-2.0
>>> and sim-recon-dc-2.1. Slides covered:
>>> * Memory Usage
>>> * CPU time
>>> * mcsmear File Sizes
>>> * REST File Sizes
>>> * Another Bad File
>>> * Sum of parentid=0
>>> * Correlation of CDC hits
>>> * Correlation of FDC hits
>>> * pπ^+π^- Events
>>>
>>> Proposed Schedule
>>>
>>> The schedule has slipped. The new schedule is as follows:
>>> 1. Launch of Data Challenge Thursday March 6, 2014 (est.).
>>> 2. Test jobs going successfully by Tuesday March 4.
>>> 3. Distribution ready by Monday March 3.
>>>
>>> Justin pointed out that the short REST file problem might be something
>>> that we could live with for this data challenge.
>>>
>>> Richard asked that Mark assign run numbers and run conditions for the
>>> various sites.
>>>
>>> Action Items
>>>
>>> 1. Understand random number seed system.
>>> 2. Solve ZFATAL crashes.
>>> 3. Make a table of conditions vs. sites where the entries are assigned
>>> file numbers.
>>> 4. Report the broken Polycom in L207.
>>>
>>>
>>> References
>>>
>>> 21. https://mailman.jlab.org/pipermail/halld-offline/2014-February/001511.html
>>> 22. https://mailman.jlab.org/pipermail/halld-offline/2014-February/001512.html
>>> 23. https://halldweb1.jlab.org/wiki/images/f/f8/DC2-Meeting-sdobbs-20140228.pdf
>>> 24. https://halldweb1.jlab.org/wiki/images/e/ea/2014-02-28-DC2.pdf
>>>
>>
>>
>>
>>
>>
>> --
>> Sean Dobbs
>> Department of Physics & Astronomy
>> Northwestern University
>> phone: 847-467-2826
>
--
Sean Dobbs
Department of Physics & Astronomy
Northwestern University
phone: 847-467-2826