<div dir="ltr">Update from UConn,<div><br></div><div>Am having some difficulty reproducing the short jobs problem.  Postponing this issue until others are able to show an explicit example where it reproducibly occurs.  Meanwhile, busy fixing the hdgeant ZFATAL crashes, will issue a fix shortly.  Hint: it is not just a simple matter of increasing the size of the ZEBRA store.  More details to follow.</div>

<div><br></div><div>-Richard J.</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Mar 5, 2014 at 10:14 AM, Paul Mattione <<a href="mailto:pmatt@jlab.org">pmatt@jlab.org</a>> wrote:<br>

<blockquote class="gmail_quote"><div><div>Another update from CMU:</div><div><br></div><div>Ran 63 jobs, 50k-events each, with no EM background, with 2 GB memory assigned for the jobs.  No crashes, but I have 5 small REST files.  So apparently it's still a problem.  Maybe I just got lucky on the last batch (also, note that there were half as many events per job last time).  </div>
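<div><br></div><div>For counting small REST files after a batch, a minimal sketch follows. It assumes the outputs sit in one directory and match a hypothetical pattern such as dana_rest_*.hddm, and it treats any file under half the median size as "small"; both the pattern and the 0.5 cut are illustrative assumptions, not part of any existing job script.</div>

```python
import statistics
from pathlib import Path


def find_small_files(sizes, fraction=0.5):
    """Return the names whose size is below `fraction` of the median size.

    `sizes` maps file name -> size in bytes. The 0.5 default is an
    arbitrary cut chosen for illustration, not a GlueX convention.
    """
    if not sizes:
        return []
    cutoff = fraction * statistics.median(sizes.values())
    return sorted(name for name, size in sizes.items() if cutoff > size)


def scan_directory(directory, pattern="dana_rest_*.hddm"):
    """Collect the sizes of files matching `pattern` (hypothetical naming)."""
    return {p.name: p.stat().st_size for p in Path(directory).glob(pattern)}
```

<div>Passing the result of scan_directory for a batch's output directory to find_small_files would list the suspect files for closer inspection.</div>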

<font color="#888888"><div><br></div><div> - Paul</div></font><div><div class="h5"><br><div><div>On Mar 4, 2014, at 10:42 AM, Paul Mattione wrote:</div><br><blockquote type="cite"><div>

<div>Update from CMU:</div><div><br></div><div>Ran 100 jobs, 25k-events each, at BGRATE 1.1, BGGATE -800, 800, with 2 GB of memory assigned for the jobs.  All jobs took 23-24 hours to run.  No crashes, all stages 100% OK (no small REST or hdgeant files).  REST file compression was enabled.  </div>

<div><br></div><div>Could the small REST file problem have been related to the memory spikes that were happening in the CDC?  Has anyone seen small REST files with the latest version of the branch code?  </div><div><br></div>

<div>The hdgeant crashes are a much bigger problem at BGRATE 5.5; I'm going to run some test jobs at that rate again and make sure I can reproduce them.  </div><div><br></div><div> - Paul</div><br><div><div>On Mar 3, 2014, at 12:11 PM, Sean Dobbs wrote:</div>

<br><blockquote type="cite"><div dir="ltr"><div class="gmail_quote"><br><div dir="ltr">Hi all,<div><br></div><div>An update from my end:  A few jobs are still making their way through the queue, but I've attached a PDF of some interesting job time distributions.  I ran 500 jobs simulating 10k events each with the EM background parameters agreed to at the meeting (BGRATE 1.1, BGGATE -800 800).  The running times for mcsmear and hd_ana are stable, but I'm seeing a large distribution of running time for hdgeant.  The tails to larger running time (>7 hours) are from the newer nodes with hyperthreading turned on, so I assume the generation of the EM background is CPU-bound.</div>


<div><br></div><div>I'm planning to benchmark some jobs with 20k events, but am trying to reproduce some of these other problems first.</div><div><br></div><div><br></div><div>Regarding some of the issues I mentioned at the meeting: I had gotten the job failure rate down to ~5%, but eventually figured out that the jobs were failing here because I had forgotten that the thread timeout parameters for the first event and for all other events are different.  These jobs would sometimes not exit properly and hang around in the queue because they were waiting on a lock in the destructor for DGeometry - I imagine this was because the main thread was in the middle of some tracking calculation using the material maps when it was killed.</div>
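<div><br></div><div>To make the timeout point concrete, here is a small sketch that builds an hd_ana command line with the two timeouts set explicitly. It assumes JANA's -P parameter syntax and the parameter names NTHREADS, THREAD_TIMEOUT and THREAD_TIMEOUT_FIRST_EVENT; these names and the values passed in are assumptions to check against the JANA version in the branch, not settings quoted from any site configuration.</div>

```python
def build_hd_ana_command(input_file, timeout_s, first_event_timeout_s, nthreads=1):
    """Return an argv list for hd_ana with both thread timeouts set.

    The -P parameter names below are assumed, not verified against a
    specific JANA release.
    """
    if timeout_s > first_event_timeout_s:
        # The first event typically triggers one-time setup such as loading
        # geometry and field maps, so give it at least as long as later events.
        raise ValueError("first-event timeout is shorter than the per-event timeout")
    return [
        "hd_ana",
        f"-PNTHREADS={nthreads}",
        f"-PTHREAD_TIMEOUT={timeout_s}",
        f"-PTHREAD_TIMEOUT_FIRST_EVENT={first_event_timeout_s}",
        input_file,
    ]
```

<div>Keeping both values in one place in the submit script makes it harder to change one timeout and forget the other.</div>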


<div><br></div><div>Also, I can report that hdgeant with EM simulation on runs fine on machines with ~500 MB physical memory per core, but runs very slowly on machines with ~250 MB per core, which isn't terribly surprising, I imagine (we have a few of these).</div>


<div><br></div></div><div><div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Mar 2, 2014 at 9:02 PM, Mark Ito <<a href="mailto:marki@jlab.org">marki@jlab.org</a>> wrote:<br>


<blockquote class="gmail_quote">Colleagues,<br>

<br>

Please find the minutes below and at<br>

<br>

<a href="https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014#Minutes">https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014#Minutes</a><br>


<br>

-- Mark<br>

_________________________________________________________<br>

<br>

GlueX Data Challenge Meeting, February 28, 2014<br>

Minutes<br>

<br>

Present:<br>

* CMU: Paul Mattione, Curtis Meyer<br>

* FSU: Volker Crede, Priyashree Roy, Aristeidis Tsaris<br>

* IU: Kei Moriya<br>

* JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Simon Taylor<br>

* NU: Sean Dobbs<br>

* UConn: Richard Jones<br>

<br>

Announcements<br>

<br>

* Mark announced an [21]update of the branch. Changes include:<br>

1. A fix from Simon for single-ended TOF counters.<br>

2. Improvements from Paul for cutting off processing for<br>

multi-lap curling tracks.<br>

3. A change from David Lawrence to [22]allow compression to be<br>

turned off in producing REST format data.<br>

o David noticed that all three of the programs hdgeant,<br>

mcsmear, and DANA produced HDDM-like output, but only<br>

DANA has compression turned on (REST data in this case).<br>

This feature will allow us to test if this has anything<br>

to do with short REST files. On a side note, David<br>

reported that the short-REST-file problem was not reproducible.<br>

Mark produced some example hdgeant_smeared.hddm files<br>

that produced short output for him to test.<br>

<br>

Running Jobs at JLab<br>

<br>

Mark has submitted some test jobs against the new branch. [Added in<br>

press: 1,000 50k-event jobs have been submitted.]<br>

<br>

Status of Preparations<br>

<br>

Random number seeds procedure<br>

<br>

Paul spoke to David about this. It seems that mcsmear is currently<br>

generating its own random number seed. We still have details to fill in<br>

on this story.<br>

<br>

Running Jobs at FSU<br>

<br>

FSU has started running data challenge test jobs on their cluster.<br>

Aristeidis has started with 50 jobs, but an early look shows problems<br>

with some of them in hd_root. Also there was the GlueX-software-induced<br>

crash of the FSU cluster[?].<br>

<br>

Running jobs at CMU<br>

<br>

Paul is seeing ZFATAL errors from hdgeant. He will send a bug report to<br>

Richard who will look into a fix beyond merely increasing ZEBRA memory.<br>

<br>

Richard asked about an issue where JANA takes a long time to identify a<br>

CODA file as not an HDDM file. Richard would like to fix the HDDM<br>

parser such that this is not the case. Mark D. will send Richard an<br>

example.<br>

<br>

Running Jobs at NU<br>

<br>

Sean regaled us with tales of site specific problems.<br>

<br>

Lots of jobs crashed at REST generation. Site configuration changes<br>

helped. But there were still a lot of jobs hanging, usually with new<br>

nodes. Reducing the number of submit slots fixed most of the problems.<br>

Many of the remaining symptoms were jobs hung on the first event when<br>

accessing the magnetic field. Jobs are single-threaded. [23]Some<br>

statistics on the results were presented as well.<br>

<br>

Richard remarked that on the OSG, jobs will start much faster if<br>

declared as single-threaded.<br>

<br>

Richard proposed the following standards:<br>

<br>

* BGRATE 1.1 (equivalent to 10^7)<br>

* BGGATE -800 800 (in ns, time gate for EM background addition)<br>

<br>

We agreed on these as standard settings.<br>

<br>

Mark proposed the following split of running:<br>

<br>

* 15% with no EM background<br>

* 70% with EM background corresponding to 10^7<br>

* 15% with EM background corresponding to 5×10^7<br>

<br>

There was general agreement; adjustment may happen in the future.<br>
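
<br>

As an illustration of the split above, here is a minimal sketch that turns the<br>

percentages into whole job counts. The condition names and the total of 1,000<br>

jobs are placeholders chosen for the example, not assigned values.<br>

```python
def split_jobs(total_jobs, percents):
    """Split total_jobs across conditions given integer percentages.

    Uses largest-remainder rounding so the counts always add up to
    total_jobs. Condition names are hypothetical labels.
    """
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {name: total_jobs * pct // 100 for name, pct in percents.items()}
    remainders = {name: total_jobs * pct % 100 for name, pct in percents.items()}
    leftover = total_jobs - sum(counts.values())
    # Hand the jobs lost to rounding down to the largest remainders first.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


PROPOSED_SPLIT = {"no_em_bg": 15, "em_bg_1e7": 70, "em_bg_5e7": 15}

if __name__ == "__main__":
    print(split_jobs(1000, PROPOSED_SPLIT))
```

The same function could be applied per site once run numbers and conditions are assigned.<br>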

<br>

Running Jobs at MIT<br>

<br>

Justin has been running with the dc-2.2 tag. The OpenStack cluster at<br>

MIT has about 180 cores and he has been running jobs for a couple of<br>

days with good success. BGGATE was set at -200 to 200.<br>

<br>

Electromagnetic Background<br>

<br>

Kei gave us an [24]update on his studies of EM background with hdds-2.0<br>

and sim-recon-dc-2.1. Slides covered:<br>

* Memory Usage<br>

* CPU time<br>

* mcsmear File Sizes<br>

* REST File Sizes<br>

* Another Bad File<br>

* Sum of parentid=0<br>

* Correlation of CDC hits<br>

* Correlation of FDC hits<br>

* pπ^+π^- Events<br>

<br>

Proposed Schedule<br>

<br>

The schedule has slipped. The new schedule is as follows:<br>

1. Launch of Data Challenge Thursday March 6, 2014 (est.).<br>

2. Test jobs going successfully by Tuesday March 4.<br>

3. Distribution ready by Monday March 3.<br>

<br>

Justin pointed out that the short REST file problem might be something<br>

that we could live with for this data challenge.<br>

<br>

Richard asked that Mark assign run numbers and run conditions for the<br>

various sites.<br>

<br>

Action Items<br>

<br>

1. Understand random number seed system.<br>

2. Solve ZFATAL crashes.<br>

3. Make a table of conditions vs. sites where the entries are assigned<br>

file numbers.<br>

4. Report the broken Polycom in L207.<br>

<br>

Retrieved from<br>

"<a href="https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014">https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_February_28,_2014</a>"<br>


<br>

References<br>

<br>

21. <a href="https://mailman.jlab.org/pipermail/halld-offline/2014-February/001511.html">https://mailman.jlab.org/pipermail/halld-offline/2014-February/001511.html</a><br>

22. <a href="https://mailman.jlab.org/pipermail/halld-offline/2014-February/001512.html">https://mailman.jlab.org/pipermail/halld-offline/2014-February/001512.html</a><br>

23. <a href="https://halldweb1.jlab.org/wiki/images/f/f8/DC2-Meeting-sdobbs-20140228.pdf">https://halldweb1.jlab.org/wiki/images/f/f8/DC2-Meeting-sdobbs-20140228.pdf</a><br>

24. <a href="https://halldweb1.jlab.org/wiki/images/e/ea/2014-02-28-DC2.pdf">https://halldweb1.jlab.org/wiki/images/e/ea/2014-02-28-DC2.pdf</a><br>

<br>

_______________________________________________<br>

Halld-offline mailing list<br>

<a href="mailto:Halld-offline@jlab.org">Halld-offline@jlab.org</a><br>

<a href="https://mailman.jlab.org/mailman/listinfo/halld-offline">https://mailman.jlab.org/mailman/listinfo/halld-offline</a></blockquote></div><br></div>

</div></div></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr">Sean Dobbs<br>Department of Physics & Astronomy <br>Northwestern University<br>phone: <a href="tel:847-467-2826">847-467-2826</a></div>

</div>

&lt;times.pdf&gt;</blockquote>

</div><br></div></blockquote>

</div><br></div></div></div><br></blockquote></div><br></div>