[Halld-offline] Data Challenge Meeting Minutes, March 7, 2014
Mark Ito
marki at jlab.org
Sun Mar 9 21:14:11 EDT 2014
Folks,
Find the minutes below and at
https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_March_7,_2014#Minutes
-- Mark
____________________________________________
GlueX Data Challenge Meeting, March 7, 2014
Minutes
Present:
* CMU: Paul Mattione
* FSU: Aristeidis Tsaris
* JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Beni Zihlmann
* MIT: Justin Stevens
* NU: Sean Dobbs
* UConn: Richard Jones
Find a recording of the meeting [25]here.
Random Number Seed Status
Mark led us through his [26]wiki page with notes from an interview with
David Lawrence. Changes may be coming.
ZFATAL Fix
Richard gave us a few more details about the fix, beyond [27]his email.
There is a maximum of 64,000 pointers for Zebra memory management. We
were hitting this limit at high beam rate with EM background turned on.
Some time ago Beni introduced a change where secondaries are put on the
primary stack in order to track their genealogy. This change has proved
useful. At the same time, Beni also exempted particles produced in
showers in the calorimeters; there are too many of those and their
parentage is generally not of interest. With EM background turned on,
showers in the beam collimators can occur, but before now no exemption
was in place for them. Richard put in this needed exemption.
We agreed that this fixes the ZFATAL issue that Paul had reported
previously.
Short File Issue and Non-Reproducible Results
Paul went through [28]his recent email on non-reproducibility of the
code. This may or may not be related to the short file issue, but needs
addressing in any case.
Chris noted that often results can be non-reproducible due to
off-by-one errors, where uninitialized array elements are accessed by
mistake.
Justin sent an [29]email to the group in which he concludes that
enabling compression of the REST output is indeed correlated with
short REST files. Mark confirmed this in his recent running at JLab.
Richard is working on fixing this problem.
Running at CMU
Paul mentioned that the b1pi test is broken again.
He reported on job run times:
photon rate   events (k)   hours
no EM         50           12
1×10^7        25           24
5×10^7        5            18
We will probably have to cut back on the amount of high intensity
running we do.
Running at NU
Sean has been running jobs with 20 k events and seeing execution times
of 12 to 18 hours. He also sees short REST files.
Running at MIT
Justin reported that more CPUs have been added to the MIT cluster. In
addition, he has started using nodes on FutureGrid, which also uses
OpenStack.
Running at JLab
SciComp will be able to move the nodes that we have been lending to the
High Performance Computing (HPC) cluster back to us with a few days'
notice. Once
we get going we can ask for more nodes from HPC. The plan is to provide
us with 1250 cores.
SciComp is very reluctant to allow us to install the SRM at JLab even
if we provide the manpower. They would need to review the security
properties in order to approve usage and they do not have the manpower
to do that in the near term. They are encouraging us to use Globus
Online to transfer files in and out of JLab. Richard told us that
Globus Online is not appropriate for our use case, in particular doing
transfers in batch mode. Chris asked if the Globus Online command-line
tool provides the needed functionality, but Richard did not think that
it did.
Running at FSU
Aristeidis presented [30]statistics on jobs he has run at FSU. He also
reported some problems with building the latest versions of the code.
Richard suggested that he take a look at gridmake.
Run Number Assignments
Mark showed recent additions to the [31]data challenge conditions page.
He has added run number assignment to the proposed running conditions
and file numbers assignments for the various sites. Richard asked for a
larger file number range for the OSG.
We also decided that the number of events in each run (and thus in each
file) should depend on the conditions at the individual sites; we will
not try to make all files the same size, as we did last time. We will
have to add that additional degree of freedom to our bookkeeping.
Next Meeting
We agreed that in a week from now we will make a go/no-go decision on
whether we are ready to start. If all known problems are solved then
there is no issue; if some remain we will have to discuss whether they
are important enough to delay launch.
References
25. https://halldweb1.jlab.org/talks/2014-1Q/data_challenge_2014-03-07/index.htm
26. https://halldweb1.jlab.org/wiki/index.php/Random_Number_Seeds_Procedure_(as_of_2014-03-07)
27. https://mailman.jlab.org/pipermail/halld-offline/2014-March/001527.html
28. https://mailman.jlab.org/pipermail/halld-offline/2014-March/001536.html
29. https://mailman.jlab.org/pipermail/halld-offline/2014-March/001532.html
30. http://hadron.physics.fsu.edu/~aristeidis/offline_challenge.pdf
31. https://halldweb1.jlab.org/data_challenge/02/conditions/data_challenge_2.html
--
Mark M. Ito, Jefferson Lab, marki at jlab.org, (757)269-5295