[Halld-offline] Data Challenge Meeting Minutes, March 7, 2014
Mark Ito
marki at jlab.org
Sun Mar 9 21:14:11 EDT 2014
Folks,
Find the minutes below and at
https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_March_7,_2014#Minutes
-- Mark
____________________________________________
GlueX Data Challenge Meeting, March 7, 2014
Minutes
Present:
* CMU: Paul Mattione
* FSU: Aristeidis Tsaris
* JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Beni Zihlmann
* MIT: Justin Stevens
* NU: Sean Dobbs
* UConn: Richard Jones
Find a recording of the meeting [25]here.
Random Number Seed Status
Mark led us through his [26]wiki page with notes from an interview with
David Lawrence. Changes may be coming.
ZFATAL Fix
Richard gave us a few more details about the fix, beyond [27]his email.
There is a maximum of 64,000 pointers for Zebra memory management. We
were hitting this limit at high beam rate with EM background turned on.
Some time ago Beni introduced a change where secondaries are put on the
primary stack in order to track their genealogy. This change has proved
useful. At the same time, Beni also exempted particles produced in
showers in the calorimeters; there are too many of those and their
parentage is generally not of interest. With EM background turned on,
showers in the beam collimators can occur, but before now no exemption
was in place for them. Richard put in this needed exemption.
We agreed that this fixes the ZFATAL issue that Paul had reported
previously.
Short File Issue and Non-Reproducible Results
Paul went through [28]his recent email on non-reproducibility of the
code. This may or may not be related to the short file issue, but needs
addressing in any case.
Chris noted that often results can be non-reproducible due to
off-by-one errors, where uninitialized array elements are accessed by
mistake.
Justin sent an [29]email to the group in which he concludes that
enabling compression of the REST output is indeed correlated with
short REST files. Mark confirmed this in his recent running at JLab.
Richard is working on fixing this problem.
Running at CMU
Paul mentioned that the b1pi test is broken again.
He reported on job run times:
photon rate   events (k)   hours
no EM         50           12
1×10^7        25           24
5×10^7        5            18
We will probably have to cut back on the amount of high intensity
running we do.
Running at NU
Sean has been running jobs with 20 k events and seeing execution times
of 12 to 18 hours. He also sees short REST files.
Running at MIT
Justin reported that more CPUs have been added to the MIT cluster. In
addition, he has started using nodes on FutureGrid, which also uses
OpenStack.
Running at JLab
SciComp will be able to move the nodes that we have been lending to the
High Performance Computing (HPC) cluster back to us with a few days'
notice. Once
we get going we can ask for more nodes from HPC. The plan is to provide
us with 1250 cores.
SciComp is very reluctant to allow us to install the SRM at JLab even
if we provide the manpower. They would need to review the security
properties in order to approve usage and they do not have the manpower
to do that in the near term. They are encouraging us to use Globus
Online to transfer files in and out of JLab. Richard told us that
Globus Online is not appropriate for our use case, in particular doing
transfers in batch mode. Chris asked if the Globus Online command-line
tool provides the needed functionality, but Richard did not think that
it did.
Running at FSU
Aristeidis presented [30]statistics on jobs he has run at FSU. He also
reported some problems with building the latest versions of the code.
Richard suggested that he take a look at gridmake.
Run Number Assignments
Mark showed recent additions to the [31]data challenge conditions page.
He has added run number assignment to the proposed running conditions
and file numbers assignments for the various sites. Richard asked for a
larger file number range for the OSG.
We also decided that the number of events in each run (and thus in each
file) should depend on the conditions at the individual sites; we will
not try to make all files the same size, as we did last time. We will
have to add that additional degree of freedom to our bookkeeping.
Next Meeting
We agreed that in a week from now we will make a go/no-go decision on
whether we are ready to start. If all known problems are solved then
there is no issue; if some remain we will have to discuss whether they
are important enough to delay launch.
References
25. https://halldweb1.jlab.org/talks/2014-1Q/data_challenge_2014-03-07/index.htm
26. https://halldweb1.jlab.org/wiki/index.php/Random_Number_Seeds_Procedure_(as_of_2014-03-07)
27. https://mailman.jlab.org/pipermail/halld-offline/2014-March/001527.html
28. https://mailman.jlab.org/pipermail/halld-offline/2014-March/001536.html
29. https://mailman.jlab.org/pipermail/halld-offline/2014-March/001532.html
30. http://hadron.physics.fsu.edu/~aristeidis/offline_challenge.pdf
31. https://halldweb1.jlab.org/data_challenge/02/conditions/data_challenge_2.html
--
Mark M. Ito, Jefferson Lab, marki at jlab.org, (757)269-5295