[Halld-offline] Offline Software Meeting, April 2, 2014

Mark Ito marki at jlab.org
Fri Apr 4 09:56:25 EDT 2014


Folks,

Find the minutes below and at

https://halldweb1.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_April_2,_2014#Minutes

   -- Mark
______________________________________________________

GlueX Offline Meeting, April 2, 2014
Minutes

    Present:
      * CMU: Paul Mattione, Curtis Meyer
      * IU: Kei Moriya
      * JLab: Mark Ito (chair), Sandy Philpott, Dmitry Romanov, Simon
        Taylor, Beni Zihlmann
      * MIT: Justin Stevens
      * NU: Sean Dobbs

Review of Minutes from the Last Meeting

    We looked over the [24]minutes from March 19. Sean has done some work
    wrapping HDDM calls for use with Python, as part of exploring the
    use of EventStore.

Data Challenge Meeting Report, March 28

    We looked over [25]these minutes as well. Some of Mark's comments
    (see below) addressed issues raised last Friday.

Plot of Running DC2 Jobs as a Function of Time at JLab

    Mark showed a plot:

    [26]Jobs gluex.png

    We have borrowed another 1000 cores from the LQCD farm, bringing our
    share up to about 4000 cores. This last slug came in over the last
    couple of days.

    The large fluctuations arise because the farm scheduler cannot
    account for a user's usage until jobs end. At start-up, if jobs take
    24 hours to run (as these do), then during that initial period the
    usage from those jobs is counted as zero. During this period the
    user's priority is also boosted, to make up for the apparent lack of
    usage in the recent past. Once the jobs are done and all of that
    usage is accounted for, the user appears to be over quota and gets
    turned off for a while, and so on. This turns out to be completely
    normal behavior for the system given long-standing parameter
    settings.
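    The oscillation described above can be illustrated with a toy
    simulation. This is a minimal sketch only: the quota, decay rate,
    and decision rule are illustrative assumptions, not the actual JLab
    farm scheduler parameters.

    ```python
    # Toy model of the delayed-accounting oscillation: the scheduler only
    # credits a job's usage when the job ENDS, so a user running long jobs
    # swings between "no recorded usage, priority boosted" and "suddenly
    # over quota, turned off". All parameters here are illustrative, not
    # the actual JLab farm settings.

    JOB_HOURS = 24     # jobs run about 24 hours, as the DC2 jobs do
    QUOTA = 1000       # credited core-hours above which the user is paused
    CORES = 100        # cores the user fills whenever allowed to run

    def simulate(hours):
        running = []       # end times of the user's in-flight jobs
        credited = 0.0     # usage the scheduler has accounted for so far
        history = []       # (hour, jobs running, allowed to run?)
        for t in range(hours):
            # Usage is credited only for jobs that have just finished.
            credited += sum(JOB_HOURS for end in running if end <= t)
            running = [end for end in running if end > t]
            allowed = credited < QUOTA
            if allowed and not running:
                running = [t + JOB_HOURS] * CORES   # launch a slug of jobs
            history.append((t, len(running), allowed))
            credited *= 0.97   # credited usage decays, so the user recovers
        return history

    history = simulate(200)
    phases = [allowed for _, _, allowed in history]
    print("allowed at t=0:", phases[0])
    print("allowed just after the first jobs finish:", phases[25])
    ```

    The user runs at full tilt for the first day, is shut off when the
    first jobs complete and their usage lands all at once, and comes
    back once the credited usage decays, reproducing the sawtooth in
    the plot.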

Comments on DC2 Issues

    Mark led us through his wiki page, commenting on three topics:
     1. Monitoring quality of the current data challenge
           + We decided to look by hand at the monitoring histograms we
             are producing, examining every 1000th job. Simon will do
             the looking at JLab. Sean will share a script he has
             written to compare histograms to standards; this should
             help.
     2. File transfers in and out of JLab
           + Sean thought that the Globus Online options would not work
             for pushing files to SRM-capable sites. He thought that
             the SRM client tools would be sufficient if they could be
             installed at JLab. He also suggested that we look into raw
             GridFTP (as Chip Watson has suggested in the past).
     3. Event Tally Board
           + We agreed to maintain a [27]Data Challenge 2 Event Tally Board
             to keep track of progress.
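    The histogram check in item 1 amounts to comparing each sampled
    job's monitoring histograms against a reference. A minimal sketch
    of such a comparison is below; Sean's actual script is not shown
    here, and the normalization, error model, and chi-square cut are
    illustrative assumptions.

    ```python
    def compare_histograms(test, reference, max_reduced_chi2=2.0):
        """Compare a job's monitoring histogram to a reference histogram.

        Both arguments are lists of bin counts with identical binning.
        Returns (reduced chi-square, passed?). The normalization, error
        model, and cut are illustrative choices only.
        """
        if len(test) != len(reference):
            raise ValueError("histograms must have the same binning")
        # Normalize so that different event counts per job do not matter.
        scale = sum(test) / sum(reference) if sum(reference) else 1.0
        chi2, ndf = 0.0, 0
        for t, r in zip(test, reference):
            expected = r * scale
            if t == 0 and expected == 0:
                continue          # skip bins empty in both histograms
            # Poisson-like variance on the observed and scaled reference bins.
            sigma2 = (t + r * scale * scale) or 1.0
            chi2 += (t - expected) ** 2 / sigma2
            ndf += 1
        reduced = chi2 / ndf if ndf else 0.0
        return reduced, reduced <= max_reduced_chi2

    ref  = [10, 50, 120, 50, 10]
    good = [12, 48, 118, 52, 9]   # statistical wiggle only
    bad  = [12, 48, 30, 52, 9]    # one bin has collapsed
    print("good passes:", compare_histograms(good, ref)[1])
    print("bad passes: ", compare_histograms(bad, ref)[1])
    ```

    A check like this flags jobs whose histograms deviate beyond
    statistical fluctuations, so the by-hand inspection can focus on
    the flagged cases.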

Returning Nodes to LQCD

    We had a brief discussion on how long we should keep using the
    nodes we have borrowed from LQCD. We still have a substantial
    balance on the amount owed to Physics from the December-March loan
    to LQCD. Curtis pointed out that we have already hit a 4500-job
    milestone, exceeding the benchmark of 1250 cores that had been set
    for us. Mark pointed out that the cores are all doing useful work.
    The OSG "site" has not come online yet. Given that the OSG
    contributed 80% of the cycles for the last data challenge, it is
    hard to say where we are now.

    We did not come to a firm decision but will have to revisit this
    every few days. For now we continue to run with the 4000 total
    cores.

REST Filesizes and Reproducibility

    Kei presented recent studies he has done comparing repeated
    reconstruction runs on the same smeared event file. See [28]his slides
    for details. His slides covered:
      * Output file sizes
      * A bad log file (hdgeant)
      * Run info (cpu time, virtual memory)
      * File size correlation (mcsmear, iteration to iteration)
      * File size correlation (REST, iteration to iteration)
      * File size correlation at IU
      * hd_dump of factories (comparison of iterations)
      * Single different event
      * File size correlation with CMU (REST)

    We remarked that the numerical differences observed are truly in
    the round-off-error regime. We also thought it odd that identical
    runs at IU and CMU should differ by as much as 1% in file size.
    Progress from this point looks difficult given how small the
    reported differences are. Kei will continue his studies.
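    A value-by-value comparison of the kind Kei describes, deciding
    whether differences between two reconstruction passes are in the
    round-off regime, could be sketched as follows. The relative
    tolerance (a few orders above double-precision epsilon) is an
    illustrative assumption, not the cut used in the actual study.

    ```python
    import math

    # Classify element-by-element differences between two reconstruction
    # passes as round-off noise or genuine discrepancies. The relative
    # tolerance below is an illustrative choice, not the study's cut.
    REL_TOL = 1e-13

    def classify_differences(pass1, pass2, rel_tol=REL_TOL):
        roundoff, genuine = 0, 0
        for a, b in zip(pass1, pass2):
            if a == b:
                continue                       # bit-identical values
            if math.isclose(a, b, rel_tol=rel_tol):
                roundoff += 1                  # last-bit-level difference
            else:
                genuine += 1                   # something real changed
        return roundoff, genuine

    run1 = [1.0, 2.5, 3.141592653589793, 4.0]
    run2 = [1.0, 2.5000000000000004, 3.141592653589793, 4.5]
    # One value differs in the last bit, one differs genuinely.
    print(classify_differences(run1, run2))
    ```

    With a tally like this one can check whether the iteration-to-
    iteration and site-to-site differences are all in the round-off
    category, or whether some are large enough to point at a real
    divergence.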

Next Data Challenge Meeting

    We agreed to [29]meet again on Friday to update tallies and discuss
    schedule.


References

   24. https://halldweb1.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_March_19,_2014#Minutes
   25. https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_March_28,_2014#Minutes
   26. https://halldweb1.jlab.org/wiki/index.php/File:Jobs_gluex.png
   27. https://halldweb1.jlab.org/wiki/index.php/Data_Challenge_2_Event_Tally_Board
   28. https://halldweb1.jlab.org/wiki/images/4/4e/2014-04-02-dc2.pdf
   29. https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_April_4,_2014

-- 
Mark M. Ito, Jefferson Lab, marki at jlab.org, (757)269-5295



