[Halld-offline] Offline Software Meeting Minutes, April 16, 2014

Mark Ito marki at jlab.org
Fri Apr 18 10:18:38 EDT 2014


Folks,

Please find the minutes below and at

https://halldweb1.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_April_16,_2014#Minutes

   -- Mark
_____________________________________________

GlueX Offline Meeting, April 16, 2014
Minutes

    Present:
      * CMU: Paul Mattione
      * FSU: Volker Crede, Aristeidis Tsaris
      * IU: Kei Moriya
      * JLab: Mark Ito (chair), David Lawrence, Curtis Meyer, Dmitry
        Romanov, Simon Taylor
      * MIT: Justin Stevens
      * NU: Sean Dobbs
      * UConn: Richard Jones [and others?]

Review of Minutes from the Last Meeting

    We went over the [34]minutes of the April 2 meeting.

    Kei commented on his continued study of REST file sizes and
    reproducibility. He has sent problem files to Paul and Simon and asked
    for feedback. He also indicated that this project may go on the back
    burner for now, given the size of the differences seen thus far.

Data Challenge 2

Data Challenge Meeting Report, April 11

    Curtis recapped the meeting.

    Things were running well, with a very low failure rate. The OSG was
    just starting up; there had been problems with a site in Brazil that
    was accepting jobs which then failed right away. Most of the sites are
    finished or winding down, with the exception of the OSG.

Event Tally Board

    We took a look at the [35]board. We are now at about 5 billion events.
    Note that this is already as many events as we had for data challenge
    1, and these events are several times more expensive in CPU time than
    those were.

Site Status Updates

    MIT. Justin is still running on about 300 cores, which include the
    FutureGrid cores. At some point soon those will have to be returned. He
    has produced about 40 million events over the past few days.

    CMU. Paul has summarized results on [36]his wiki page, in section 5. He
    had only 3 failures in 7,000 jobs. He catalogs the reasons for those
    failures.

    JLab. Mark showed the updated [37]plot of running jobs as a function of
    time for the entire data challenge period. He also showed a [38]plot
    from Sandy Philpott showing all of the jobs on the farm for the past
    three months; the steps in job numbers as nodes were switched from LQCD
    to the farm are clearly visible. Mark also reviewed [39]his message from
    Monday announcing the ramp-down of the data challenge at JLab and the
    return of nodes from the farm to LQCD.

    OSG. We looked at [40]recent job history on the Grid. Richard reported
    that he got a big batch of jobs through, but some of his grid proxies
    were getting stale, so he paused recently and is now starting back up.
    The Purdue site has withdrawn its nodes indefinitely due to events in
    the aftermath of the Heartbleed bug. The GlueX sites (UConn and NU)
    have been operating at 98 to 99% efficiency ("productive" CPU time as a
    fraction of wall time), to be contrasted with the 60% seen in DC-1. On
    some sites glide-ins were advertised to us but the jobs were rejected,
    because the offered proxy had been renamed relative to the one used in
    previous running; this has been cleared up administratively. Support
    from the OSG has been very good. Once jobs start they generally run to
    completion, and in general things have been much smoother than last
    time. Richard reports 2 failures out of 0.5 million jobs. He plans to
    continue running until he gets two or three days of steady-state,
    problem-free running, and he will try to balance our run mix as well.

Kinematic Fitter Update

    Paul led us through [41]his email announcing changes to his analysis
    library and the kinematic fitter in particular. His email has a
    complete description of the changes and interested parties should look
    there for details.

    Kei asked about the recommended procedure for characterizing the thrown
    particle topology. Paul told us that all of the thrown information is
    in the tree, but some navigation by hand is necessary to fully
    understand everything in the decay chain.

    Justin asked about reasonable cuts for matching charged tracks to
    clusters. Paul applies a five-sigma cut by default. This cut has not
    been studied extensively; such studies will be needed to optimize it,
    and the optimum may depend on what one is trying to do.

    Justin also asked for clarification of the matching between the
    requested DReaction and the thrown information. For this there is no
    dependence on reconstructed information.

Data Distribution

    Richard gave us guidance on accessing the data challenge data:
      * the OSG-generated results from DC2 are being stored at /Gluex/test
        on the UConn SRM
      * the location on the Northwestern University SRM is
        /mnt/xrootd/gluex/dc2
      * for instructions on how to access files over SRM, see the
        [42]appropriate section of the howto.

    He emphasized the best practice of never running srmls on a data
    directory; instead, one should fetch the .ls-l file provided in each
    directory. He also told us that [43]XRootD is supported as a data
    transport protocol on the SRM servers.
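
    As a rough illustration of this workflow (not an official recipe), the
    sketch below uses Python to drive xrdcp: it fetches the .ls-l listing
    for the DC2 area and then copies one file by name. The XRootD door host
    name is a placeholder, and the exact namespace layout should be checked
    against the howto linked above.

      # Rough sketch: fetch the .ls-l listing and then one data file over
      # XRootD with xrdcp, rather than running srmls on the data directory.
      # The door host name below is a placeholder, not a real server.
      import subprocess

      XROOTD_DOOR = "root://xrootd.example.edu"  # placeholder XRootD door
      DC2_DIR = "/Gluex/test"                    # DC2 area on the UConn SRM

      def xrdcp(remote_path, local_path):
          """Copy one remote file; remote_path must be absolute."""
          url = XROOTD_DOOR + "/" + remote_path  # yields root://host//path
          subprocess.check_call(["xrdcp", url, local_path])

      # 1. Fetch the pre-generated listing instead of listing the directory.
      xrdcp(DC2_DIR + "/.ls-l", "ls-l.txt")

      # 2. Pull file names out of the 'ls -l'-style listing and fetch one.
      with open("ls-l.txt") as listing:
          names = [line.split()[-1] for line in listing
                   if line.strip() and not line.startswith("total")
                   and not line.split()[-1].startswith(".")]
      if names:
          xrdcp(DC2_DIR + "/" + names[0], names[0])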

    He also proposed that if Globus Online is the only parallel transport
    method supported by JLab, then we should deploy it at non-JLab sites,
    in the spirit of good collaboration. He thought that Globus Online
    Personal would not be compatible with his node at UConn and so a
    license would have to be bought to make it a full-fledged end-point.
    The cost appears not to be prohibitive.

    We discussed continuing to explore options for transport, principally
    Globus Online, the OSG SRM, and raw GridFTP.

Skimming Data Challenge Data

    Paul showed us a [44]list of proposed skims that we could perform on
    the data challenge events. They basically cover the waterfront and are
    grouped into three broad categories: non-strange meson channels,
    strange meson channels, and hyperon channels. With the analysis library
    in place, implementing any of these skims is not a big effort. As he
    states on his page, the cuts may have to be studied before going into
    production.

    Given that we have his technology in hand, the discussion revolved
    around what we should do with it. The input is obviously the
    reconstructed REST data. The two possible approaches (sketched below)
    are:
     1. Do a traditional skim, writing out only selected events to a
        smaller output file.
     2. Implement the EventStore. Then multiple skims can be supported from
        a single set of files.
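
    As a rough conceptual sketch (in Python) of what these two approaches
    amount to, consider the fragment below. It is not GlueX analysis
    library or EventStore code: read_events, passes_skim, and write_event
    are hypothetical stand-ins for the real REST/HDDM I/O and analysis
    cuts.

      # Conceptual sketch only: read_events(), passes_skim(), and
      # write_event() are hypothetical stand-ins, not real GlueX code.
      import json

      def read_events(path):
          """Hypothetical reader: yield (event_number, event) from a REST file."""
          raise NotImplementedError("replace with real REST/HDDM I/O")

      def passes_skim(event, skim_name):
          """Hypothetical selection for one skim, e.g. a hyperon channel."""
          raise NotImplementedError("replace with real analysis-library cuts")

      def write_event(out_path, event):
          """Hypothetical writer appending one event to an output file."""
          raise NotImplementedError("replace with real REST/HDDM I/O")

      # Approach 1: traditional skim -- copy selected events to a smaller
      # output file.
      def traditional_skim(in_path, out_path, skim_name):
          for evno, event in read_events(in_path):
              if passes_skim(event, skim_name):
                  write_event(out_path, event)

      # Approach 2: EventStore-style -- record only (file, event number)
      # pointers, so many skims can share one set of REST files.
      def build_skim_index(in_paths, skim_names, index_path):
          index = {name: [] for name in skim_names}
          for path in in_paths:
              for evno, event in read_events(path):
                  for name in skim_names:
                      if passes_skim(event, name):
                          index[name].append((path, evno))
          with open(index_path, "w") as f:
              json.dump(index, f)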

    We agreed that centralizing this function in the long run would avoid
    duplicated effort and wasted computing resources. We also noted that
    some of this processing could conceivably be done at non-JLab sites,
    since it does not involve shipping around "raw" data.

    We will explore both approaches. Paul will do a few pilot skims as a
    demonstration. Sean will look at the EventStore.

Data Challenge Meetings?

    We decided to suspend the special data challenge meetings and put
    discussion back into the regular bi-weekly offline meeting.

Other Offline Items

    David directed our attention to other items that need work now that the
    production part of the data challenge is over.
     1. Tagger reconstruction is not in HDGeant. Although tagger hits are
        generated there, based on the thrown photon energy (not a detailed
        particle swim), the step that turns a hit back into a photon energy
        has not been done.
     2. The online group is discussing moving online code to a separate
        Subversion repository. The pluses and minuses of such a move have
        to be discussed in both the online and offline working groups.
     3. The software version of the GlueX wiki is getting old: we are
        running 1.17, from 2011, and the latest is 1.22. David will look
        into a version refresh. There is a related issue: changing the
        authentication to the standard JLab LDAP scheme, which seems like a
        good idea. That move may or may not have an influence on how we
        proceed.

References

   34. https://halldweb1.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_April_2,_2014
   35. https://docs.google.com/spreadsheets/d/1qvF9B-76gr8NdsTKsO17jqL0qc5OXqK46JluvXnJ98k/edit?usp=sharing
   36. https://halldweb1.jlab.org/wiki/index.php/CMU_Data_Challenge_2
   37. https://halldweb1.jlab.org/wiki/images/d/d5/Jobs_gluex_04-16.png
   38. https://halldweb1.jlab.org/wiki/images/5/5b/Farm_2014.png
   39. https://mailman.jlab.org/pipermail/halld-offline/2014-April/001652.html
   40. https://halldweb1.jlab.org/wiki/images/a/a5/Grid_jobs.png
   41. https://mailman.jlab.org/pipermail/halld-physics/2014-April/000394.html
   42. https://halldweb1.jlab.org/wiki/index.php/Using_the_Grid#Accessing_stored_data_over_SRM
   43. http://xrootd.org/
   44. https://halldweb1.jlab.org/wiki/index.php/Mattione_Update_04212014

-- 
Mark M. Ito, Jefferson Lab, marki at jlab.org, (757)269-5295



