[Halld-offline] Offline Software Meeting Minutes, May 25, 2016

Mark Ito marki at jlab.org
Wed Jun 1 12:28:23 EDT 2016


Folks,

Find the minutes below and at https://goo.gl/fMao1Y .

   -- Mark
_________________


  GlueX Offline Meeting, May 25, 2016, Minutes

You can view a recording of this meeting <https://bluejeans.com/s/9JCK/> 
on the BlueJeans site.

Present:

  * *CMU*: Curtis Meyer
  * *FIU*: Mahmoud Kamel
  * *IU*: Matt Shepherd
  * *JLab*: Alexander Austregesilo, Amber Boehnlein, Graham Heyes, Mark
    Ito (chair), David Lawrence, Paul Mattione, Sandy Philpott, Nathan
    Sparks, Justin Stevens, Adesh Subedi, Simon Taylor, Chip Watson
  * *MIT*: Christiano Fanelli
  * *NU*: Sean Dobbs


      Review of minutes from the April 27 meeting

  * Reminder: we switch to GCC version 4.8 or greater on June 1.
  * Mark raised the question of whether corrections to data values in
    the RCDB should properly be reported as a software issue on GitHub.
    An alternate forum was not proposed. We left it as something to
    think about.


      Announcement: Write-Through Cache

  * Mark reminded us about Jie Chen's email
    <https://mailman.jlab.org/pipermail/jlab-scicomp-briefs/2016q2/000124.html>
    announcing the switch-over to a write-through cache, replacing the
    read-only cache. The change moves toward symmetrizing operations
    between LQCD and ENP.
  * Mark cautioned us that we should treat this as an interface to the
    tape library, not as infinite "work" disk. This means we should not
    write many small files to it, and we need to think carefully about
    the directory structure in the /mss tree that will result from new
    directories created on the cache disk (see the sketch after this
    list).
  * Chip pointed out that the write-through cache facilitates writing
    data from off-site to the tape library by allowing a pure disk
    access to do the job.
  * The size threshold for writing to tape will be set to zero. Other
    parameters will be documented on the SciComp web pages.
  * At some point we need to do a purge of small files already existing
    on the cache. These will not be a problem for a few months, so we
    have some time to get around to it.
  * There is an ability to write data immediately to tape and optionally
    delete it from the cache.
  * At some point in the future, Paul will give us a talk on
    best-practices for using the write-through cache.
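
As a concrete illustration of the small-file concern, here is a minimal
sketch of bundling many small files into a single archive before placing
it on the cache disk. The paths (a local ./skim_fragments directory and
a /cache/halld/... destination) are assumptions for illustration only,
not the documented SciComp procedure.

    #!/usr/bin/env python
    # Sketch: bundle many small files into one archive before writing to
    # the write-through cache, so that only one (large) file lands on tape.
    # All paths below are hypothetical examples.

    import os
    import tarfile

    SMALL_FILE_DIR = "./skim_fragments"  # many small local files
    CACHE_DEST_DIR = "/cache/halld/RunPeriod-2016-02/skims"  # assumed cache path
    ARCHIVE_NAME = "skim_fragments_bundle.tar.gz"

    def bundle_and_stage(src_dir, dest_dir, archive_name):
        """Tar up the contents of src_dir and write one archive to dest_dir."""
        os.makedirs(dest_dir, exist_ok=True)
        dest_path = os.path.join(dest_dir, archive_name)
        with tarfile.open(dest_path, "w:gz") as tar:
            for name in sorted(os.listdir(src_dir)):
                tar.add(os.path.join(src_dir, name), arcname=name)
        return dest_path

    if __name__ == "__main__":
        path = bundle_and_stage(SMALL_FILE_DIR, CACHE_DEST_DIR, ARCHIVE_NAME)
        print("Staged single archive to cache:", path)

Writing one archive per run (or per skim) keeps the resulting /mss
directory structure shallow and avoids filling the tape catalog with
tiny entries.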


      Copying data to off-site locations


        Data Copying

We went over the recent email thread on the bandwidth available for 
transferring data off-site ("expected bandwidth offsite") 
<https://mailman.jlab.org/pipermail/halld-offline/2016-May/thread.html#start>. 


  * Matt had started the thread by asking what the maximum possible
    transfer rate might be.
  * Concern was expressed that if we truly wanted to go as fast as
    possible, that might cause back-ups on-site for data transfers
    related to farm operation.
  * Numbers from Chip:
      o network pipe to the outside: 10 Gbit/s
      o tape bandwidth: 2 GByte/s
      o disk access: 10-20 GByte/s
  * If collaborators were to try to pull data (indirectly) from tape in
    a way that saturated the outgoing network, that could use roughly
    half of the tape bandwidth (see the arithmetic sketch after this
    list). However, that scenario is not very likely. The interest is in
    REST data, and those data are produced at a rate much, much less
    than the full bandwidth of the tape system. We agreed that an
    average rate of 100 MB per second is sufficient for the
    collaboration and easily provided given the current configuration at
    the Lab. Special measures to throttle use will probably not be
    necessary.
  * Matt's offer of resources that he gets essentially for free from IU
    could be used to create an off-site staging area from which other
    collaborating institutions can draw the data. This could reduce the
    demand on the Lab by a factor of several.
  * These points were largely settled in the course of the email discussion.
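
For reference, a minimal arithmetic sketch, using only the numbers
quoted above, behind the "roughly half" statement and the agreed
100 MB/s rate:

    # Back-of-the-envelope check of the bandwidth numbers quoted above.

    network_out_gbit_s = 10.0    # outside network pipe, Gbit/s
    tape_gbyte_s = 2.0           # tape bandwidth, GByte/s
    agreed_rate_mbyte_s = 100.0  # agreed sustained export rate, MB/s

    # A saturated 10 Gbit/s link moves about 1.25 GByte/s ...
    network_out_gbyte_s = network_out_gbit_s / 8.0
    # ... which is roughly half of the 2 GByte/s tape bandwidth.
    fraction_of_tape = network_out_gbyte_s / tape_gbyte_s

    # The agreed 100 MB/s export rate corresponds to about 8.6 TB per day.
    tb_per_day = agreed_rate_mbyte_s * 86400 / 1.0e6

    print("Saturated outgoing link: %.2f GByte/s (%.0f%% of tape bandwidth)"
          % (network_out_gbyte_s, 100 * fraction_of_tape))
    print("Agreed export rate: %.1f TB/day" % tb_per_day)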


        Discussion with OSG Folks

Amber, Mark, and Richard Jones had a video-conference on Monday (May 23) 
with Frank Wuerthwein and Rob Gardner of the Open Science Grid to 
discuss strengthening OSG-related capabilities at JLab.

  * Frank proposed working on two issues: job submission from JLab and
    data transfer in and out of the Lab using OSG resources and tools.
    Initially he proposed getting the job submission going first, but
    after hearing about our need to ship REST data off-site, modified
    the proposal to give the efforts equal weight.
  * Richard was able to fill in the OSG guys on what was already in
    place and what has been done in the past with the GlueX Virtual
    Organization (VO).
  * For data transfer, XROOTD was identified as a likely technology to
    use, even though REST data are not ROOT-based.
  * Rob and Frank will go away and think about firming up details of the
    proposal in both areas given their understanding of our
    requirements. They will get back to us when they have a formulation.


        Future Practices for Data Transfer

Amber encouraged us to think about long-term solutions to the data 
transfer problem, rather than focusing exclusively on the "current 
crisis". In particular we should be thinking about solutions that are 
(a) appropriate for the other Halls and (b) that have robust support in 
the HENP community generally. The LHC community has had proven success 
in this area and we are in a good position to leverage their experience.

Doing this will require a more detailed specification of requirements 
than we have produced thus far, as well as communication of those 
requirements among the Halls. With this planning in place, possible 
modest improvements in infrastructure and in on-site support of 
activities can proceed with confidence.

We all agreed that this was a pretty good idea.


      Disk space requirements for Spring 2016 data

Sean discussed the disk space being taken up by skims from the recent 
data run. See his table 
<https://halldweb.jlab.org/wiki/index.php/Calibration_Train#Run_Groups> 
for the numbers (note: the FCAL and BCAL column headings should be 
reversed).

  * Currently, all of the pair spectrometer skims are pinned. Files need
    to be present on the disk before a Globus Online request for copy
    can be made against them. Richard is trying to get them all to UConn.
  * The FCAL and BCAL skims have likely been wiped off the disk by the
    auto-deletion algorithm.
      o Both can be reduced in size by dropping some of the tags. Sean
        has code to do that, but it needs more testing before being put
        into production.
  * The total size of all skims is about 10% of the raw data.

We discussed various aspects of managing our Lustre disk space. It 
looks like we will need about 200 TB of space for this summer, counting 
volatile, cache, and work. The system is new to us, and we have some 
learning to do before we can use it efficiently.
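
As a possible starting point for keeping an eye on that usage, here is a
minimal sketch that tallies disk usage under a few top-level areas. The
paths (/volatile/halld, /cache/halld, /work/halld) are assumed names for
illustration; the real Lustre layout may differ, and SciComp utilities
may already report these numbers.

    # Sketch: tally disk usage under a few top-level areas.
    # The paths below are assumed for illustration; a full walk of a
    # large Lustre tree can be slow, so treat this as a rough tool only.

    import os

    AREAS = ["/volatile/halld", "/cache/halld", "/work/halld"]

    def tree_size_bytes(top):
        """Sum the sizes of all regular files below 'top'."""
        total = 0
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        return total

    if __name__ == "__main__":
        grand_total_tb = 0.0
        for area in AREAS:
            size_tb = tree_size_bytes(area) / 1.0e12
            grand_total_tb += size_tb
            print("%-20s %8.2f TB" % (area, size_tb))
        print("%-20s %8.2f TB" % ("total", grand_total_tb))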


      Spring 2016 Run, Processing Plans

Paul noted that this was discussed fully at yesterday's Analysis 
Meeting. In summary we will produce:

  * reconstructed data
  * skims of raw data
  * TTrees
  * EventStore meta-data

In addition, the job submission scripts are in need of some re-writing.


        Meta-data for this processing launch

Sean presented a system for classifying and documenting key aspects of 
the various data sets that we will have to handle. He guided us through 
a web page he put together that displays the information 
<https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/dataVersions.py>. 
It has a pull-down menu to choose the run period being queried. There is 
a legend at the bottom that describes each of the fields.

This system has already been in use for the monitoring launch, and the 
information is used in the monitoring web pages to navigate the 
different versions of those launches. The data will also be used to 
correlate data sets with the EventStore. Sean is proposing that the data 
version string be written into each REST file so that there is a two-way 
link between EventStore meta-data and the data itself.

Mark suggested that we might want to have a more formal relational 
database structure for the information. This would require some 
re-writing of a working system but may be worth the effort.
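
As one possible shape for the relational structure Mark suggested, here
is a minimal sketch of a schema in SQLite. The table and column names
(data_versions, run_period, data_type, revision, software_version,
version_string) are assumptions inferred from the fields described
above, not the actual layout of Sean's system.

    # Sketch of a possible relational layout for the data-version information.
    # Table and column names are illustrative guesses, not a production schema.

    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS data_versions (
        id               INTEGER PRIMARY KEY,
        run_period       TEXT NOT NULL,     -- e.g. 'RunPeriod-2016-02'
        data_type        TEXT NOT NULL,     -- e.g. 'rest', 'ttree', 'skim'
        revision         INTEGER NOT NULL,  -- launch/pass number
        software_version TEXT,              -- e.g. sim-recon tag or commit
        version_string   TEXT UNIQUE,       -- string proposed for embedding in REST files
        creation_date    TEXT,
        comment          TEXT
    );
    """

    def create_db(path="data_versions.sqlite"):
        conn = sqlite3.connect(path)
        conn.executescript(SCHEMA)
        conn.commit()
        return conn

    if __name__ == "__main__":
        conn = create_db()
        conn.execute(
            "INSERT OR IGNORE INTO data_versions "
            "(run_period, data_type, revision, version_string, creation_date) "
            "VALUES (?, ?, ?, ?, ?)",
            ("RunPeriod-2016-02", "rest", 1,
             "recon_RunPeriod-2016-02_ver01", "2016-05-25"),
        )
        conn.commit()
        for row in conn.execute(
                "SELECT run_period, data_type, revision, version_string "
                "FROM data_versions"):
            print(row)

A unique version string per row would support the two-way link Sean
proposed: the string stored in a REST file resolves to exactly one row
here.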

Sean has written a GlueX Note motivating and documenting the system 
<http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=3062>. 



      Negative Parity in Tracking code

David is working on new code to parse the EVIO data. In the course of 
that work he was comparing results between the old and new parsers and 
noticed some small differences. See his slides 
<https://halldweb.jlab.org/wiki/images/f/fc/20160525_lawrence_tracking.pdf> 
for all of the details. He tracked these down to several issues, some of 
which have to do with the new parser presenting hits to the 
reconstruction code in a different order than the old parser did.

The issues were:

 1. In the track-finding code, there is an assumption about the order of
    hits, although that order was never enforced in the code. That was
    causing a variance, used as a cut criterion for including hits on a
    track, to be calculated on different numbers of hits depending on
    the order (see the sketch after this list).
 2. In FDC pseudo hit creation the "first hit" on a wire was chosen for
    inclusion. That choice was manifestly hit-order dependent.
 3. There was a bug in the FDC cathode clustering code that caused
    multiple hits on a single cathode to be split across different
    clusters. The bug manifested itself in a way that depended on hit
    order.
 4. For some events, a perfectly fine start counter hit was ignored in
    determining the drift start time for a track. That was a result of
    the reference trajectory in the fringe-field region being calculated
    using a field value that depended on the recycling history of the
    reference trajectory object. In certain cases a bad, left-over field
    value would give a trajectory whose intersection with the projection
    of the start counter plane in the r-phi view was unacceptably far
    downstream in z of the physical start counter.

These have all been fixed on a branch. David will look at moving the 
changes onto the master branch.


      HDGeant4 Workflow

At the collaboration meeting Mark talked to Richard about developing a 
new work flow for HDGeant4 that would allow early testing by 
collaborators as well as controlled contributions. Mark reported that 
they agreed to a plan in principle 
<https://docs.google.com/document/d/1bjMoszd5fJuiuh3DDJGgnTOnxFowObYSsP6ndVhOmF0/edit?usp=sharing> 
last week.

Note that this work flow is not the same as the one we have been using 
for sim-recon; in particular, it will require contributors to compose 
pull requests based on a private clone of the HDGeant4 repository on 
GitHub rather than on a branch of the "Jefferson Lab" repository. Most 
collaborators will not have the privileges needed to create such a 
branch on the JLab repository.


-- 
marki at jlab.org, (757)269-5295
