[Halld-offline] Software Meeting Minutes, March 30, 2021

Mark Ito marki at jlab.org
Tue Mar 30 20:40:55 EDT 2021


Folks,

Please find the minutes here 
<https://halldweb.jlab.org/wiki/index.php/GlueX_Software_Meeting,_March_30,_2021#Minutes> 
and below.

   -- Mark

      _______________________________________________


    GlueX Software Meeting, March 30, 2021, Minutes

Present: Alexander Austregesilo, Thomas Britton, Sean Dobbs, Mark Ito 
(chair), Igal Jaegle, David Lawrence, Justin Stevens, Simon Taylor, 
Nilanga Wickramaarachchi, Beni Zihlmann

There is a recording of this meeting 
<https://bluejeans.com/s/ZrnFl4s1d9M/>. Log into the BlueJeans site 
<https://jlab.bluejeans.com> first to gain access (use your JLab 
credentials).


      Announcements

 1. SciComp Issue Tracking
    <https://halldweb.jlab.org/wiki-private/index.php/SciComp_Issue_Tracking>.
    Sean has put up a wiki page to collect problem reports regarding the
    farm, ifarm, and other SciComp resources at JLab.
      * Alex reported on a problem that has been fixed: SWIF jobs have
        recently been failing at a high rate (10-20% of jobs). The cause
        was traced back to slow database access. Jobs that had finished
        properly were timing out while trying to transmit their state to
        the database and thus were getting marked as "SWIF system
        error". Chris Larrieu discovered a way to increase the speed of
        the database queries by a factor of twelve, eliminating the
        time-outs and greatly improving the success rate. Alex will
        follow up with Chris and get more information on the miracle cure.
 2. New version set: version_4.37.0.xml
    <https://mailman.jlab.org/pipermail/halld-offline/2021-March/008497.html>
    The new version set came out Sunday. Note that HDGeant4 now has the
    fix to the calculation of DOCA in the FDC.


      Review of Minutes from the Last Software Meeting

We went over the minutes from the meeting on March 16 
<https://halldweb.jlab.org/wiki/index.php/GlueX_Software_Meeting,_March_16,_2021#Minutes>. 
Mark pointed out that Alex highlighted his exploitation of AmpTools' 
ability to use GPUs in his talk at the recent Exotic Search Review. 
Thomas reported that the purchase order for the new farm nodes, 
including those with GPUs on-board, has been signed.


      Minutes from the Last HDGeant4 Meeting

We went over the minutes from the meeting on March 23 
<https://halldweb.jlab.org/wiki/index.php/HDGeant4_Meeting,_March_23,_2021#Minutes>. 
There was some sentiment for closing the issue of drift distances in the 
FDC, but we did not reach a firm decision about where to continue 
discussion of agreement with data.


      OSG Issues

Thomas reported on a recent fix to a problem with job submission to the 
OSG that has been with us for over a year. Back then, job execution 
ground to a halt because of a large number of idle jobs, and a cap of 
1,000 idle jobs was imposed. The root problem was not understood. 
Recently the cap was lifted, and for a period three weeks ago the number 
of CLAS12 jobs in the idle state ballooned to 70,000. All monitoring and 
job progress stopped, and queries to Condor would not return. Thomas 
traced the problem to the jobs writing their Condor logs to volatile, 
generating many small disk accesses to Lustre and rendering the system 
unresponsive. All jobs going from the JLab submit host 
(scosg16.jlab.org) were affected. The solution was to dismount all 
Lustre disk systems from scosg16.

Things are fine now, but there is a backlog of GlueX jobs that are still 
working their way through the system.

Thomas also mentioned two areas for controlling OSG jobs and their 
relative priority.

 1. There is a priority mechanism built into MCwrapper. Requests for
    adjustment should be directed to Thomas.
 2. After discussions with Justin and others, Thomas is instituting an
    upper limit of 250 million events per MCwrapper project. This will
    prevent inadvertent submissions from swamping the system and allow
    for the work-around of submitting more than one project if more
    events are needed.
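
As a rough illustration of the work-around in item 2, the sketch below
splits a large request into several projects that each stay at or under
the 250 million event limit. The function name split_request and the
even-split strategy are assumptions made for this example; they are not
part of MCwrapper.

```python
# Hypothetical helper, not part of MCwrapper: split a large event
# request into several projects that each respect the per-project cap.
MAX_EVENTS_PER_PROJECT = 250_000_000  # limit quoted in the minutes


def split_request(total_events, cap=MAX_EVENTS_PER_PROJECT):
    """Return a list of per-project event counts, none larger than cap."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # Integer ceiling division gives the minimum number of projects.
    n_projects = -(-total_events // cap)
    # Spread the events as evenly as possible over those projects.
    base, extra = divmod(total_events, n_projects)
    return [base + 1 if i < extra else base for i in range(n_projects)]


if __name__ == "__main__":
    for n_events in split_request(600_000_000):
        print(n_events)
```

A request for 600 million events would be submitted as three projects of
200 million events each under this scheme.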


      Software Testing Discussion

Mark reminded us where some of the existing documentation on our test 
procedures resides, linked from the Offline Software wiki page 
<https://halldweb.jlab.org/wiki/index.php/GlueX_Offline_Software#Testing_and_Debugging>. 
He also went through the list of items we discussed at the Software 
Meeting on February 2 
<https://halldweb.jlab.org/wiki/index.php/GlueX_Software_Meeting,_February_2,_2021#Standardized_Tests>. 
We had a somewhat inconclusive discussion on how to make progress on the 
quality and quantity of our testing regime. Mark suggested a meeting of 
a small group of us to frame the issue. Justin cautioned us that there 
was a lot on our plate with the upcoming APS Meeting and that a portion 
of a Software Meeting after the APS Meeting might be a good place to 
come up with a plan.


      Review of recent issues and pull requests

Mark called our attention to halld_sim Issue #190 
<https://github.com/JeffersonLab/halld_sim/issues/190>, "Run-to-run 
efficiency variation in 2018 run periods". It will be discussed at 
tomorrow's Production and Analysis meeting, so we did not discuss it 
directly.

On a related point, Justin mentioned that there were several recent 
changes that should be completed before new Monte Carlo is produced.

 1. Tagger energy assignment improvement.
      * Reconstruction-launch-compatible halld_recon versions need
        patches to apply the new scheme.
      * REST data sets from previous launches need a new reader to undo
        the old scheme and apply the new one.
 2. Fix to FDC efficiency in Spring 2018.
      * This is the issue mentioned above.
 3. Additional random trigger file skim.
      * Sean reported that there will be an effort to fill in some of
        the gaps in our coverage of runs with corresponding random triggers.

Mark mentioned two of his favorite recent outstanding pull requests:

  * Diracxx: Introduce two make variables: #2
    <https://github.com/JeffersonLab/Diracxx/pull/2>. This will allow
    Diracxx to build on more advanced distributions like CentOS 8 and
    Ubuntu 20.
  * gluex_root_analysis: Top level make mmi #147
    <https://github.com/JeffersonLab/gluex_root_analysis/pull/147>. This
    is a reworking of the build system to use a makefile at all levels
    rather than a mixture of makefiles and shell scripts. The old mixed
    system would not halt when errors were generated in the build. Also,
    the dependence of the build on ROOT_ANALYSIS_HOME was removed, making it
    easier to build a local version of the package. [Added in press:
    Alex tested and merged the pull request soon after the meeting.]
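
The failure mode mentioned for the old gluex_root_analysis build, a
script that keeps going after a step has failed, can be illustrated with
a small generic sketch. This is not the actual build code; the steps
here are hypothetical stand-ins (trivial Python subprocesses) and
run_steps is a name invented for this example.

```python
import subprocess
import sys


def run_steps(steps, halt_on_error):
    """Run each command in order and return the exit codes that were seen."""
    codes = []
    for cmd in steps:
        result = subprocess.run(cmd)
        codes.append(result.returncode)
        if halt_on_error and result.returncode != 0:
            break  # stop at the first failing step, as make does by default
    return codes


# Three hypothetical build steps; the second one fails.
ok = [sys.executable, "-c", "pass"]
fail = [sys.executable, "-c", "import sys; sys.exit(1)"]
steps = [ok, fail, ok]

if __name__ == "__main__":
    print(run_steps(steps, halt_on_error=False))  # all three steps run
    print(run_steps(steps, halt_on_error=True))   # stops after the failure
```

With halt_on_error=False the third step still runs after the failure,
which is how a build can appear to finish even though an earlier step
went wrong.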


      The Work Disk is Full

Simon broke the news to us. The proximate cause is the mysterious 
fluctuation of our quota on the disk server. See the red points in the 
plot below:

Work disk 2021-03-30.png 
<https://halldweb.jlab.org/wiki/index.php/File:Work_disk_2021-03-30.png>


      Action Item Review

 1. Make sure that the automatic tests of HDGeant4 pull requests have
    been fully implemented. (Mark I., Sean)
 2. Finish conversion of halld_recon to use JANA2. (Nathan)
 3. Release CCDB 2.0 (Dmitry, Mark I.)

