<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Folks,</p>
<p>Please find the minutes <a moz-do-not-send="true"
href="https://halldweb.jlab.org/wiki/index.php/GlueX_Software_Meeting,_March_30,_2021#Minutes">here</a>
and below.</p>
<p> -- Mark</p>
<p> _______________________________________________</p>
<p>
</p>
<div id="globalWrapper">
<div id="column-content">
<div id="content" class="mw-body" role="main">
<h2 id="firstHeading" class="firstHeading" lang="en"><span
dir="auto">GlueX Software Meeting, March 30, 2021, </span><span
class="mw-headline" id="Minutes">Minutes</span></h2>
<div id="bodyContent" class="mw-body-content">
<div id="mw-content-text" dir="ltr" class="mw-content-ltr"
lang="en">
<p>Present: Alexander Austregesilo, Thomas Britton, Sean
Dobbs, Mark Ito (chair), Igal Jaegle, David Lawrence,
Justin Stevens, Simon Taylor, Nilanga Wickramaarachchi,
Beni Zihlmann
</p>
<p>There is a <a rel="nofollow" class="external text"
href="https://bluejeans.com/s/ZrnFl4s1d9M/">recording
of this meeting</a>. Log into the <a rel="nofollow"
class="external text"
href="https://jlab.bluejeans.com">BlueJeans site</a>
first to gain access (use your JLab credentials).
</p>
<h3><span class="mw-headline" id="Announcements">Announcements</span></h3>
<ol>
<li> <a rel="nofollow" class="external text"
href="https://halldweb.jlab.org/wiki-private/index.php/SciComp_Issue_Tracking">SciComp
Issue Tracking</a>. Sean has put up a wiki page to
collect problem reports regarding the farm, ifarm, and
other SciComp resources at JLab.
<ul>
<li> Alex reported on a problem that has since been fixed:
                SWIF jobs had recently been failing at a high
                rate (10-20% of jobs). The cause was traced back
                to slow database access. Jobs that had finished
                properly were timing out while trying to transmit
                their state to the database and thus were getting
marked as "SWIF system error". Chris Larrieu
discovered a way to increase the speed of the
database queries by a factor of twelve,
eliminating the time-outs and greatly improving
the success rate. Alex will follow up with Chris
and get more information on the miracle cure.</li>
</ul>
</li>
<li> <a rel="nofollow" class="external text"
href="https://mailman.jlab.org/pipermail/halld-offline/2021-March/008497.html">New
version set: version_4.37.0.xml</a> The new version
set came out Sunday. Note that HDGeant4 now has the
              fix to the calculation of the distance of closest approach (DOCA) in the FDC.</li>
</ol>
<h3><span class="mw-headline"
id="Review_of_Minutes_from_the_Last_Software_Meeting">Review
of Minutes from the Last Software Meeting</span></h3>
<p>We went over the <a
href="https://halldweb.jlab.org/wiki/index.php/GlueX_Software_Meeting,_March_16,_2021#Minutes"
title="GlueX Software Meeting, March 16, 2021">minutes
from the meeting on March 16</a>. Mark pointed out
that Alex highlighted his exploitation of AmpTools
ability to use GPUs in his talk at the recent Exotic
Search Review. Thomas reported that the purchase order
for the new farm nodes, including those with GPUs
on-board, has been signed.
</p>
<h3><span class="mw-headline"
id="Minutes_from_the_Last_HDGeant4_Meeting">Minutes
from the Last HDGeant4 Meeting</span></h3>
<p>We went over the <a
href="https://halldweb.jlab.org/wiki/index.php/HDGeant4_Meeting,_March_23,_2021#Minutes"
title="HDGeant4 Meeting, March 23, 2021">minutes from
the meeting on March 23</a>. There was some sentiment
for closing the issue of drift distances in the FDC, but
we did not reach a firm decision about where to continue
discussion of agreement with data.
</p>
<h3><span class="mw-headline" id="OSG_Issues">OSG Issues</span></h3>
          <p>Thomas reported on a recent fix to a problem with job
            submission to the OSG that has been with us for over a
            year. A cap of 1,000 had been imposed on the number of
            idle jobs back when job execution ground to a halt due to
            a large number of such jobs; the root problem was not
            understood at the time. Recently the cap was lifted, and
            for a period three weeks ago the number of CLAS12 jobs in
            the idle state ballooned to 70,000. All monitoring and
            job progress stopped, and queries to Condor would not
            return. Thomas traced the problem to the jobs writing
            their Condor logs to /volatile, generating many small
            disk accesses to Lustre and rendering the system
            unresponsive. All jobs going from the JLab submit host
            (scosg16.jlab.org) were affected. The solution was to
            dismount all Lustre disk systems from scosg16.
</p>
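          <p>For illustration, here is a minimal sketch, using the
            HTCondor Python bindings, of a submit description whose
            user log is kept on local scratch rather than on a Lustre
            area such as /volatile. The script name and paths are
            hypothetical, not the actual MCwrapper/OSG configuration.
          </p>
          <pre>
import getpass
import htcondor  # HTCondor Python bindings

# Hypothetical submit description: the point is the choice of "log" location.
# Keeping the user log on local scratch avoids the many small writes to
# Lustre that made the submit host unresponsive.
sub = htcondor.Submit({
    "executable": "run_mc.sh",                       # hypothetical job script
    "arguments":  "$(Process)",
    "log":        f"/scratch/{getpass.getuser()}/mc_$(Cluster).log",  # local disk
    "output":     "out.$(Process)",
    "error":      "err.$(Process)",
})
print(sub)  # prints the generated submit description
          </pre>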
<p>Things are fine now, but there is a backlog of GlueX
jobs that are still working their way through the
system.
</p>
          <p>Thomas also mentioned two mechanisms for controlling OSG
            jobs and their relative priority.
</p>
<ol>
            <li> There is a priority mechanism built into MCwrapper.
              Requests for adjustment should be directed to Thomas.</li>
            <li> After discussions with Justin and others, Thomas is
              instituting an upper limit of 250 million events per
              MCwrapper project. This will prevent inadvertent
              submissions from swamping the system while still
              allowing the work-around of submitting more than one
              project when more events are needed (see the sketch
              after this list).</li>
</ol>
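          <p>As a sketch of the work-around mentioned in the last
            item, a request larger than the cap can simply be divided
            across several projects. The helper below is hypothetical
            and not part of MCwrapper; only the 250-million-event
            figure comes from the discussion.
          </p>
          <pre>
# Hypothetical helper, not part of MCwrapper: split a large event request
# into per-project counts that each respect the 250-million-event cap.
CAP = 250_000_000

def split_request(total_events, cap=CAP):
    """Return per-project event counts, each no larger than cap."""
    counts = []
    remaining = total_events
    while remaining > 0:
        counts.append(min(cap, remaining))
        remaining -= counts[-1]
    return counts

print(split_request(600_000_000))  # [250000000, 250000000, 100000000]
          </pre>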
<h3><span class="mw-headline"
id="Software_Testing_Discussion">Software Testing
Discussion</span></h3>
<p>Mark reminded us where some of the existing
documentation on our test procedures resides, <a
href="https://halldweb.jlab.org/wiki/index.php/GlueX_Offline_Software#Testing_and_Debugging"
title="GlueX Offline Software">linked from the Offline
              Software wiki page</a>. He also went through the list
            of items we discussed at the <a
              href="https://halldweb.jlab.org/wiki/index.php/GlueX_Software_Meeting,_February_2,_2021#Standardized_Tests"
              title="GlueX Software Meeting, February 2, 2021">Software
              Meeting on February 2</a>. We had a somewhat
            inconclusive discussion on how to improve the quality and
            quantity of our testing regime. Mark suggested that a
            small group of us meet to frame the issue. Justin
            cautioned that there is a lot on our plate with the
            upcoming APS Meeting and that a portion of a Software
            Meeting after the APS Meeting might be a good place to
            come up with a plan.
</p>
<h3><span class="mw-headline"
id="Review_of_recent_issues_and_pull_requests">Review
of recent issues and pull requests</span></h3>
          <p>Mark called our attention to <a rel="nofollow"
              class="external text"
              href="https://github.com/JeffersonLab/halld_sim/issues/190">halld_sim
              Issue #190</a>, "Run-to-run efficiency variation in
            2018 run periods", which will be discussed at tomorrow's
            Production and Analysis meeting, so we did not discuss
            it directly.
</p>
          <p>On a related point, Justin mentioned several recent
            changes that should be in place before new Monte Carlo
            is produced.
</p>
<ol>
<li> Tagger energy assignment improvement.
<ul>
<li> Reconstruction-launch-compatible halld_recon
versions need patches to apply the new scheme.</li>
<li> REST data sets from previous launches need a
new reader to undo the old scheme and apply the
new one.</li>
</ul>
</li>
<li> Fix to FDC efficiency in Spring 2018
<ul>
<li> This is the issue mentioned above.</li>
</ul>
</li>
            <li> Additional random trigger file skim
<ul>
<li> Sean reported that there will be an effort to
fill in some of the gaps in our coverage of runs
with corresponding random triggers.</li>
</ul>
</li>
</ol>
<p>Mark mentioned two of his favorite recent outstanding
pull requests:
</p>
<ul>
<li> Diracxx: <a rel="nofollow" class="external text"
href="https://github.com/JeffersonLab/Diracxx/pull/2">Introduce
two make variables: #2</a>. This will allow Diracxx
              to build on newer distributions such as CentOS 8
and Ubuntu 20.</li>
<li> gluex_root_analysis: <a rel="nofollow"
class="external text"
href="https://github.com/JeffersonLab/gluex_root_analysis/pull/147">Top
level make mmi #147</a>. This is a reworking of the
build system to use a makefile at all levels rather
than a mixture of makefiles and shell scripts. The old
mixed system would not halt when errors were generated
              in the build. Also, the dependence of the build on
ROOT_ANALYSIS_HOME was removed, making it easier to
build a local version of the package. [Added in press:
Alex tested and merged the pull request soon after the
meeting.]</li>
</ul>
<h3><span class="mw-headline" id="The_Work_Disk_is_Full">The
Work Disk is Full</span></h3>
<p>Simon broke the news to us. The proximate cause is the
mysterious fluctuation of our quota on the disk server.
See the red points in the plot below:
</p>
<p><a
href="https://halldweb.jlab.org/wiki/index.php/File:Work_disk_2021-03-30.png"
class="image"><img alt="Work disk 2021-03-30.png"
src="https://halldweb.jlab.org/wiki/images/thumb/a/a8/Work_disk_2021-03-30.png/500px-Work_disk_2021-03-30.png"
width="500" height="352"></a>
</p>
<h3><span class="mw-headline" id="Action_Item_Review">Action
Item Review</span></h3>
<ol>
<li> Make sure that the automatic tests of HDGeant4 pull
requests have been fully implemented. (Mark I., Sean)</li>
<li> Finish conversion of halld_recon to use JANA2.
(Nathan)</li>
<li> Release CCDB 2.0 (Dmitry, Mark I.)</li>
</ol>
</div>
<br>
</div>
</div>
</div>
</div>
</body>
</html>