[Halld-offline] Offline Software Meeting Minutes, January 18, 2017
Mark Ito
marki at jlab.org
Thu Jan 19 19:59:25 EST 2017
Folks,
Find the minutes below and at
https://halldweb.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_January_18,_2017#Minutes
.
-- Mark
____________________
GlueX Offline Meeting, January 18, 2017, Minutes
Present:
* *FIU*: Mahmoud Kamel
* *JLab*: Alexander Austregesilo, Nathan Baltzell, Alex Barnes, Thomas
Britton, Brad Cannon, Mark Ito (chair), Nathan Sparks, Kurt
Strosahl, Simon Taylor, Beni Zihlmann
* *MIT*: Cristiano Fanelli
* *NU*: Sean Dobbs
* *UConn*: Richard Jones + 2
* *W&M*: Justin Stevens
There is a recording of this meeting <https://bluejeans.com/s/tgAip/> on
the BlueJeans site.
Announcements
 1. *Backups of the RCDB database in SQLite* form are now being kept on
    the write-through cache, in /cache/halld/home/gluex/rcdb_sqlite/.
    See Mark's email
    <https://mailman.jlab.org/pipermail/halld-offline/2017-January/002573.html>
    for more details. A minimal backup sketch appears after this list.
 2. *Development of a wrapper for signal MC generation*. Thomas has
    written scripts to wrap the basic steps of signal Monte Carlo
    generation. One can specify the number of events and the .input file
    to use for genr8, and jobs will be submitted via SWIF. Paul thought
    the average user would find this useful. Mark suggested that the
    code could be version controlled with the hd_utilities repository
    <https://github.com/JeffersonLab/hd_utilities> on GitHub. A sketch
    of the idea appears after this list.
 3. *More Lustre space*. Mark reported that our total Lustre space has
    been increased from a quota of 200 TB to 250 TB. See his email
    <https://mailman.jlab.org/pipermail/halld-offline/2017-January/002594.html>
    for a few more details.
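
Regarding the RCDB backups in item 1: the snippet below is a minimal
illustration only (not the actual backup mechanism in use), showing how
an SQLite snapshot could be written into the cache area with Python's
sqlite3 backup API. The source file name is an assumption.

    import sqlite3
    from datetime import date

    # Hypothetical source file name; the destination is the
    # write-through cache directory mentioned in announcement 1.
    src = sqlite3.connect("rcdb.sqlite")
    dst = sqlite3.connect(
        "/cache/halld/home/gluex/rcdb_sqlite/rcdb_%s.sqlite"
        % date.today().isoformat())
    with dst:
        src.backup(dst)  # consistent copy even if the DB is in use
    dst.close()
    src.close()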
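
Regarding the wrapper in item 2: the sketch below illustrates the
general idea only; it is not Thomas's actual script, and the job script
name run_mc.sh and the swif option names are assumptions.

    #!/usr/bin/env python
    # Illustrative wrapper: split the requested number of events into
    # SWIF jobs, each of which would run the generation/simulation/
    # reconstruction chain via a (hypothetical) run_mc.sh job script.
    import argparse
    import subprocess

    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True,
                        help="genr8 .input file for the signal channel")
    parser.add_argument("--nevents", type=int, required=True,
                        help="total number of events to generate")
    parser.add_argument("--events-per-job", type=int, default=10000)
    parser.add_argument("--workflow", default="signal_mc")
    args = parser.parse_args()

    njobs = (args.nevents + args.events_per_job - 1) // args.events_per_job
    for i in range(njobs):
        subprocess.check_call(
            ["swif", "add-job", "-workflow", args.workflow,
             "-name", "mcgen_%04d" % i,
             "run_mc.sh", args.input, str(args.events_per_job), str(i)])
    print("Submitted %d jobs to workflow %s" % (njobs, args.workflow))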
Lustre system status
Kurt Strosahl, of JLab SciComp, dropped by to give us a report on the
recent problems with the Lustre file system
<https://en.wikipedia.org/wiki/Lustre_%28file_system%29>. This has
affected our work, cache, and volatile directories. Lustre aggregates
multiple partitions on multiple RAID arrays, "block devices" or "Object
Store Targets" (OSTs), and presents to users a view of one large disk
partition. There are redundant metadata systems to keep track of which
files are where.
On New Year's Day, due to Infiniband problems, a fail-over from one
metadata system to the other was mistakenly initiated. In the confusion,
both systems tried to mount a few of the OSTs, corrupting the metadata
for five of the 74 OSTs. This was the first time a fail-over had
occurred for a production system at JLab. Intel and SciComp have been
working together to recover the metadata. The underlying files appear to
be OK, but without the metadata they cannot be accessed. So far,
metadata for four of the five OSTs has been repaired and it appears that
their files have reappeared intact. This work has been going on for over
two weeks now; there is no definite estimate on when the last OST will
be recovered. Fail-over has been inhibited for now.
We asked Kurt about recent troubles with ifarm1102. That particular
node has been having issues with its Infiniband interface and has now
been removed from the rotation of ifarm machines.
Review of minutes from the last meeting
We went over the minutes from December
<https://halldweb.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_December_21,_2016#Minutes>
(all):
* The problem with reading data with multiple threads turned out indeed
  to be due to corrupted data. There was an issue with the RAID arrays
  in the counting room.
* Sean has had further discussions on how we handle HDDS XML files.
There is a plan now.
* Mark will ask about getting us an update on the OSG appliance.
Launches
Alex A. gave the report.
2016-10 offline monitoring ver02
Alex wanted to start this launch before the break, but calibrations were
not ready. Instead, processing started the first week of January, but
since there was not a lot of data in the run period, it finished in a
week. The gxproj1 account was used. There were some minor problems with
the post-processing scripts that have now been fixed.
We are waiting for new calibration results before starting ver03,
perhaps sometime next week. There was an issue with the propagation
delay calibration for the TOF that has now been resolved, and there are
ongoing efforts with BCAL and FCAL calibrations. The monitoring launch
gives us a check on the quality of the calibrations for the entire
running period.
2016-02 analysis launch ver05
This launch started before the break. Jobs are running with only six
threads. Large variation in execution time and peak memory use has been
observed. The cause has been traced to a few channels that require many
photons (e.g., 3π^0) and can generate huge numbers of combinations,
stopping progress on a single thread; an illustrative count is given
below. Several solutions were discussed, including re-writing parts of
the analysis library and cutting off processing for events that generate
too many combinations. In addition, in the future the list of plugins
may get trimmed. This launch took the philosophy of running "everything"
to see how many channels we could reasonably get through.
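
To give a feel for the scale of the combinatoric problem (the shower
multiplicities below are made up for illustration): for a 3π^0 final
state, six photons must be chosen from the reconstructed showers and
grouped into three π^0 candidates, and the count grows very quickly.

    from math import comb, factorial

    def n_3pi0_photon_combos(n_showers):
        # choose 6 showers, then split them into 3 unordered pairs:
        # 6! / (2!^3 * 3!) = 15 pairings per choice of six showers
        pairings = factorial(6) // (2 ** 3 * factorial(3))
        return comb(n_showers, 6) * pairings

    for n in (8, 12, 16, 20):
        print(n, n_3pi0_photon_combos(n))
    # 20 showers already give 38760 * 15 = 581400 photon combinations,
    # before the other particles in the reaction are even considered.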
Sim 1.2
Mark reported that the 50 k jobs that have been submitted are going
through very slowly. Processing started in the middle of the break and
is only 20% done, and this batch is only 20% of the total we planned to
simulate; a rough projection is given below. The processing time is
dominated by the generation of electromagnetic background independently
for each event. After some discussion of the purpose of the resulting
data set, we decided to re-launch the effort without generation of E&M
background. The data should still be useful for studying
efficiency/acceptance.
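
A rough projection makes the decision concrete; the elapsed time below
is an assumption (processing started in the middle of the break), while
the two 20% figures are the ones quoted above.

    # Back-of-the-envelope Sim 1.2 schedule projection (illustrative).
    elapsed_weeks = 3.0   # assumed: middle of the break to this meeting
    frac_done     = 0.20  # fraction of the 50 k submitted jobs finished
    frac_of_total = 0.20  # this batch's share of the planned simulation

    weeks_this_batch = elapsed_weeks / frac_done         # ~15 weeks
    weeks_full_plan  = weeks_this_batch / frac_of_total  # ~75 weeks at this rate
    print(weeks_this_batch, weeks_full_plan)
    # With per-event EM background generation dominating the time,
    # dropping it in the relaunch should speed things up substantially.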
HDGeant/HDGeant4 Update
Richard gave us an update on the development effort.
* He is doing a tag-by-tag comparison of the output from HDGeant ("G3"
for our purposes) and HDGeant4 ("G4"), comparing both truth and hit
information. For 90% of the discrepancies he finds it is the new G4
code that needs fixing, but the other 10% come from G3 errors,
mostly in truth information that is not looked at as often.
* To do the comparison he has developed a new tool, hddm-root, that
creates a ROOT tree auto-magically directly from an HDDM file. This
allows quick histogramming of quantities for comparison.
* Detectors where agreement has been verified: CDC, FDC, BCAL, FCAL,
  TOF, tagger, pair spectrometer (coarse and fine), triplet polarimeter.
* The triplet polarimeter simulation was adapted from code from Mike
  Dugger and was originally implemented in G4, but has also been
  back-ported to G3.
* To test the TPOL simulation, a new card has been introduced,
  GENB[?], that will track beam photons down the beamline from the
  radiator. It has three modes: pre-coll, post-coll, and post-conv,
  which end tracking at the collimator, at the converter, and on
  through to the TPOL, respectively. The generated particle information
  can be written out in HDDM format and serve as input to either G3 or
  G4, just as for any of our other event generators.
* The coherent bremsstrahlung generator has been implemented in G4 and
compared to that of G3.
* "Fake" tagger hits are now being generated in G4 in the same manner
as was done in G3. Also a new tag RFTime[?] has been introduced. It
is a single time that sets the "true" phase of the RF used in the
simulation.
* Other detectors implemented: the DIRC, the MWPC (for the CPP
experiment), and for completeness the gas RICH, the gas Cerenkov,
and the UPV.
* The MCTRAJECTORY card has been implemented in G4 and its
implementation in G3 fixed. This allows output of position
information for particle birth, death, and/or points in between for
primary and/or secondary particles in a variety of combinations of
those items.
* The following G3 cards have been implemented in G4. The secretary
will refer the reader to the documentation in the sample control.in
for definitions of most of these.
o KINE
o SCAT
o TARGET
o BGGATE
o BGRATE
* The following cards will not be implemented in G4
o CUTS
o SWITCH
+ CUTS and SWITCH do not fit into the Geant4 design philosophy
o GELHAD
+ photonuclear interactions are now provided natively in Geant4
* The following cards are being implemented now:
o HADR
+ The meaning in G4 has been modified to control turning
on/off all hadronic interaction processes to save users the
bother of doing so one by one
o CKOV
o LABS
o NOSEC
o AUTO
o BFIELD_MAP
o PSFIELD_MAP
o SAVEHIT
o SHOWERS_IN_COLLIMATOR
o DRIFT_CLUSTERS
o MULS
o BREMS
o COMPT
o LOSS
o PAIR
o DECAY
o DRAY
Beni will describe the scheme he implemented in G3 to preserve the
identity of secondary particles and transmit the description to Richard.
Performance remains an issue but is not an area of focus at this stage.
Richard has seen a slow-down of a factor of 24 per thread going from G3
to G4. At this point G4 is generating two orders of magnitude more
secondary particles, mostly neutrons, compared to G3. A simple kinetic
energy threshold adjustment did not make much of a difference.
Sean made a couple of comments:
1. The problem Richard discovered with missing TDC hits in the BCAL has
been traced to the generation of digi-hits for the BCAL. CCDB
constants had to be adjusted to bring those hits back.
2. Caution should be used with the current pair spectrometer field map
    called for in the CCDB. It is only a preliminary rough guess.
Mark needs to create a standard build of G4.
Richard requested that if folks have problems, questions, or
suggestions, they should log an issue on GitHub
<https://github.com/rjones30/HDGeant4/issues>.