[Halld-offline] Data Challenge Meeting Minutes, December 17, 2012
Mark M. Ito
marki at jlab.org
Tue Dec 18 14:13:47 EST 2012
Find the minutes at
https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_December_17,_2012#Minutes
and below.
___
GlueX Data Challenge Meeting, December 17, 2012
Minutes
Present:
* CMU: Paul Mattione
* JLab: Mark Ito (chair), David Lawrence, Yi Qiang, Dmitry Romanov,
Elton Smith, Simon Taylor, Beni Zihlmann
* UConn: Richard Jones
Data Challenge 1 status
Production started at the three sites Wednesday, December 5, as
planned.
We reviewed progress at the various sites:
* JLab: 678 million events
* Grid: 3.4 billion events
* CMU: 270 million events
See the [20]Data Challenge 1 page for a few more details.
We ran down some of the problems encountered:
* A lot of the time spent getting the grid effort started went into
correcting problems. Some jobs, after resubmitting themselves
following a crash, would crash again, and eventually a majority of
the jobs were caught in this infinite loop and had to be stopped by
hand. This was solved by lowering the number of resubmissions
allowed.
* There were occasional segmentation faults in hdgeant. Richard is
investigating the cause.
* mcsmear would sometimes hang. David and Richard chased this down to
the processing thread taking more than 30 seconds with an event and
then killing and re-launching itself without releasing the mutex
lock for the output file (a minimal sketch of this stuck-lock
failure appears after this list).
+ Re-running the job fixed this problem because mcsmear was
seeded differently each time.
+ The lock-release problem will be fixed.
+ We have to find out why it can take more than 30 seconds to
smear an event.
+ The default behavior should be changed to a hard crash.
Re-launching threads could still be retained as an option.
* At JLab, some jobs produced no output files and ended only after
exceeding the job CPU-time limit.
* Also at JLab, some of the REST format files did not have the full
50,000 events.
* There may be other failure modes that we have not cataloged. We
will at least try to figure out what happened with all failures.
* At the start of the grid effort the submission node crashed. It was
replaced with a machine with more memory, which solved the problem.
We peaked at 7,000 grid jobs running simultaneously, about 10% of
the total grid capacity.
* Another host in the grid system, the user scheduler, which
maintains a daemon for each job, also needed more memory to
function under this load.
* The storage resource manager (SRM), which in this case handled the
transfer of the output files back to UConn, was very reliable. The
gigabit pipe back to UConn was essentially filled during this
effort.
* Richard thought that next time we should do 100 million events and
then go back and debug the code. Mark reminded us that the thinking
was that the failure rate was low enough to do useful work and that
it was more important to get the data challenge going and learn our
lessons, since we will have other challenges in the future. [Note
added in press: coincidentally, 100 million was the size of our
standard mini-challenge. Folks will recall that those challenges
started out with unacceptable failure rates and iterated to iron
out the kinks.]
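As an illustration of the mcsmear hang described above, here is a
minimal C++ sketch of the stuck-lock failure mode. It is not the actual
mcsmear threading code; the names and structure are hypothetical. The
point is that if a worker thread is torn down on a timeout while it
still holds the output-file mutex, every thread launched afterward
blocks forever, whereas a pattern that always releases the lock (or a
hard crash, per the suggestion above) avoids the silent hang.

    // Illustrative only: not the actual mcsmear code.
    #include <fstream>
    #include <mutex>

    std::mutex output_mutex;                    // protects the shared output file
    std::ofstream output_file("smeared.hddm");  // hypothetical output name

    struct Event { /* event payload */ };

    // Problematic pattern: manual lock/unlock. Any exit between lock()
    // and unlock(), including the thread being killed by a watchdog
    // after 30 seconds, leaves the mutex held, so the relaunched
    // thread blocks forever in lock().
    void write_event_unsafe(const Event &ev)
    {
        output_mutex.lock();
        // ... serialize ev to output_file ...
        output_mutex.unlock();
    }

    // Safer pattern: an RAII guard releases the mutex on every normal
    // exit path (return or exception). A watchdog that wants to restart
    // the worker should let it unwind cleanly, or give up and crash
    // hard, rather than destroying it while the lock is held.
    void write_event_safe(const Event &ev)
    {
        std::lock_guard<std::mutex> guard(output_mutex);
        // ... serialize ev to output_file ...
    }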
Curtis's Thoughts
Curtis sent around an [21]email with his assessment of our status and
where he thinks we should go from here. Most notably, he suggests we
write a report on DC-1.
Shutdown/Continuation Plan
There was consensus that, since we have already exceeded our original
goals by more than a factor of two, we should stop submitting more
jobs and assess where we are. The expectation is that currently
submitted jobs will finish running in a day or two.
Work list for post DC-1 period
* We decided that we would archive all of the files (REST files, ROOT
files, and log files) to the JLab tape library. Details have to be
worked out, but we should do this right away.
* To distribute the data, we will move all of the REST data to UConn
and make it available via the SRM. Note that most of the data is at
UConn already anyway.
* We will also try to have all of the REST data on disk at JLab.
* We should look into SURA grid and see if we have any claim on its
resources.
* Paul suggested doing skims of selected topologies for use by
individuals doing specific analyses. Those interested in particular
types of events should think about making proposals.
* Richard suggested we develop a JANA plug-in to read data using the
SRM directly. Only the URL would have to be known, and the data
could be streamed in.
* To enable general access to the data, we decided that we should all
get grid certificates, i.e., obtain credentials for everyone in the
collaboration. Richard will send instructions on how to get started
with this.
* Problems to address:
+ seg faults in hdgeant
+ hangs in mcsmear
+ random number seed control (see the sketch below)
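As a starting point on the seed-control item, below is a small,
self-contained C++ sketch (hypothetical names and interface, not the
actual mcsmear or hdgeant scheme) of one common approach: derive each
job's seed deterministically from identifiers already attached to the
job, such as the run and file numbers. Distinct jobs then get
independent streams, while re-running a failed job reproduces the
original sequence, which would also make failures like the mcsmear
hang reproducible.

    // Hypothetical sketch of deterministic per-job seeding.
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <random>

    // Mix the run and file numbers into a well-scrambled 64-bit seed
    // (splitmix64-style finalizer).
    std::uint64_t make_seed(std::uint32_t run, std::uint32_t file)
    {
        std::uint64_t z = (static_cast<std::uint64_t>(run) << 32) | file;
        z += 0x9e3779b97f4a7c15ULL;
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }

    int main(int argc, char **argv)
    {
        // The run and file numbers would normally come from the job script.
        std::uint32_t run  = (argc > 1) ? std::atoi(argv[1]) : 9000;
        std::uint32_t file = (argc > 2) ? std::atoi(argv[2]) : 0;

        std::mt19937_64 rng(make_seed(run, file));  // reproducible per (run, file)
        std::uint64_t a = rng(), b = rng(), c = rng();
        std::printf("run %u, file %u: first draws %llu %llu %llu\n",
                    run, file,
                    (unsigned long long)a, (unsigned long long)b,
                    (unsigned long long)c);
        return 0;
    }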
Thoughts on DC-2
We need to start thinking about the next data challenge, in
particular its goals and schedule.
References
20. https://halldweb1.jlab.org/wiki/index.php/Data_Challenge_1
21. https://halldweb1.jlab.org/wiki/index.php/Curtis_on_DC-1
--
Mark M. Ito
Jefferson Lab
marki at jlab.org
(757)269-5295