[Halld-offline] Data Challenge Meeting Minutes, December 17, 2012
Mark M. Ito
marki at jlab.org
Tue Dec 18 14:13:47 EST 2012
Find the minutes at
https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_December_17,_2012#Minutes
and below.
___
GlueX Data Challenge Meeting, December 17, 2012
Minutes
Present:
* CMU: Paul Mattione
* JLab: Mark Ito (chair), David Lawrence, Yi Qiang, Dmitry Romanov,
Elton Smith, Simon Taylor, Beni Zihlmann
* UConn: Richard Jones
Data Challenge 1 status
Production started at the three sites Wednesday, December 5, as
planned.
We reviewed progress at the various sites:
* JLab: 678 million events
* Grid: 3.4 billion events
* CMU: 270 million events
See the [20]Data Challenge 1 page for a few more details.
We ran down some of the problems encountered:
* A lot of the time spent getting the grid effort started went into
correcting problems. Some jobs, after resubmitting themselves
following a crash, would crash again, and eventually a majority of
the jobs were caught in this infinite loop and had to be stopped by
hand. This was solved by lowering the number of resubmissions
allowed.
* There were occasional segmentation faults in hdgeant. Richard is
investigating the cause.
* mcsmear would sometimes hang. David and Richard chased this down to
the processing thread taking more than 30 seconds with an event and
then killing and re-launching itself without releasing the mutex
lock for the output file (a minimal sketch of this stuck-lock
failure appears after this list).
+ Re-running the job fixed this problem because mcsmear was
seeded differently each time.
+ The lock-release problem will be fixed.
+ We have to find out why it can take more than 30 seconds to
smear an event.
+ The default behavior should be changed to a hard crash.
Re-launching threads could still be retained as an option.
* At JLab, some jobs produced no output files and ended only after
exceeding the job CPU-time limit.
* Also at JLab, some of the REST format files did not have the full
50,000 events.
* There may be other failure modes that we have not cataloged. We
will at least try to figure out what happened with all failures.
* At the start of the grid effort the submission node crashed. It was
replaced with a machine with more memory, which solved the problem.
We peaked at 7,000 grid jobs running simultaneously, about 10% of
the total grid capacity.
* Another host in the grid system, the user scheduler, which
maintains a daemon for each job, also needed more memory to
function under this load.
* The storage resource manager (SRM), which in this case handled the
transfer of the output files back to UConn, was very reliable. The
gigabit pipe back to UConn was essentially filled during this
effort.
* Richard thought that next time we should do 100 million events and
then go back and debug the code. Mark reminded us that the thinking
was that the failure rate was low enough to do useful work and that
it was more important to get the data challenge going and learn our
lessons, since we will have other challenges in the future. [Note
added in press: coincidentally, 100 million was the size of our
standard mini-challenge. Folks will recall that those challenges
started out with unacceptable failure rates and iterated to iron
out the kinks.]
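As an illustration of the mcsmear hang described above, here is a
minimal C++ sketch of the stuck-lock failure mode. It is not the actual
mcsmear threading code; the names and structure are hypothetical. The
point is that if a worker thread is torn down on a timeout while it
still holds the output-file mutex, every thread launched afterward
blocks forever, whereas a pattern that always releases the lock (or a
hard crash, per the suggestion above) avoids the silent hang.

    // Illustrative only: not the actual mcsmear code.
    #include <fstream>
    #include <mutex>

    std::mutex output_mutex;                    // protects the shared output file
    std::ofstream output_file("smeared.hddm");  // hypothetical output name

    struct Event { /* event payload */ };

    // Problematic pattern: manual lock/unlock. Any exit between lock()
    // and unlock(), including the thread being killed by a watchdog
    // after 30 seconds, leaves the mutex held, so the relaunched
    // thread blocks forever in lock().
    void write_event_unsafe(const Event &ev)
    {
        output_mutex.lock();
        // ... serialize ev to output_file ...
        output_mutex.unlock();
    }

    // Safer pattern: an RAII guard releases the mutex on every normal
    // exit path (return or exception). A watchdog that wants to restart
    // the worker should let it unwind cleanly, or give up and crash
    // hard, rather than destroying it while the lock is held.
    void write_event_safe(const Event &ev)
    {
        std::lock_guard<std::mutex> guard(output_mutex);
        // ... serialize ev to output_file ...
    }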
Curtis's Thoughts
Curtis sent around an [21]email with his assessment of our status and
where he thinks we should go from here. Most notably, he suggests we
write a report on DC-1.
Shutdown/Continuation Plan
There was consensus that, since we have already exceeded our original
goals by more than a factor of two, we should stop submitting more
jobs and assess where we are. The expectation is that currently
submitted jobs will finish running in a day or two.
Work list for post DC-1 period
* We decided that we would archive all of the files (REST files, ROOT
files, and log files) to the JLab tape library. Details have to be
worked out, but we should do this right away.
* To distribute the data, we will move all of the REST data to UConn
and make it available via the SRM. Note that most of the data is at
UConn already anyway.
* We will also try to have all of the REST data on disk at JLab.
* We should look into SURA grid and see if we have any claim on its
resources.
* Paul suggested doing skims of selected topologies for use by
individuals doing specific analyses. Those interested in particular
types of events should think about making proposals.
* Richard suggested we develop a JANA plug-in to read data using the
SRM directly. Only the URL would have to be known, and the data
could be streamed in.
* To enable general access to the data, we decided that we should all
get grid certificates, i.e., obtain credentials for everyone in the
collaboration. Richard will send instructions on how to get started
with this.
* Problems to address:
+ seg faults in hdgeant
+ hangs in mcsmear
+ random number seed control (see the sketch below)
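As a starting point on the seed-control item, below is a small,
self-contained C++ sketch (hypothetical names and interface, not the
actual mcsmear or hdgeant scheme) of one common approach: derive each
job's seed deterministically from identifiers already attached to the
job, such as the run and file numbers. Distinct jobs then get
independent streams, while re-running a failed job reproduces the
original sequence, which would also make failures like the mcsmear
hang reproducible.

    // Hypothetical sketch of deterministic per-job seeding.
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <random>

    // Mix the run and file numbers into a well-scrambled 64-bit seed
    // (splitmix64-style finalizer).
    std::uint64_t make_seed(std::uint32_t run, std::uint32_t file)
    {
        std::uint64_t z = (static_cast<std::uint64_t>(run) << 32) | file;
        z += 0x9e3779b97f4a7c15ULL;
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }

    int main(int argc, char **argv)
    {
        // The run and file numbers would normally come from the job script.
        std::uint32_t run  = (argc > 1) ? std::atoi(argv[1]) : 9000;
        std::uint32_t file = (argc > 2) ? std::atoi(argv[2]) : 0;

        std::mt19937_64 rng(make_seed(run, file));  // reproducible per (run, file)
        std::uint64_t a = rng(), b = rng(), c = rng();
        std::printf("run %u, file %u: first draws %llu %llu %llu\n",
                    run, file,
                    (unsigned long long)a, (unsigned long long)b,
                    (unsigned long long)c);
        return 0;
    }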
Thoughts on DC-2
We need to start thinking about the next data challenge, in
particular its goals and schedule.
References
20. https://halldweb1.jlab.org/wiki/index.php/Data_Challenge_1
21. https://halldweb1.jlab.org/wiki/index.php/Curtis_on_DC-1
--
Mark M. Ito
Jefferson Lab
marki at jlab.org
(757)269-5295