[Halld-offline] Offline Software Meeting, April 2, 2014
Mark Ito
marki at jlab.org
Fri Apr 4 09:56:25 EDT 2014
Folks,
Find the minutes below and at
https://halldweb1.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_April_2,_2014#Minutes
-- Mark
______________________________________________________
GlueX Offline Meeting, April 2, 2014
Minutes
Present:
* CMU: Paul Mattione, Curtis Meyer
* IU: Kei Moriya
* JLab: Mark Ito (chair), Sandy Philpott, Dmitry Romanov, Simon
Taylor, Beni Zihlmann
* MIT: Justin Stevens
* NU: Sean Dobbs
Review of Minutes from the Last Meeting
We looked over the [24]minutes from March 19. Sean has done some work
wrapping HDDM calls for use with Python, as part of exploring the
use of EventStore.
Data Challenge Meeting Report, March 28
We also looked over [25]these minutes. Some of Mark's comments
(see below) addressed issues raised last Friday.
Plot of Running DC2 Jobs as a Function of Time at JLab
Mark showed a plot:
[26]Jobs gluex.png
We have borrowed another 1000 cores from the LQCD farm, bringing our
share up to about 4000 cores. This last slug came in over the last
couple of days.
The large fluctuations are due to the fact that the farm scheduler
cannot take into account usage by a user until the end of jobs. At
start-up, if jobs take 24 hours to run (as these do), then during that
initial period usage from those jobs is assumed to be zero. Also during
this period, the user is boosted in priority in order to make up for
lack of usage in the recent past. After the jobs are done and all of
that usage is accounted for, the user appears to be over quota and
gets turned off for a while, and so on. This turns out to be completely
normal behavior for the system given long-standing parameter settings.
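As a rough illustration of the cycle described above, here is a toy
simulation in Python. It is only a sketch: the function name, the
48-hour accounting window, the share of 100 cores and the 2x start-up
boost are all assumptions made for the example, not the farm
scheduler's actual algorithm or settings. The one feature taken from
the discussion is that usage is credited only when a job finishes.

```python
def simulate(hours=96, job_len=24, share=100, window=48, boost=2):
    """Toy fair-share model (hypothetical parameters throughout).

    Returns the number of running jobs at each hour.
    """
    running = []   # batches of running jobs: (finish_hour, n_jobs)
    credited = []  # usage the scheduler can see: (finish_hour, core_hours)
    history = []
    for t in range(hours):
        # Usage becomes visible to the scheduler only when jobs finish.
        for finish, n in running:
            if finish <= t:
                credited.append((finish, n * job_len))
        running = [(f, n) for f, n in running if f > t]

        # Core-hours credited to the user within the recent window.
        accounted = sum(ch for f, ch in credited if t - f < window)
        n_running = sum(n for _, n in running)

        # Under quota: fill up to the boosted limit. Over quota: start nothing.
        if accounted < share * window:
            free = boost * share - n_running
            if free > 0:
                running.append((t + job_len, free))
                n_running += free
        history.append(n_running)
    return history


history = simulate()
print(history[0], history[23], history[24], history[71], history[72])
```

With these assumed numbers the user runs 200 jobs for the first 24
hours, is shut out until the credited usage ages out of the window,
and then starts over.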
Comments on DC2 Issues
Mark led us through his wiki page, commenting on three topics:
1. Monitoring quality of the current data challenge
+ We decided to look by hand at the monitoring histograms we
are producing, checking every 1000th job. Simon will do the
looking at JLab. Sean will share a script he has written to
compare histograms to standards, which should help.
2. File transfers in and out of JLab
+ Sean thought that the Globus Online options would not
work for pushing files to SRM-capable sites. He thought that
the SRM client tools would be sufficient if they could be
installed at JLab. He also suggested that we look into raw
GridFTP (as Chip Watson has suggested in the past).
3. Event Tally Board
+ We agreed to maintain a [27]Data Challenge 2 Event Tally Board
to keep track of progress.
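The spot check agreed on under item 1 might look something like the
sketch below. This is not Sean's script, which was not shown at the
meeting; the function names, the Poisson-error chi-square and the
example bin contents are assumptions for illustration. Histograms are
represented as plain lists of bin contents.

```python
def sample_jobs(job_ids, step=1000):
    """Pick every step-th job (by job number) for a manual look."""
    return [j for j in job_ids if j % step == 0]


def reduced_chi2(hist, ref):
    """Shape comparison of a job histogram against a reference histogram.

    The reference is scaled to the same total as the job histogram, and
    Poisson errors are assumed on both. Bins with no entries in either
    histogram are skipped.
    """
    if len(hist) != len(ref):
        raise ValueError("histograms must have the same binning")
    if sum(ref) <= 0:
        raise ValueError("reference histogram is empty")
    scale = sum(hist) / sum(ref)
    chi2, nbins = 0.0, 0
    for h, r in zip(hist, ref):
        var = h + scale * scale * r  # Poisson variance of (h - scale*r)
        if var > 0:
            chi2 += (h - scale * r) ** 2 / var
            nbins += 1
    return chi2 / nbins if nbins else 0.0


# Hypothetical example: 4500 jobs, made-up bin contents.
print(sample_jobs(range(1, 4501)))
print(reduced_chi2([10, 20, 30], [20, 40, 60]))  # same shape
print(reduced_chi2([10, 20, 30], [30, 20, 10]))  # different shape
```

A value near 1 or below suggests the job histogram matches the
reference shape within statistics; where to set the alarm threshold
is a judgment call.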
Returning Nodes to LQCD
We had a brief discussion on how long we should be using the nodes we
have borrowed from LQCD. We still have a substantial balance on the
amount owed to Physics from the December-March loan to LQCD. Curtis
pointed out that we have already hit a 4500-job milestone, exceeding the
benchmark of 1250 cores that had been set for us. Mark pointed out that
the cores are all doing useful work. The OSG "site" has not come online
yet. Given that the OSG contributed 80% of the cycles for the last data
challenge, it is hard to say where we are now.
We did not come to a firm decision but will have to revisit this every
few days or so. For now we continue to run with the 4000 total cores.
REST Filesizes and Reproducibility
Kei presented recent studies he has done comparing repeated
reconstruction runs on the same smeared event file. See [28]his slides
for details. His slides covered:
* Output file sizes
* A bad log file (hdgeant)
* Run info (cpu time, virtual memory)
* File size correlation (mcsmear, iteration to iteration)
* File size correlation (REST, iteration to iteration)
* File size correlation at IU
* hd_dump of factories (comparison of iterations)
* Single different event
* File size correlation with CMU (REST)
We remarked that the numerical differences observed are truly in the
round-off error regime. We also thought it was odd that identical runs
at IU and CMU should differ by as much as 1% in file size. Progress
from this point looks difficult given the small differences being
reported. Kei will continue his studies.
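For a sense of what round-off-level differences look like, the short
Python sketch below shows how merely changing the order of a
floating-point sum changes the last digit of the result. The helper
name and the tolerances are assumptions for illustration and are not
taken from Kei's study.

```python
import math


def max_rel_diff(xs, ys):
    """Largest relative difference between paired values from two runs."""
    worst = 0.0
    for x, y in zip(xs, ys):
        denom = max(abs(x), abs(y))
        if denom > 0:
            worst = max(worst, abs(x - y) / denom)
    return worst


# The same three numbers summed in two different orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)                             # False
print(math.isclose(a, b, rel_tol=1e-12))  # True
print(max_rel_diff([a], [b]) < 1e-15)     # True
```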
Next Data Challenge Meeting
We agreed to [29]meet again on Friday to update tallies and discuss
schedule.
References
24. https://halldweb1.jlab.org/wiki/index.php/GlueX_Offline_Meeting,_March_19,_2014#Minutes
25. https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_March_28,_2014#Minutes
26. https://halldweb1.jlab.org/wiki/index.php/File:Jobs_gluex.png
27. https://halldweb1.jlab.org/wiki/index.php/Data_Challenge_2_Event_Tally_Board
28. https://halldweb1.jlab.org/wiki/images/4/4e/2014-04-02-dc2.pdf
29. https://halldweb1.jlab.org/wiki/index.php/GlueX_Data_Challenge_Meeting,_April_4,_2014
--
Mark M. Ito, Jefferson Lab, marki at jlab.org, (757)269-5295