[Halld-offline] Fwd: Near Term Evolution of JLab Scientific Computing
marki at jlab.org
Wed May 10 16:30:58 EDT 2017
If any of you are interested in participating, please let me or David know.
-------- Forwarded Message --------
Subject: Near Term Evolution of JLab Scientific Computing
Date: Mon, 8 May 2017 13:56:59 -0400
From: Chip Watson <watson at jlab.org>
To: Mark Ito <marki at jlab.org>, David Lawrence <davidl at jlab.org>, Ole
Hansen <ole at jlab.org>, Harut Avakian <avakian at jlab.org>, Brad Sawatzky
<brads at jlab.org>, Paul Mattione <pmatt at jlab.org>, Markus Diefenthaler
<mdiefent at jlab.org>, Graham Heyes <heyes at jlab.org>, Sandy Philpott
<philpott at jlab.org>
It was enjoyable to listen to the talks at the "Future Trends in NP
Computing" workshop last week. Some of the more amazing talks showed
what can be done with large investments from talented people plus ample
funding for R&D and deployment. I particularly enjoyed hearing how
machine learning trained the trigger for an LHC experiment, replacing
fast, simple trigger logic with (opaque?) better-performing software.
I would like to turn our thoughts toward the immediate future, and start
a new discussion of the "Near Term Evolution of JLab Scientific
Computing". Some of you on this list are regular attenders of the
Scientific Computing / Physics meetings, and so are included in this
email. Others are people who have been active in computing here at the
lab. This list is NOT carefully thought out, but does include people
from all halls, with an emphasis on GlueX since I receive more of their
emails to help me guess who might be interested. I invite all of you to
forward this to others whom you think would like to join this effort.
*Scope:* Figure out the best path forward for JLab computing for the
next 1-2 years, with a particular focus on meeting the needs of
Experimental Physics in FY18.
Our baseline plan has been to satisfy all computing requirements
in-house. This approach is certainly the most cost effective way to
deploy integrated Tflops or SpecIntRate capacity with good bandwidth to
all the data. But it has a known weakness: it works best when the load
is (mostly) constant throughout the year, while physicists work best
when there is infinite capacity on demand. Capacity that someone other
than DOE NP pays for may also be available, as it already is for GlueX
through their use of OSG resources. Somewhere there may be an optimum
where physicists perceive reasonable turnaround on real loads, and
money is being spent effectively.
1. put together a knowledgeable and interested group of experts
(in requirements, in technology, etc.)
2. update hall requirements for FY18, with particular emphasis on
a. I/O requirements per SpecInt
(roughly, how much bandwidth is needed to support a job of N cores)
b. real, necessary load fluctuations
(i.e., what peaks we need to support and are worth spending money on)
3. roughly evaluate the possible ways to satisfy these requirements
a. in-house baseline, assuming GlueX can offload work to the grid
b. pushing work to a DOE center (NERSC, ORNL, ...) for free
c. paying to use a Cloud for important peaks
d. expanding use of OSG to more halls
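To make item 2a concrete, here is a back-of-envelope sketch of the kind of arithmetic involved; the job count, core count, and per-core I/O rate below are illustrative placeholders, not measured hall requirements:

```python
# Back-of-envelope I/O estimate for multi-core farm jobs.
# All numeric inputs are illustrative assumptions, not hall figures.

def job_bandwidth_mb_s(cores, mb_s_per_core):
    """Aggregate I/O bandwidth a single N-core job needs (MB/s)."""
    return cores * mb_s_per_core

def farm_bandwidth_gb_s(n_jobs, cores, mb_s_per_core):
    """Total storage bandwidth if n_jobs such jobs run concurrently (GB/s)."""
    return n_jobs * job_bandwidth_mb_s(cores, mb_s_per_core) / 1000.0

# e.g. 100 concurrent 16-core jobs, each core reading 2 MB/s:
total = farm_bandwidth_gb_s(100, 16, 2.0)
print(f"{total:.1f} GB/s aggregate")  # 3.2 GB/s
```

Numbers like these, filled in with real per-hall measurements, are what would drive the disk and network sizing discussed below.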
Each of these approaches has different costs. For example, as more
work is pushed offsite, we will need to upgrade the lab's network from
10g to 40g (at least). Use of the grid has more impact on users than
in-house or cloud capacity does (cloud infrastructure-as-a-service can
appear to be part of our Auger / PBS system). Any solution that doubles
performance but keeps storage at JLab will require upgrades to the
local disk and tape subsystems, even if the farm remains fixed in size.
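As a rough illustration of the WAN point above (the 10g/40g link speeds come from the paragraph; the offsite job count and per-job I/O rate are placeholder assumptions):

```python
# Does a given offsite workload fit in the lab's WAN link?
# Link speeds in Gb/s; per-job I/O in MB/s (illustrative values only).

def wan_utilization(n_offsite_jobs, mb_s_per_job, link_gb_s):
    """Fraction of the WAN link consumed by offsite job I/O."""
    demand_gb_s = n_offsite_jobs * mb_s_per_job * 8 / 1000.0  # MB/s -> Gb/s
    return demand_gb_s / link_gb_s

# 200 offsite jobs at 10 MB/s each is 16 Gb/s of demand:
print(f"10g link: {wan_utilization(200, 10, 10):.0%}")  # 160% -- saturated
print(f"40g link: {wan_utilization(200, 10, 40):.0%}")  # 40%
```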
4. technology pilots of the best-looking alternatives
5. evaluation of existing software products, or in-house
software development, to make this possibly more complex
ecosystem transparent to the user
Initially, I would like us to meet weekly, with gaps when too many
people are unavailable. As the work becomes better defined, this
can switch to every 2-3 weeks, with people doing "homework" between
meetings and reporting back.
I would like preliminary decisions reached by mid August, so that if
there is any end of year funding available, we can put forward a
proposal for investments (in farm, disk, tape, or wide area
networking). I also see this helping to shape the FY18 budgets for the
affected divisions and groups. So, possibly 8 meetings.
Due to the HUGE uncertainty in budgets for FY18, we will plan against 2
scenarios with different numbers of weeks of running (which drives all
of JLab's experimental physics computing).
Anticipated evolution of code performance is an important topic.
*Your Next Steps*
1. Forward to key people omitted from this sparse list (I'm looking for
a total of 7-12 people, including myself, Graham, and Sandy)
2. Reply to me to let me know if you would like to be a participant
(everyone will get a report in late summer).
3. Help me in setting an initial meeting date: reply with what hours you
could be available on each of the following dates:
Thursday May 11
Thursday May 18
Monday May 22
Tuesday May 23
Wednesday May 24
I will get back to each of you who says "yes" as soon as it is clear
when we can reach a decent quorum. Remote participation can be
supported for working team members (no "listeners only" please).