[Halld-offline] Fwd: Near Term Evolution of JLab Scientific Computing
marki at jlab.org
Wed May 10 16:30:58 EDT 2017
If any of you are interested in participating, please let me or David know.
-------- Forwarded Message --------
Subject: Near Term Evolution of JLab Scientific Computing
Date: Mon, 8 May 2017 13:56:59 -0400
From: Chip Watson <watson at jlab.org>
To: Mark Ito <marki at jlab.org>, David Lawrence <davidl at jlab.org>, Ole
Hansen <ole at jlab.org>, Harut Avakian <avakian at jlab.org>, Brad Sawatzky
<brads at jlab.org>, Paul Mattione <pmatt at jlab.org>, Markus Diefenthaler
<mdiefent at jlab.org>, Graham Heyes <heyes at jlab.org>, Sandy Philpott
<philpott at jlab.org>
It was enjoyable to listen to the talks at the "Future Trends in NP
Computing" workshop last week. Some of the more amazing talks showed
what can be done with large investments from talented people plus ample
funding for R&D and deployment. I particularly enjoyed hearing how
machine learning trained the trigger for an LHC experiment, replacing
fast, simple trigger logic with (opaque?) better-performing software.
I would like to turn our thoughts toward the immediate future, and start
a new discussion of the "Near Term Evolution of JLab Scientific
Computing". Some of you on this list are regular attenders of the
Scientific Computing / Physics meetings, and so are included in this
email. Others are people who have been active in computing here at the
lab. This list is NOT carefully thought out, but does include people
from all halls, with an emphasis on GlueX since I receive more of their
emails to help me guess who might be interested. I invite all of you to
forward this to others whom you think would like to join this effort.
*Scope:* Figure out the best path forward for JLab computing for the
next 1-2 years, with a particular focus on meeting the needs of
Experimental Physics in FY18.
Our baseline plan has been to satisfy all computing requirements
in-house. This approach is certainly the most cost effective way to
deploy integrated Tflops or SpecIntRate capacity with good bandwidth to
all the data. But it has a known weakness: it works best when the load
is (mostly) constant throughout the year, while physicists work best
when there is infinite capacity on demand. Capacity that someone other
than DOE NP pays for may also be available, as it already is for GlueX
through their use of OSG resources. Somewhere there may be an optimum
where physicists perceive reasonable turnaround on real loads, and
money is being spent effectively.
1. put together a knowledgeable and interested group of experts
(in requirements, in technology, etc.)
2. update hall requirements for FY18, with particular emphasis on
a. I/O requirements per SpecInt
(roughly, how much bandwidth is needed to support a job of N cores)
b. real, necessary load fluctuations
(i.e., what peaks we need to support and are worth spending money on)
3. roughly evaluate the possible ways to satisfy these requirements
a. in-house baseline, assuming GlueX can offload work to the grid
b. pushing work to a DOE center (NERSC, ORNL, ...) for free
c. paying to use a Cloud for important peaks
d. expanding use of OSG to more halls
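To make item 2a concrete, here is a back-of-envelope sketch of the kind of arithmetic involved; the job count, core count, and per-core I/O rate below are illustrative placeholders, not measured hall requirements:

```python
# Back-of-envelope I/O estimate for multi-core farm jobs.
# All numeric inputs are illustrative assumptions, not hall figures.

def job_bandwidth_mb_s(cores, mb_s_per_core):
    """Aggregate I/O bandwidth a single N-core job needs (MB/s)."""
    return cores * mb_s_per_core

def farm_bandwidth_gb_s(n_jobs, cores, mb_s_per_core):
    """Total storage bandwidth if n_jobs such jobs run concurrently (GB/s)."""
    return n_jobs * job_bandwidth_mb_s(cores, mb_s_per_core) / 1000.0

# e.g. 100 concurrent 16-core jobs, each core reading 2 MB/s:
total = farm_bandwidth_gb_s(100, 16, 2.0)
print(f"{total:.1f} GB/s aggregate")  # 3.2 GB/s
```

Numbers like these, filled in with real per-hall measurements, are what would drive the disk and network sizing discussed below.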
Each of these approaches has different costs. For example, as more
work is pushed offsite, we will need to upgrade the lab's network from
10g to 40g (at least). Use of the grid has more impact on users than
in-house or cloud capacity does (cloud infrastructure-as-a-service can
appear to be part of our Auger / PBS system). Any solution that doubles
performance but keeps storage at JLab will require upgrades to the
local disk and tape subsystems, even if the farm remains fixed in size.
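As a rough illustration of the WAN point above (the 10g/40g link speeds come from the paragraph; the offsite job count and per-job I/O rate are placeholder assumptions):

```python
# Does a given offsite workload fit in the lab's WAN link?
# Link speeds in Gb/s; per-job I/O in MB/s (illustrative values only).

def wan_utilization(n_offsite_jobs, mb_s_per_job, link_gb_s):
    """Fraction of the WAN link consumed by offsite job I/O."""
    demand_gb_s = n_offsite_jobs * mb_s_per_job * 8 / 1000.0  # MB/s -> Gb/s
    return demand_gb_s / link_gb_s

# 200 offsite jobs at 10 MB/s each is 16 Gb/s of demand:
print(f"10g link: {wan_utilization(200, 10, 10):.0%}")  # 160% -- saturated
print(f"40g link: {wan_utilization(200, 10, 40):.0%}")  # 40%
```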
4. technology pilots of the best-looking alternatives
5. evaluation of existing software products, or in-house
software development, to make this possibly more complex
ecosystem transparent to the user
Initially, I would like us to meet weekly, with gaps when too many
people are unavailable. As the work becomes better defined, this
can switch to every 2-3 weeks, with people doing "homework" between
meetings and reporting back.
I would like preliminary decisions reached by mid August, so that if
there is any end of year funding available, we can put forward a
proposal for investments (in farm, disk, tape, or wide area
networking). I also see this helping to shape the FY18 budgets for the
affected divisions and groups. So, possibly 8 meetings.
Due to the HUGE uncertainty in budgets for FY18, we will plan against 2
scenarios with different numbers of weeks of running (which drives all
of JLab's experimental physics computing).
Anticipated evolution of code performance is an important topic.
*Your Next Steps*
1. Forward to key people omitted from this sparse list (I'm looking for
a total of 7-12 people, including myself, Graham, and Sandy)
2. Reply to me to let me know if you would like to be a participant
(everyone will get a report in late summer).
3. Help me in setting an initial meeting date: reply with what hours you
could be available on each of the following dates:
Thursday May 11
Thursday May 18
Monday May 22
Tuesday May 23
Wednesday May 24
I will get back to each of you who says "yes" as soon as it is clear
when we can reach a decent quorum. Remote participation can be
supported for working team members (no "listeners only" please).