[Halld-offline] expected bandwidth offsite
Shepherd, Matthew
mashephe at indiana.edu
Tue May 17 12:00:16 EDT 2016
Chip,
I'm trying to understand the guidance in your message. Are you saying that large offsite data transfers (more than roughly 10 TB) are not permitted? That they may be done, but may be throttled at JLab if needed? Or that they need to be officially coordinated somehow?
The idea here is to enhance the analysis capability of the collaboration while having minimal impact on computing resources at the lab. The last part of that sentence is what I want to be careful about. If you tell me that copying 10 TB offsite starves other high-priority tasks of needed resources, then I need to reevaluate what I'm trying to do.
My copies were of data that were cached on disk from a recent data processing run - I did not have to stage anything or run any jobs on the farm to do this.
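For a sense of scale, here is a quick back-of-the-envelope sketch (plain Python, nothing JLab-specific) of how long a copy of this size takes at sustained rates like those mentioned in this thread; the 10 TB volume and the 50/130/500 MB/s figures quoted in the messages below are the only inputs.

    # Rough transfer-time estimate; the volume and rates are the figures
    # quoted in this thread, everything else is arithmetic.
    TB = 1e12  # bytes (decimal terabyte)

    def transfer_days(volume_tb, rate_mb_per_s):
        """Days needed to move volume_tb terabytes at a sustained rate in MB/s."""
        seconds = (volume_tb * TB) / (rate_mb_per_s * 1e6)
        return seconds / 86400.0

    for rate in (50, 130, 500):  # MB/s values mentioned in the thread
        print("10 TB at %3d MB/s ~ %.1f days" % (rate, transfer_days(10, rate)))

    # 10 TB at  50 MB/s ~ 2.3 days
    # 10 TB at 130 MB/s ~ 0.9 days
    # 10 TB at 500 MB/s ~ 0.2 days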
Matt
---------------------------------------------------------------------
Matthew Shepherd, Associate Professor
Department of Physics, Indiana University, Swain West 265
727 East Third Street, Bloomington, IN 47405
Office Phone: +1 812 856 5808
> On May 17, 2016, at 11:28 AM, Chip Watson <watson at jlab.org> wrote:
>
> Matt,
>
> Our provisioning is lean and balanced, meaning that we have provisioned tape and disk bandwidth based upon the size of the farm, now 3K+ cores, moving this summer to 6K-7K cores (2014-generation cores; older cores are as much as 2x slower). Another upgrade comes in the fall to reach ~8K-10K cores (same units). Upgrades are based upon the halls' long-range plans, and do come with upgrades in disk and tape bandwidth to keep the system balanced.
>
> "Free" cycles (i.e. not funded by DOE) would require some infrastructure provisioning to make sure that these free cycles don't slow down our other systems (i.e. make sure that there is a real gain). Small contributions can be absorbed as being in the noise, but larger contributions will require that we do some type of QOS so that we don't starve local resources. (Example, these recent file transfers of many terabytes did need to tie up a noticeable portion of GlueX's /cache or /volatile quotas.)
>
> All this is to say that you can't really double the performance of the farm for I/O-intensive activities without real investments in I/O. We are lean enough that a 10% addition to the farm would be noticed during busy periods.
>
> We do have some headroom -- which is why we can move GlueX DAQ from 400-800 on very short notice. But we have priorities in place: (1) DAQ, (2) high utilization of LQCD/HPC, (3) high utilization of the farm, (4) critical WAN transfers (e.g. LQCD data products coming in from supercomputing centers for onsite analysis), (5) interactive analysis, (6) everything else. Your file transfers are in (6) as they are outside of the reviewed computing plans.
>
> We are always open to adjusting everything to maximize science (e.g. rapidly doubling DAQ), and the Physics Division is the one that would be engaged (as the funding source) in these optimizations.
>
> regards,
> Chip
>
>
> On 5/17/16 7:56 AM, Shepherd, Matthew wrote:
>> Chip,
>>
>> This was my initial question: what is a reasonable rate to expect?
>>
>> I'm not displeased with 50 MB/s, but I'd like to know if, and how, there are mechanisms to gain another factor of 2 or 4. About a year ago I managed 130 MB/s averaged over an hour.
>>
>> I'm basically trying to explore options to move analysis offsite to alleviate load on the JLab farm. I think we may have some significant resources here at the university level to tap into. I can imagine things like hosting a data mirror here if your offsite pipeline is not so big. Local IT folks seem to enjoy pushing limits and are always looking for customers. Our flagship research machine here, which is getting old by local standards, is a ~1000-node Cray with 22,000 cores and a 5 PB high-throughput work disk. Accounts are free to any researcher at the university - it is provided as a service. I'll certainly only be able to get a fraction of this, but a fraction may be significant on the scale of the problems we're working with.
>>
>> Matt
>>
>> ---------------------------------------------------------------------
>> Matthew Shepherd, Associate Professor
>> Department of Physics, Indiana University, Swain West 265
>> 727 East Third Street, Bloomington, IN 47405
>>
>> Office Phone: +1 812 856 5808
>>
>>> On May 16, 2016, at 10:34 PM, Chip Watson <watson at jlab.org> wrote:
>>>
>>> All,
>>>
>>> Keep in mind that Scientific Computing is not provisioned to deliver that kind of bandwidth (500 MB/s) to you. If you actually succeeded, we'd probably see a negative impact on operations and have to kill the connection.
>>>
>>> Chip
>>>
>>> On 5/16/16 8:45 PM, Shepherd, Matthew wrote:
>>>>> On May 16, 2016, at 5:32 PM, Richard Jones <richard.t.jones at uconn.edu> wrote:
>>>>> What did you use for the number of parallel streams per job, and the number of simultaneous jobs, to get the performance you saw?
>>>> I simply used the Globus web client and created a personal endpoint using Globus Connect Personal running on Linux. I didn't modify anything from the defaults. The single request was the only request I submitted during that time frame.
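The same kind of transfer can also be scripted rather than driven from the web client. Below is a minimal sketch using the Globus Python SDK (globus_sdk, v3 assumed); the client ID, endpoint UUIDs, and paths are placeholders, not actual JLab or IU values.

    import globus_sdk

    # Placeholders -- substitute real values; nothing here is JLab-specific.
    CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
    SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"      # e.g. a lab data transfer endpoint
    DST_ENDPOINT = "PERSONAL-ENDPOINT-UUID"    # Globus Connect Personal endpoint

    # Interactive native-app login to obtain a transfer token.
    auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
    auth.oauth2_start_flow(
        requested_scopes="urn:globus:auth:scope:transfer.api.globus.org:all"
    )
    print("Log in at:", auth.oauth2_get_authorize_url())
    tokens = auth.oauth2_exchange_code_for_tokens(input("Paste auth code: "))
    token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(token)
    )

    # One recursive transfer task; Globus handles retries and checksum checks.
    task = globus_sdk.TransferData(
        source_endpoint=SRC_ENDPOINT,
        destination_endpoint=DST_ENDPOINT,
        verify_checksum=True,
    )
    # Hypothetical source/destination paths, for illustration only.
    task.add_item("/cache/halld/SOME_RUN_PERIOD/", "/data/halld/", recursive=True)
    result = tc.submit_transfer(task)
    print("Submitted Globus task:", result["task_id"])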
>>>>
>>>> I'm not disappointed with 50 MB/s averaged over 2.5 days, but in principle, I have ten times that bandwidth going into the machine.
>>>>
>>>> Matt
>>>>
>>>> ---------------------------------------------------------------------
>>>> Matthew Shepherd, Associate Professor
>>>> Department of Physics, Indiana University, Swain West 265
>>>> 727 East Third Street, Bloomington, IN 47405
>>>>
>>>> Office Phone: +1 812 856 5808
>>>>
>>>>
>