[Halld-offline] speeds for download of secondary GlueX datasets from JLab

Curtis A. Meyer cmeyer at cmu.edu
Mon May 23 21:51:06 EDT 2016


Thanks for all the testing that people have been doing. It is very interesting to see where the issues
are in moving data around. As far as GlueX goes, our plan has always been to be able to move REST
data offsite for remote analysis, but I don't think we ever thought we would want to move one year's
worth of data in a few days. I suspect that Chip's limits are OK at this point in time, but we should
certainly watch things. We probably also need to consider a model where we send the data to one
point (say IU) and then replicate it from there as needed.

Curtis
---------
Curtis A. Meyer			MCS Associate Dean for Faculty and Graduate Affairs
Wean:    (412) 268-2745	Professor of Physics
Doherty: (412) 268-3090	Carnegie Mellon University
Fax:         (412) 681-0648	Pittsburgh, PA 15213
curtis.meyer at cmu.edu	http://www.curtismeyer.com/



> On May 23, 2016, at 9:31 PM, Shepherd, Matthew <mashephe at indiana.edu> wrote:
> 
> 
> Richard,
> 
> This is interesting - thanks for sharing.  I expect we share a common second bottleneck, as I was also using commodity drives in a RAID array.
> 
> I have not tried to copy directly to, for example, our "data capacitor", which is a university-managed high-throughput work disk with several petabytes of capacity (10 TB quota for starters).  It may be fun to try copying between UConn and IU using these resources.
> 
> I agree that having some data mirrors organized on a web page with the appropriate grid ftp endpoints makes sense.
> 
> Chip:  I think that our preliminary conclusion is that we don't need this as a production service.  We can likely operate within your suggested limits "on the fringe" and provide a useful, organized, data resource to people who want to use production data outside of JLab.
> 
> Matt
> 
> ---------------------------------------------------------------------
> Matthew Shepherd, Associate Professor
> Department of Physics, Indiana University, Swain West 265
> 727 East Third Street, Bloomington, IN 47405
> 
> Office Phone:  +1 812 856 5808
> 
>> On May 23, 2016, at 5:25 PM, Richard Jones <richard.t.jones at uconn.edu> wrote:
>> 
>> Dear Matt,
>> 
>> Following up on your query regarding download speeds for fetching secondary datasets from JLab to offsite storage resources, I have the following experience to share.
>> 	• first bottleneck: switch buffer overflows (last 6 feet) -- the data path was 10Gb from the source to my server until the last 6 feet, where it dropped to 1Gb. Performance (TCP) was highly asymmetric: 95 MB/s upload speed, but poor and oscillating download speed averaging 15 MB/s. The asymmetry was due to buffer overflows at the switch port where the path necks down from 10Gb to 1Gb -- TCP has no back-pressure mechanism except packet loss, which tends to be catastrophic over high-latency pathways with the standard Linux kernel congestion algorithms (cubic, htcp; see the first sketch after this list).
>> 	• second bottleneck: disk speed on the receiving server -- as soon as I replaced the last 6 feet with a 10Gb NIC / transceiver, I moved up to the next resistance point, around 140 MB/s on my server. Using diagnostics I could see that my disk drives (2 commodity SATA 1TB drives in parallel) were both saturating their write queues (the second sketch after this list shows that kind of monitoring). At this speed I was filling up my disks fast, so I had to start simultaneous jobs to flush the files from the temporary filesystem on the receiving server to permanent storage in my dCache. Once the drives were interleaving reads and writes, the download performance dropped to around 70 MB/s net for both drives.
>> 	• third bottleneck: fear of too much success -- to see what the next limiting point might be, I switched to a data transfer node that the UConn data center made available for testing. It combines a 10Gb NIC connected to a central campus switch with what Dell calls a high-performance RAID (Dell H700, 500GB, probably a large fraction of it SSD). On this system I never saw the disks saturate their read/write queues. However, the throughput rose quickly as the transfers started, and as soon as I saw transfers exceeding 300 MB/s I remembered Chip's warning and cancelled the job. I then decreased the number of parallel streams (from the Globus Online defaults) to limit the impact on JLab infrastructure. Using just 1 simultaneous transfer / 2 parallel streams (the Globus default is 2 / 4) I was seeing a steady-state rate between 150 and 200 MB/s average download speed, even while simultaneously downloading and pushing from the fast RAID to my dCache (multiple parallel jobs) -- which was necessary to keep from overflowing the 500GB partition in a matter of minutes. Decreasing the Globus options to just 1 / 1, I was able to limit the speed to 120 MB/s, which is still enough to make me happy for now (the third sketch after this list shows an equivalent throttle from the command line).
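>> 
>> For the congestion-algorithm point in the first bullet, a minimal Python sketch (assuming the standard Linux /proc layout -- these are the usual kernel entries, nothing specific to these transfers) that reports which algorithm a host is actually using:
>> 
>>     # Report the TCP congestion control algorithm in use and the
>>     # alternatives compiled into the kernel (standard Linux /proc entries).
>>     def read_proc(path):
>>         with open(path) as f:
>>             return f.read().strip()
>> 
>>     print("in use:   ", read_proc("/proc/sys/net/ipv4/tcp_congestion_control"))
>>     print("available:", read_proc("/proc/sys/net/ipv4/tcp_available_congestion_control"))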
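>> 
>> For the write-queue saturation in the second bullet, a rough Python sketch of that kind of monitoring (assuming Linux /proc/diskstats; the device names sda/sdb are placeholders for whatever drives back the receiving filesystem):
>> 
>>     import time
>> 
>>     WATCH = {"sda", "sdb"}   # placeholder device names
>> 
>>     def in_flight():
>>         # field 12 of /proc/diskstats = I/Os currently in progress per device
>>         stats = {}
>>         with open("/proc/diskstats") as f:
>>             for line in f:
>>                 fields = line.split()
>>                 if fields[2] in WATCH:
>>                     stats[fields[2]] = int(fields[11])
>>         return stats
>> 
>>     while True:   # a queue pinned near the device limit is a saturated drive
>>         print(in_flight())
>>         time.sleep(1)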
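>> 
>> For the stream limits in the third bullet, the numbers above were set through the Globus Online web interface, but the same kind of cap can be expressed from the gridftp command line. A sketch using globus-url-copy (the endpoint URLs are placeholders, and it assumes the Globus Toolkit client plus a valid grid proxy):
>> 
>>     import subprocess
>> 
>>     SRC = "gsiftp://gridftp.example.org/path/to/rest/"   # placeholder source endpoint
>>     DST = "file:///scratch/rest/"                        # placeholder local destination
>> 
>>     subprocess.check_call([
>>         "globus-url-copy",
>>         "-cc", "1",   # one file transferred at a time
>>         "-p", "2",    # two parallel TCP streams per file
>>         "-r",         # recurse into the source directory
>>         "-vb",        # print throughput while transferring
>>         SRC, DST,
>>     ])
>> 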
>> I know that without any fiddling you were able to get somewhere between bottlenecks 1 and 2 above. From this log of lessons learned, I suspect you will know what steps you might take to increase your speed to the next resistance point. One suggestion for the future: we should coordinate this. For example, anyone who wants offsite access to the PS triggers should get them from UConn, not fetch them again from JLab, since we already have the full Spring 2016 set in gridftp-accessible storage at UConn. Likewise for whatever you pull to IU. Perhaps we should set up a central place where we record what GlueX data is available, where, and by what protocol.
>> 
>> -Richard Jones
> 
