[Halld-offline] speeds for download of secondary GlueX datasets from JLab
Richard Jones
richard.t.jones at uconn.edu
Mon May 23 17:25:45 EDT 2016
Dear Matt,
Following up on your query regarding download speeds for fetching secondary
datasets from JLab to offsite storage resources, I have the following
experience to share.
1. *first bottleneck:* switch buffer overflows (last 6 feet) -- the data
path was 10Gb from the source to my server until the last 6 feet, where it
dropped to 1Gb. TCP performance was highly asymmetric: 95 MB/s upload speed,
but a poor and oscillating download speed averaging *15 MB/s*. This asymmetry
was due to buffer overflows at the switch port where the path necks down from
10Gb to 1Gb -- TCP has no back-pressure mechanism other than packet loss,
which tends to be catastrophic over high-latency paths with the standard
Linux kernel congestion algorithms (cubic, htcp). A back-of-the-envelope
illustration of this is the first sketch after this list.
2. *second bottleneck:* disk speed on the receiving server -- as soon as I
replaced the last 6 feet with a 10Gb NIC / transceiver, I moved up to the
next resistance point, around *140 MB/s* on my server. Diagnostics showed
that my disk drives (2 commodity SATA 1TB drives in parallel) were both
saturating their write queues. At this speed I was filling up my disks fast,
so I had to start simultaneous jobs to flush these files from the temporary
filesystem on the receiving server to permanent storage in my dCache. Once
the drives were interleaving reads and writes, the download performance
dropped to around *70 MB/s* net for both drives (the second sketch after
this list puts rough numbers on this).
3. *third bottleneck:* fear of too much success -- to see what the next
limiting point might be, I switched to a data transfer node that the UConn
data center made available for testing. It combines a 10Gb NIC connected to
a central campus switch with what Dell calls a high-performance RAID (Dell
H700, 500GB, probably a large fraction of it SSD). On this system I never
saw the disks saturate their read/write queues. The throughput rose quickly
as the transfers started, however, and as soon as I saw transfers exceeding
*300 MB/s* I remembered Chip's warning and cancelled the job. I then
decreased the number of parallel streams (from the Globus Online defaults)
to limit the impact on JLab infrastructure. Using just 1 simultaneous
transfer / 2 parallel streams (the Globus default is 2 / 4) I saw a
steady-state rate between 150 and *200 MB/s* average download speed, even
while simultaneously downloading and pushing from the fast RAID to my dCache
(multiple parallel jobs) -- which was necessary to keep from overflowing
this 500GB partition in a matter of minutes (see the third sketch after
this list). Decreasing the Globus options to just 1 / 1, I was able to
limit the speed to *120 MB/s*, which is still enough to make me happy for
now.
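To put rough numbers behind the first bottleneck, the standard Mathis et al.
approximation for loss-limited TCP throughput, rate ~ (MSS/RTT) * 1.22/sqrt(p),
shows how even a small packet-loss rate at WAN latencies caps a single stream
far below the 1Gb line rate. The Python sketch below is only illustrative: the
RTT and loss values are plausible guesses, not measurements on the JLab-UConn
path.

    # Loss-limited TCP throughput, Mathis approximation:
    #   rate ~ (MSS / RTT) * 1.22 / sqrt(p), single stream, no window limit.
    # The RTT and loss rates below are illustrative guesses, not measurements.
    from math import sqrt

    MSS = 1460     # bytes per segment (typical Ethernet MTU minus headers)
    RTT = 0.030    # seconds round trip, a plausible wide-area latency

    for loss in (1e-5, 1e-4, 1e-3):  # packets dropped at the 10Gb->1Gb port
        rate = (MSS / RTT) * 1.22 / sqrt(loss)   # bytes per second
        print(f"loss {loss:.0e}: ~{rate / 1e6:.1f} MB/s per stream")

    # loss 1e-05: ~18.8 MB/s, 1e-04: ~5.9 MB/s, 1e-03: ~1.9 MB/s.
    # Even modest loss from the overflowing switch buffer explains a ~15 MB/s
    # download average while the loss-free upload direction runs at wire speed.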
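For the second bottleneck, the need for flush jobs is just arithmetic: at
140 MB/s the two 1TB staging drives fill in about four hours, and once they
serve the flush reads as well as the incoming writes, the net ingest roughly
halves. The rates in this sketch are the ones quoted above; nothing else is
measured.

    # Fill time of the staging disks, and the cost of flushing them in place.
    capacity = 2 * 1e12           # bytes: two commodity 1TB SATA drives
    ingest_write_only = 140e6     # bytes/s when the drives only take writes
    ingest_interleaved = 70e6     # bytes/s net once flush reads are interleaved

    fill_time_h = capacity / ingest_write_only / 3600.0
    print(f"time to fill 2 TB at 140 MB/s: {fill_time_h:.1f} h")   # ~4.0 h

    # Without flush jobs the transfer stalls after ~4 hours. Flushing to dCache
    # concurrently keeps the staging area from filling, but interleaving reads
    # with writes cuts the net download rate roughly in half, to ~70 MB/s.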
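The same bookkeeping explains why the 500GB partition on the data transfer
node is so unforgiving, and why throttling the Globus Online settings helps:
the partition survives only if the flush jobs to dCache drain it faster than
the download fills it. In the sketch below the flush rate of 100 MB/s is an
assumed figure for illustration, not something I measured.

    # Time until the 500GB staging partition overflows at various download
    # rates, given a concurrent flush to dCache at an assumed 100 MB/s.
    staging = 500e9      # bytes on the Dell H700 RAID partition
    flush = 100e6        # bytes/s drained to dCache (assumed, illustrative)

    for label, download in (("uncapped, >300 MB/s", 300e6),
                            ("1 transfer / 2 streams, ~175 MB/s", 175e6),
                            ("1 transfer / 1 stream, ~120 MB/s", 120e6)):
        net = download - flush                 # bytes/s of net growth
        if net <= 0:
            print(f"{label}: never overflows, flush keeps up")
        else:
            print(f"{label}: overflows in ~{staging / net / 60:.0f} min")

    # Roughly: 42 min uncapped, 111 min at 1/2, 417 min at 1/1 -- throttling
    # the Globus settings is what buys the flush jobs enough margin to keep up.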
I know that without any fiddling you were able to get somewhere between
bottlenecks 1 and 2 above. From this log of lessons learned, I suspect you
will know what steps you might take to increase your speed to the next
resistance point. One suggestion for the future: we should coordinate this.
For example, anyone who wants offsite access to the PS triggers should get
them from UConn, not fetch them again from JLab, since we already have the
full set of them from Spring 2016 in gridftp-accessible storage at UConn.
The same would go for whatever you pull to IU. Perhaps we should set up a
central place where we record what GlueX data is available, where, and by
what protocol.
-Richard Jones