[Halld-offline] update on OSG running for dc2

Curtis A. Meyer cmeyer at cmu.edu
Mon Apr 7 08:41:29 EDT 2014


Hi Richard -

  Thanks for the update on the OSG. It is very interesting to see what happened when you went above 10k cores.

    Curtis
---------
Curtis A. Meyer			MCS Associate Dean for Faculty and Graduate Affairs
Wean:    (412) 268-2745	Professor of Physics
Doherty: (412) 268-3090	Carnegie Mellon University
Fax:         (412) 681-0648	Pittsburgh, PA 15213
curtis.meyer at cmu.edu	http://www.curtismeyer.com/



On Apr 7, 2014, at 8:27 AM, Richard Jones <richard.t.jones at uconn.edu> wrote:

> Hello dc2 crew,
> 
> As you can see from the graph of running processes,
> 
> http://gryphn.phys.uconn.edu/vofrontend/monitor/frontendStatus.html
> 
> we hit an instability in the OSG production at around 11,000 cores.  To get the time on the graph right, you need to select the "-4 hr" timezone from the menu at the bottom of the page.
> 
> We had been running on Saturday at around 10k cores for a few hours and everything seemed OK, so I decided to bump up our maximum cores request to 12k.  I did this around 6:00pm. Immediately we started getting more cores, which looked great. But as soon as the running process count hit around 10.5k, I started seeing big fluctuations in the swap rate on the submit host.  By the time we hit 11k running processes, my submit host was spending >50% of its CPU time swapping, which quickly led to a runaway situation where running processes were queuing up for attention, leading to more swapping, etc.  Within 10 minutes the submit host had reached a CPU load of 4200 processes, with 99.7% of CPU time spent swapping.  If the submit host goes away, the jobs automatically exit after a timeout, currently around 10 minutes, so within 15 minutes our production had cleared out.
> 
> I restarted production with a cap at 9k cores just to be conservative, and after that ran smoothly for 24 hours, I increased it to 10k cores.  I plan to run with this throttle in place until a new shipment of RAM for my submit host arrives later this week.
> 
> This is why we do these exercises, to find out where the critical points are and resolve them, right?
> 
> -Richard Jones
