[Halld-offline] update on OSG running for dc2
Curtis A. Meyer
cmeyer at cmu.edu
Mon Apr 7 08:41:29 EDT 2014
Hi Richard -
Thanks for the update on the OSG. It is very interesting to see what happened when you went above 10k cores.
Curtis
---------
Curtis A. Meyer MCS Associate Dean for Faculty and Graduate Affairs
Wean: (412) 268-2745 Professor of Physics
Doherty: (412) 268-3090 Carnegie Mellon University
Fax: (412) 681-0648 Pittsburgh, PA 15213
curtis.meyer at cmu.edu http://www.curtismeyer.com/
On Apr 7, 2014, at 8:27 AM, Richard Jones <richard.t.jones at uconn.edu> wrote:
> Hello dc2 crew,
>
> As you can see from the graph of running processes,
>
> http://gryphn.phys.uconn.edu/vofrontend/monitor/frontendStatus.html
>
> we hit an instability in the OSG production at around 11,000 cores. To get the time on the graph right, you need to select the "-4 hr" timezone from the menu at the bottom of the page.
>
> We had been running on Saturday at around 10k cores for a few hours and everything seemed OK, so I decided to bump up our maximum cores request to 12k. I did this around 6:00pm. Immediately we started getting more cores, which looked great. But as soon as the running process count hit around 10.5k, I started seeing big fluctuations in the swap rate on the submit host. By the time we hit 11k running processes, my submit host had reached >50% of CPU time spent swapping, which quickly led to a runaway situation where running processes were queuing up for attention, leading to more swapping, and so on. Within 10 minutes the submit host had reached a CPU load of 4200 processes and 99.7% of CPU time spent swapping. If the submit host goes away, the jobs automatically exit after a timeout, currently around 10 minutes, so within 15 minutes our production had cleared out.
>
> I restarted production with a cap at 9k cores just to be conservative, and after it ran smoothly for 24 hours, I increased the cap to 10k cores. I plan to run with this throttle in place until a new shipment of RAM for my submit host arrives later this week.
>
> This is why we do these exercises, to find out where the critical points are and resolve them, right?
>
> -Richard Jones
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline