[Halld-offline] update on OSG running for dc2

Mon Apr 7 08:27:50 EDT 2014

Hello dc2 crew,

As you can see from the graph of running processes,

http://gryphn.phys.uconn.edu/vofrontend/monitor/frontendStatus.html

we hit an instability in the OSG production at around 11,000 cores.  To get
the time on the graph right, you need to select the "-4 hr" timezone from
the menu at the bottom of the page.

We had been running on Saturday at around 10k cores for a few hours and
everything seemed OK, so I decided to bump up our maximum cores request to
12k.  I did this around 6:00pm.  Immediately we started getting more cores,
which looked great.  But as soon as the running process count hit around
10.5k, I started seeing big fluctuations in the swap rate on the submit
host.  By the time we hit 11k running processes, my submit host was
spending more than 50% of its CPU time swapping, which quickly led to a
runaway situation where running processes were queuing up for attention,
leading to more swapping, etc.  Within 10 minutes the submit host had
reached a CPU load of 4200 processes, with 99.7% of CPU time spent
swapping.  If the submit host goes away, the jobs automatically exit after
a timeout, currently around 10 minutes, so within 15 minutes our production
had cleared out.

I restarted production with a cap at 9k cores just to be conservative, and
once that had run smoothly for 24 hours, I increased the cap to 10k cores.
I plan to run with this throttle in place until a new shipment of RAM for
my submit host arrives later this week.
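
For illustration only, the kind of throttle described above could be
automated along the following lines.  This is a minimal sketch, not the
procedure actually used for dc2: the function name, the 10% swap limit, the
1k step size and the floor and ceiling values are all hypothetical, and the
swap fraction is assumed to be measured separately on the submit host.

```python
def next_core_cap(current_cap, swap_fraction, step=1000, floor=1000,
                  ceiling=12000, swap_limit=0.10):
    """Suggest a new maximum-cores request from the submit host's swap load.

    All names and default values here are hypothetical, chosen only to
    illustrate the idea of backing off before swapping runs away.
    swap_fraction is the fraction of CPU time spent swapping (0.0 to 1.0).
    """
    if not 0.0 <= swap_fraction <= 1.0:
        raise ValueError("swap_fraction must be between 0 and 1")
    if swap_fraction > swap_limit:
        # Submit host is under memory pressure: lower the cap by one step,
        # but never below the floor.
        return max(floor, current_cap - step)
    # Submit host looks healthy: raise the cap by one step, up to the ceiling.
    return min(ceiling, current_cap + step)
```

With these assumed defaults, a host spending half its CPU time swapping at
a cap of 11k would be stepped down to 10k, while a host that is not
swapping at a cap of 9k would be stepped up to 10k.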

This is why we do these exercises, to find out where the critical points
are and resolve them, right?

-Richard Jones