[Halld-offline] ENP consumption of disk space under /work
Mark Ito
marki at jlab.org
Thu Jun 8 16:49:36 EDT 2017
Chip,
Indeed we do have to choose! You are quite right. Here are my notes and
conclusion.
On 5/31 the choices were:
(1) Physics cuts its use of /work by a factor of 4-5.
(2) Physics increases funding to $40K
(3) We pull a server out of Lustre, decreasing Physics' share of the
system, and use it as half of the new active-active pair, beefing it
up with SSDs and perhaps additional memory; this would actually shrink
Physics' near-term costs, but puts higher pressure on the file system
for the farm.
On 6/1:
Option 3: shrink /work 2x (to 172 TB), freeze cache & volatile, ~$5K
(performance hit on the farm)
new system has 2 JBODs, 2 controllers per head
might need to temporarily shrink space by 190 TB on top of the 2x,
to drain a node for re-purposing if LQCD needs its disk space back
Option 1: shrink /work 4x and increase cache and volatile by 25%
~$15K
(Chip's recommendation)
new system has 1 JBOD and can be expanded in the future
Option 2: /work as is (340 TB), increase cache & volatile by 25%,
~$40K
(over budget)
needs 3 JBODs for the new system, 2 cascaded for ENP
and also on 6/1:
We have NO intention of keeping /work on Lustre.
On 6/7:
So, in the absence of money (which clearly seems to be the case),
do you choose (a) reduce your use of work by 1.7x, or (b) reduce
your use of cache + volatile by 25%? There is no middle case.
And today, 6/8:
I'll offer one more option: we can use 10TB drives instead of
8TB.
and also today:
Default choice if no one else votes is now this: new /work server
is procured with 44 drives, 10 TB HGST He10 enterprise drives (top
rated, new big brother to our He8 drives).
So I am confused....
...but I think I like this last default choice. Let's see if I
understand it:
a) /work _not_ on Lustre
b) 44 TB total /work for Hall D
c) Hall D /cache and /volatile about the same as now or they grow a bit
d) we can afford this with this year's money
If I have that right[?!], then the default is fine. SSDs would be nice,
of course, but I don't think they are critical since we are compute-bound.
The big advance for us would be a more reliable work disk at the cost of
slightly less of it.
And don't get me wrong, I do appreciate the effort in coming up with a
solution. :-)
-- Mark
On 06/08/2017 11:42 AM, Chip Watson wrote:
>
> Mark,
>
> Sorry, you don't get off that easy :P. You have to choose, and not
> choosing means you choose what follows.
>
> I'll offer one more option: we can use 10TB drives instead of 8TB.
> Probably slightly longer rebuild times, but actually higher streaming
> bandwidth. Cost will go up, maybe a few $K. I think we can find a
> way to squeeze that out of the system.
>
> Default choice if no one else votes is now this: new /work server is
> procured with 44 drives, 10 TB HGST He10 enterprise drives (top rated,
> new big brother to our He8 drives). If Physics is poor, we can defer
> buying them the read cache SSD to compensate for the higher price of
> 10TB (Graham, can you swing an extra $3K?).
>
> 21 drives for ENP, 21 for LQCD, 2 hot spares in a 44-drive
> enclosure (SSDs in hosts)
>
> ENP and LQCD each get total quota (enforced) of 108 TB
>
> GlueX, CLAS-12 each get 44 TB, A and C get 10 TB (enforced)
>
> CLAS-12 says they can live within this, and C believes they are
> already there. GlueX and A will need to figure out what can be moved
> into /volatile.
>
> If requested, we can give you a modest-lifetime "pin" for /volatile so
> that some large data files that should not go to tape can have longer
> lifetimes in Lustre /volatile than they can today.
>
> Next year, for about $10K, we can add a second cascaded JBOD with one
> more RAID stripe for ENP. We'll also need to spend $25K to replace the
> Lustre storage that ages out in FY18 (or suffer the reduction).
>
> regards,
>
> Chip
>
>
> On 6/8/17 9:40 AM, Mark Ito wrote:
>>
>> That was just an observation. I would not interpret it as a vote for
>> anything... :-)
>>
>>
>> On 06/08/2017 09:33 AM, Chip Watson wrote:
>>>
>>> I'll take this as a vote by GlueX to have more work and reduce cache.
>>>
>>> Do A,B,C concur?
>>>
>>>
>>> On 6/8/17 9:28 AM, Mark Ito wrote:
>>>>
>>>> In my previous estimate, only 105 TB of the 278 TB cache portion is
>>>> pinned. The unpinned part is presumably old files that should be gone
>>>> but have not been deleted, since there happens to be no demand for the
>>>> space. If we use 105 TB as our cache usage, then re-doing your estimate
>>>> gives 555 TB, which means that in 9 months we will have 270 TB of
>>>> unused space. That would mean we have room to increase our usage
>>>> without buying anything!
>>>>
>>>>
>>>> On 06/07/2017 05:54 PM, Chip Watson wrote:
>>>>>
>>>>> Mark,
>>>>>
>>>>> I still need you to answer the question of how to further reduce
>>>>> usage and how to configure. Your usage, as you report it, is about
>>>>> 370 TB. Assuming that Hall B needs the same within 9 months, and
>>>>> that A+C need half as much, that leads to a total of 925 TB, which
>>>>> is more than Physics owns, by 100 TB (NOT CURRENT USAGE, JUST A
>>>>> PROJECTION BASED ON GLUEX USAGE).
>>>>>
>>>>> There is also the question of how to split the storage budget. In
>>>>> budget, you can have half of a new JBOD: 21 disks configured as 3
>>>>> RAID-Z2 stripes of 5+2 disks at 8 TB each, thus 120 TB of raw data
>>>>> capacity, 108 TB in a file system, and 86 TB at 80% full -- for all
>>>>> of GlueX, CLAS-12, A, and C. If GlueX is 40% of the total, that
>>>>> makes 35 TB, and you are still high by 70%.
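>>>>>
>>>>> (For reference, the arithmetic behind those numbers can be sketched
>>>>> in a few lines of Python; the ~10% file-system overhead and the 80%
>>>>> fill target are inferred from the figures above, and the projection
>>>>> line just restates the 370 TB extrapolation:)
>>>>>
>>>>> # Half a JBOD: 3 RAID-Z2 stripes of 5+2 disks, 8 TB per disk.
>>>>> stripes, data_disks, parity_disks, disk_tb = 3, 5, 2, 8
>>>>> disks = stripes * (data_disks + parity_disks)   # 21 disks
>>>>> raw = stripes * data_disks * disk_tb            # 120 TB of raw data capacity
>>>>> fs = raw * 0.90                                 # ~108 TB in a file system
>>>>> usable = fs * 0.80                              # ~86 TB at an 80% fill target
>>>>> # Projection: GlueX ~370 TB now, Hall B the same within 9 months,
>>>>> # A+C half as much.
>>>>> projection = 370 + 370 + 370 / 2                # 925 TB
>>>>> print(disks, raw, round(fs), round(usable), projection)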
>>>>>
>>>>> The other low-cost option is to re-purpose a 2016 Lustre node so
>>>>> that /work is twice this size (one full JBOD), and GlueX can use
>>>>> 70 TB as /work. But then you must reduce /cache + /volatile by a
>>>>> comparable amount, since we have to pull a node out of production.
>>>>> And this still isn't free, since we'll need a total of 4 RAID cards
>>>>> instead of 2 to get adequate performance, and we'll need to add
>>>>> SSDs to the mix.
>>>>>
>>>>> So, in the absence of money (which clearly seems to be the case),
>>>>> do you choose (a) reduce your use of work by 1.7x, or (b) reduce
>>>>> your use of cache + volatile by 25%? There is no middle case.
>>>>>
>>>>> thanks,
>>>>>
>>>>> Chip
>>>>>
>>>>>
>>>>> On 6/7/17 5:30 PM, Mark Ito wrote:
>>>>>>
>>>>>> Summarizing Hall D work disk usage (/work/halld only):
>>>>>>
>>>>>> o using du, today 2017-06-06, 59 TB
>>>>>>
>>>>>> o from our disk-management database, a couple of days ago,
>>>>>> 2017-06-04, 86 TB
>>>>>>
>>>>>> I also know that one of our students got rid of about 20 TB of
>>>>>> unneeded files yesterday. That accounts for part of the drop.
>>>>>>
>>>>>> We produce a report from that database
>>>>>> <https://halldweb.jlab.org/disk_management/work_report.html> that
>>>>>> is updated every few days.
>>>>>>
>>>>>> From the SciComp pages, Hall D is using 287 TB on cache and 21 TB
>>>>>> on volatile.
>>>>>>
>>>>>> My view is that this level of work disk usage is more or less as
>>>>>> expected, consistent with our previous estimates, and not
>>>>>> particularly abusive. That having been said, I am sure there is a
>>>>>> lot that can be cleaned up. But as Ole pointed out, disk usage
>>>>>> grows naturally and we were not aware that this was a problem. I
>>>>>> seem to recall that we agreed to respond to emails that would be
>>>>>> sent when we reached 90% of too much, no? Was the email sent out?
>>>>>>
>>>>>> One mystery: when I ask Lustre what we are using I get:
>>>>>>
>>>>>> ifarm1402:marki:marki> lfs quota -gh halld /lustre
>>>>>> Disk quotas for group halld (gid 267):
     Filesystem    used   quota   limit   grace     files   quota   limit   grace
        /lustre    290T    470T    500T       -  15106047       0       0       -
>>>>>> which is less than cache + volatile, not to mention work. I
>>>>>> thought that to a good approximation this 290 TB should be the
>>>>>> sum of all three. What am I missing?
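>>>>>>
>>>>>> (One way to cross-check, as a rough sketch only, is to sum du over
>>>>>> the three areas and compare it to the lfs group figure; the mount
>>>>>> points below are assumed, and du at this scale is slow:)
>>>>>>
>>>>>> import subprocess
>>>>>>
>>>>>> # Assumed Hall D areas on Lustre (adjust to the real paths).
>>>>>> areas = ["/work/halld", "/cache/halld", "/volatile/halld"]
>>>>>>
>>>>>> def du_tb(path):
>>>>>>     # 'du -sb' reports the total size of the tree in bytes.
>>>>>>     out = subprocess.check_output(["du", "-sb", path], text=True)
>>>>>>     return int(out.split()[0]) / 1e12
>>>>>>
>>>>>> total = sum(du_tb(p) for p in areas)
>>>>>> print(f"work + cache + volatile by du: {total:.0f} TB")
>>>>>> # Compare to 'lfs quota -gh halld /lustre' (290T above). If the du
>>>>>> # total is larger, some of what is being counted may not be
>>>>>> # group-halld files on /lustre.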
>>>>>>
>>>>>> On 05/31/2017 10:35 AM, Chip Watson wrote:
>>>>>>> All,
>>>>>>>
>>>>>>> As I have started on the procurement of the new /work file
>>>>>>> server, I have discovered that Physics' use of /work has grown
>>>>>>> unrestrained over the last year or two.
>>>>>>>
>>>>>>> "Unrestrained" because there is no way under Lustre to restrain
>>>>>>> it except via a very unfriendly Lustre quota system. As we
>>>>>>> leave some quota headroom to accommodate large swings in usage
>>>>>>> for each hall for cache and volatile, then /work continues to grow.
>>>>>>>
>>>>>>> Total /work has now reached 260 TB, several times larger than I
>>>>>>> was anticipating. This constitutes more than 25% of Physics'
>>>>>>> share of Lustre, compared to LQCD, which uses less than 5% of its
>>>>>>> disk space on the un-managed /work.
>>>>>>>
>>>>>>> It would cost Physics an extra $25K (total $35K - $40K) to treat
>>>>>>> the 260 TB as a requirement.
>>>>>>>
>>>>>>> There are 3 paths forward:
>>>>>>>
>>>>>>> (1) Physics cuts its use of /work by a factor of 4-5.
>>>>>>> (2) Physics increases funding to $40K
>>>>>>> (3) We pull a server out of Lustre, decreasing Physics' share of
>>>>>>> the system, and use it as half of the new active-active pair,
>>>>>>> beefing it up with SSDs and perhaps additional memory; this
>>>>>>> would actually shrink Physics' near-term costs, but puts higher
>>>>>>> pressure on the file system for the farm.
>>>>>>>
>>>>>>> The decision is clearly Physics', but I do need a VERY FAST
>>>>>>> response to this question, as I need to move quickly now for
>>>>>>> LQCD's needs.
>>>>>>>
>>>>>>> Hall D + GlueX, 96 TB
>>>>>>> CLAS + CLAS12, 98 TB
>>>>>>> Hall C, 35 TB
>>>>>>> Hall A <unknown, still scanning>
>>>>>>>
>>>>>>> Email, call (x7101), or drop by today 1:30-3:00 p.m. for
>>>>>>> discussion.
>>>>>>>
>>>>>>> thanks,
>>>>>>> Chip
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
Mark Ito, marki at jlab.org, (757)269-5295