[Halld-offline] ENP consumption of disk space under /work
Mark Ito
marki at jlab.org
Thu Jun 8 16:49:36 EDT 2017
Chip,
Indeed we do have to choose! You are quite right. Here are my notes and
conclusion.
On 5/31 the choices were:
(1) Physics cuts its use of /work by a factor of 4-5.
(2) Physics increases funding to $40K
(3) We pull a server out of Lustre, decreasing Physics' share of the
system, and use it as half of the new active-active pair, beefing it
up with SSDs and perhaps additional memory; this would actually shrink
Physics' near-term costs, but puts higher pressure on the file system
for the farm.
On 6/1:
Option 3: shrink /work 2x (to 172 TB), freeze cache & volatile, ~$5K
(performance hit on the farm)
new system has 2 JBODs, 2 controllers per head
might need to temporarily shrink space by 190 TB on top of the 2x,
to drain a node for re-purposing if LQCD needs its disk space back
Option 1: shrink /work 4x and increase cache and volatile by 25%
~$15K
(Chip's recommendation)
new system has 1 JBOD and can be expanded in the future
Option 2: /work as is (340 TB), increase cache & volatile by 25%,
~$40K
(over budget)
needs 3 JBODs for the new system, 2 cascaded for ENP
and also on 6/1:
We have NO intention of keeping /work on Lustre.
On 6/7:
So, in the absence of money (which clearly seems to be the case),
do you choose (a) reduce your use of work by 1.7x, or (b) reduce
your use of cache + volatile by 25%? There is no middle case.
And today, 6/8:
I'll offer one more option: we can use 10TB drives instead of
8TB.
and also today:
Default choice if no one else votes is now this: new /work server
is procured with 44 drives, 10 TB HGST He10 enterprise drives (top
rated, new big brother to our He8 drives).
So I am confused....
...but I think I like this last default choice. Let's see if I
understand it:
a) /work _not_ on Lustre
b) 44 TB total /work for Hall D
c) Hall D /cache and /volatile about the same as now or they grow a bit
d) we can afford this with this year's money
If I have that right[?!], then the default is fine. SSDs would be nice,
of course, but I don't think they are critical since we are compute-bound.
The big advance for us would be a more reliable work disk at the cost of
slightly less of it.
And don't get me wrong, I do appreciate the effort in coming up with a
solution. :-)
-- Mark
On 06/08/2017 11:42 AM, Chip Watson wrote:
>
> Mark,
>
> Sorry, you don't get off that easy :P. You have to choose, and not
> choosing means you choose what follows.
>
> I'll offer one more option: we can use 10TB drives instead of 8TB.
> Probably slightly longer rebuild times, but actually higher streaming
> bandwidth. Cost will go up, maybe a few $K. I think we can find a
> way to squeeze that out of the system.
>
> Default choice if no one else votes is now this: new /work server is
> procured with 44 drives, 10 TB HGST He10 enterprise drives (top rated,
> new big brother to our He8 drives). If Physics is poor, we can defer
> buying them the read cache SSD to compensate for the higher price of
> 10TB (Graham, can you swing an extra $3K?).
>
> 21 drives for ENP, 21 for LQCD, 2 hot spares in a 44-drive
> enclosure (SSDs in hosts)
>
> ENP and LQCD each get total quota (enforced) of 108 TB
>
> GlueX, CLAS-12 each get 44 TB, A and C get 10 TB (enforced)
>
> CLAS-12 says they can live within this, and C believes they are
> already there. GlueX and A will need to figure out what can be moved
> into /volatile.
>
> If requested, we can give you a modest-lifetime "pin" for /volatile so
> that some large data files that should not go to tape can have longer
> lifetimes in Lustre /volatile than they can today.
>
> Next year, for about $10K, we can add a second cascaded JBOD with one
> more RAID stripe for ENP. We'll also need to spend $25K to replace the
> Lustre storage that ages out in FY18 (or suffer the reduction).
>
> regards,
>
> Chip
>
>
> On 6/8/17 9:40 AM, Mark Ito wrote:
>>
>> That was just an observation. I would not interpret it as a vote for
>> anything... :-)
>>
>>
>> On 06/08/2017 09:33 AM, Chip Watson wrote:
>>>
>>> I'll take this as a vote by GlueX to have more work and reduce cache.
>>>
>>> Do A,B,C concur?
>>>
>>>
>>> On 6/8/17 9:28 AM, Mark Ito wrote:
>>>>
>>>> In my previous estimate, only 105 TB of the 278 TB cache portion is
>>>> pinned. The unpinned part is presumably old files that should be gone
>>>> but have not been deleted, since there happens to be no demand for the
>>>> space. If we use 105 TB as our cache usage, then re-doing your estimate
>>>> gives 555 TB, which means that in 9 months we will have 270 TB of
>>>> unused space. That would mean we have room to increase our usage
>>>> without buying anything!
>>>>
>>>>
>>>> On 06/07/2017 05:54 PM, Chip Watson wrote:
>>>>>
>>>>> Mark,
>>>>>
>>>>> I still need you to answer the question of how to further reduce
>>>>> usage and how to configure. Your usage, as you report it, is about
>>>>> 370 TB. Assuming that Hall B needs the same within 9 months, and
>>>>> that A+C need half as much, that leads to a total of 925 TB, which
>>>>> is more than Physics owns, by 100 TB (NOT CURRENT USAGE, JUST A
>>>>> PROJECTION BASED ON GLUEX USAGE).
>>>>>
>>>>> There is also the question of how to split the storage budget. In
>>>>> budget, you can have half of a new JBOD: 21 disks configured as 3
>>>>> RAID-Z2 stripes of 5+2 disks at 8 TB each, thus 120 TB of raw data
>>>>> capacity, 108 TB in a file system, and 86 TB at 80% full -- for all
>>>>> of GlueX, CLAS-12, A, and C. If GlueX is 40% of the total, that
>>>>> makes 35 TB, and you are still high by 70%.
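>>>>>
>>>>> (For reference, the arithmetic behind those numbers can be sketched
>>>>> in a few lines of Python; the ~10% file-system overhead and the 80%
>>>>> fill target are inferred from the figures above, and the projection
>>>>> line just restates the 370 TB extrapolation:)
>>>>>
>>>>> # Half a JBOD: 3 RAID-Z2 stripes of 5+2 disks, 8 TB per disk.
>>>>> stripes, data_disks, parity_disks, disk_tb = 3, 5, 2, 8
>>>>> disks = stripes * (data_disks + parity_disks)   # 21 disks
>>>>> raw = stripes * data_disks * disk_tb            # 120 TB of raw data capacity
>>>>> fs = raw * 0.90                                 # ~108 TB in a file system
>>>>> usable = fs * 0.80                              # ~86 TB at an 80% fill target
>>>>> # Projection: GlueX ~370 TB now, Hall B the same within 9 months,
>>>>> # A+C half as much.
>>>>> projection = 370 + 370 + 370 / 2                # 925 TB
>>>>> print(disks, raw, round(fs), round(usable), projection)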
>>>>>
>>>>> The other low-cost option is to re-purpose a 2016 Lustre node so
>>>>> that /work is twice this size (one full JBOD), and GlueX can use
>>>>> 70 TB as /work. But then you must reduce /cache + /volatile by a
>>>>> comparable amount, since we have to pull a node out of production.
>>>>> And this still isn't free, since we'll need a total of 4 RAID cards
>>>>> instead of 2 to get adequate performance, and we'll need to add
>>>>> SSDs to the mix.
>>>>>
>>>>> So, in the absence of money (which clearly seems to be the case),
>>>>> do you choose (a) reduce your use of work by 1.7x, or (b) reduce
>>>>> your use of cache + volatile by 25%? There is no middle case.
>>>>>
>>>>> thanks,
>>>>>
>>>>> Chip
>>>>>
>>>>>
>>>>> On 6/7/17 5:30 PM, Mark Ito wrote:
>>>>>>
>>>>>> Summarizing Hall D work disk usage (/work/halld only):
>>>>>>
>>>>>> o using du, today 2017-06-06, 59 TB
>>>>>>
>>>>>> o from our disk-management database, a couple of days ago,
>>>>>> 2017-06-04, 86 TB
>>>>>>
>>>>>> I also know that one of our students got rid of about 20 TB of
>>>>>> unneeded files yesterday. That accounts for part of the drop.
>>>>>>
>>>>>> We produce a report from that database
>>>>>> <https://halldweb.jlab.org/disk_management/work_report.html> that
>>>>>> is updated every few days.
>>>>>>
>>>>>> From the SciComp pages, Hall D is using 287 TB on cache and 21 TB
>>>>>> on volatile.
>>>>>>
>>>>>> My view is that this level of work disk usage is more or less as
>>>>>> expected, consistent with our previous estimates, and not
>>>>>> particularly abusive. That having been said, I am sure there is a
>>>>>> lot that can be cleaned up. But as Ole pointed out, disk usage
>>>>>> grows naturally and we were not aware that this was a problem. I
>>>>>> seem to recall that we agreed to respond to emails that would be
>>>>>> sent when we reached 90% of too much, no? Was the email sent out?
>>>>>>
>>>>>> One mystery: when I ask Lustre what we are using I get:
>>>>>>
>>>>>> ifarm1402:marki:marki> lfs quota -gh halld /lustre
>>>>>> Disk quotas for group halld (gid 267):
     Filesystem    used   quota   limit   grace     files   quota   limit   grace
        /lustre    290T    470T    500T       -  15106047       0       0       -
>>>>>> which is less than cache + volatile, not to mention work. I
>>>>>> thought that to a good approximation this 290 TB should be the
>>>>>> sum of all three. What am I missing?
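>>>>>>
>>>>>> (One way to cross-check, as a rough sketch only, is to sum du over
>>>>>> the three areas and compare it to the lfs group figure; the mount
>>>>>> points below are assumed, and du at this scale is slow:)
>>>>>>
>>>>>> import subprocess
>>>>>>
>>>>>> # Assumed Hall D areas on Lustre (adjust to the real paths).
>>>>>> areas = ["/work/halld", "/cache/halld", "/volatile/halld"]
>>>>>>
>>>>>> def du_tb(path):
>>>>>>     # 'du -sb' reports the total size of the tree in bytes.
>>>>>>     out = subprocess.check_output(["du", "-sb", path], text=True)
>>>>>>     return int(out.split()[0]) / 1e12
>>>>>>
>>>>>> total = sum(du_tb(p) for p in areas)
>>>>>> print(f"work + cache + volatile by du: {total:.0f} TB")
>>>>>> # Compare to 'lfs quota -gh halld /lustre' (290T above). If the du
>>>>>> # total is larger, some of what is being counted may not be
>>>>>> # group-halld files on /lustre.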
>>>>>>
>>>>>> On 05/31/2017 10:35 AM, Chip Watson wrote:
>>>>>>> All,
>>>>>>>
>>>>>>> As I have started on the procurement of the new /work file
>>>>>>> server, I have discovered that Physics' use of /work has grown
>>>>>>> unrestrained over the last year or two.
>>>>>>>
>>>>>>> "Unrestrained" because there is no way under Lustre to restrain
>>>>>>> it except via a very unfriendly Lustre quota system. As we
>>>>>>> leave some quota headroom to accommodate large swings in usage
>>>>>>> for each hall for cache and volatile, then /work continues to grow.
>>>>>>>
>>>>>>> Total /work has now reached 260 TB, several times larger than I
>>>>>>> was anticipating. This constitutes more than 25% of Physics'
>>>>>>> share of Lustre, compared to LQCD, which uses less than 5% of its
>>>>>>> disk space on the un-managed /work.
>>>>>>>
>>>>>>> It would cost Physics an extra $25K (total $35K - $40K) to treat
>>>>>>> the 260 TB as a requirement.
>>>>>>>
>>>>>>> There are 3 paths forward:
>>>>>>>
>>>>>>> (1) Physics cuts its use of /work by a factor of 4-5.
>>>>>>> (2) Physics increases funding to $40K
>>>>>>> (3) We pull a server out of Lustre, decreasing Physics' share of
>>>>>>> the system, and use it as half of the new active-active pair,
>>>>>>> beefing it up with SSDs and perhaps additional memory; this
>>>>>>> would actually shrink Physics' near-term costs, but puts higher
>>>>>>> pressure on the file system for the farm.
>>>>>>>
>>>>>>> The decision is clearly Physics', but I do need a VERY FAST
>>>>>>> response to this question, as I need to move quickly now for
>>>>>>> LQCD's needs.
>>>>>>>
>>>>>>> Hall D + GlueX, 96 TB
>>>>>>> CLAS + CLAS12, 98 TB
>>>>>>> Hall C, 35 TB
>>>>>>> Hall A <unknown, still scanning>
>>>>>>>
>>>>>>> Email, call (x7101), or drop by today 1:30-3:00 p.m. for
>>>>>>> discussion.
>>>>>>>
>>>>>>> thanks,
>>>>>>> Chip
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
Mark Ito, marki at jlab.org, (757)269-5295