[Halld-offline] ENP consumption of disk space under /work
Mark Ito
marki at jlab.org
Thu Jun 8 09:40:19 EDT 2017
That was just an observation. I would not interpret it as a vote for
anything... :-)
On 06/08/2017 09:33 AM, Chip Watson wrote:
>
> I'll take this as a vote by GlueX to have more work and reduce cache.
>
> Do A,B,C concur?
>
>
> On 6/8/17 9:28 AM, Mark Ito wrote:
>>
>> In my previous estimate, of the 278 TB cache portion, only 105 TB is
>> pinned. The unpinned part is presumably old files that should be gone
>> but have not been deleted, since there happens to be no demand for
>> the space. If we use 105 TB as our cache usage, then re-doing your
>> estimate gives 555 TB, which means that in 9 months we will have
>> 270 TB of unused space. That would mean we have room to increase our
>> usage without buying anything!
>>
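>> One way to see the arithmetic (a rough sketch in Python; I am
>> assuming the 96 TB work figure from your May 31 scan and the 825 TB
>> Physics share implied by your 925 TB / 100 TB statement below, so
>> treat this as a reconstruction rather than an exact recipe):
>>
>> gluex = 96 + 105 + 21              # work + pinned cache + volatile, in TB
>> total = gluex + gluex + gluex / 2  # Hall B the same, A+C half -> 555 TB
>> owned = 925 - 100                  # Physics' Lustre share implied below, TB
>> print(total, owned - total)        # 555.0 and 270.0, i.e. ~270 TB of headroom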
>>
>> On 06/07/2017 05:54 PM, Chip Watson wrote:
>>>
>>> Mark,
>>>
>>> I still need you to answer the question of how to further reduce
>>> usage and how to configure. Your usage as you report it is about
>>> 370 TB. Assuming that Hall B needs the same within 9 months, and
>>> that A+C need half as much, then that leads to a total of 925 TB,
>>> which is more than Physics owns by 100 TB (NOT CURRENT USAGE, JUST
>>> PROJECTION BASED ON GLUEX USAGE).
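>>>
>>> In round numbers (a sketch; the 825 TB figure for Physics' total
>>> share is not quoted anywhere here, it is just what the 100 TB
>>> overshoot implies):
>>>
>>> gluex_projection = 370            # TB, GlueX usage as reported
>>> total = gluex_projection * 2.5    # Hall B the same, A+C half -> 925 TB
>>> physics_share = total - 100       # so Physics owns about 825 TB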
>>>
>>> There is also the question of how to split the storage budget.
>>> Within budget, you can have half of a new JBOD: 21 disks configured
>>> as 3 RAID-Z2 stripes of 5+2 disks at 8 TB each, thus 120 TB of raw
>>> data, 108 TB in a file system, and 86 TB at 80% full -- for all of
>>> GlueX, CLAS-12, A and C. If GlueX is 40% of the total, that makes
>>> 35 TB, and you are still high by 70%.
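>>>
>>> The capacity chain, as a sketch (the file-system overhead here is
>>> just read off the 120 TB / 108 TB figures above, not a measured
>>> number):
>>>
>>> data_disks = 3 * 5       # three RAID-Z2 stripes of 5+2: data disks only
>>> raw = data_disks * 8     # 8 TB disks -> 120 TB of raw data capacity
>>> fs = 108                 # TB visible in a file system, per the figure above
>>> usable = 0.80 * fs       # ~86 TB when kept at 80% full
>>> gluex = 0.40 * usable    # ~35 TB if GlueX is 40% of the total
>>> print(59 / gluex)        # ~1.7: GlueX's 59 TB of /work is "high by 70%"
>>> # The full-JBOD option below would roughly double these numbers:
>>> # ~172 TB at 80% full, ~70 TB for a 40% GlueX share.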
>>>
>>> The other low cost option is to re-purpose a 2016 Lustre node so
>>> that /work is twice this size (one full JBOD), and GlueX can use
>>> 70 TB as /work. But then you must reduce /cache + /volatile by a
>>> comparable amount, since we have to pull a node out of production.
>>> And this still isn't free, since we'll need a total of 4 RAID cards
>>> instead of 2 to get adequate performance, and we'll need to add
>>> SSDs to the mix.
>>>
>>> So, in the absence of money (which clearly seems to be the case), do
>>> you choose (a) to reduce your use of /work by 1.7x, or (b) to reduce
>>> your use of /cache + /volatile by 25%? There is no middle case.
>>>
>>> thanks,
>>>
>>> Chip
>>>
>>>
>>> On 6/7/17 5:30 PM, Mark Ito wrote:
>>>>
>>>> Summarizing Hall D work disk usage (/work/halld only):
>>>>
>>>> o using du, today 2017-06-06, 59 TB
>>>>
>>>> o from our disk-management database, a couple of days ago,
>>>> 2017-06-04, 86 TB
>>>>
>>>> I also know that one of our students got rid of about 20 TB of
>>>> unneeded files yesterday. That accounts for part of the drop.
>>>>
>>>> We produce a report from that database
>>>> <https://halldweb.jlab.org/disk_management/work_report.html> that
>>>> is updated every few days.
>>>>
>>>> From the SciComp pages, Hall D is using 287 TB on cache and 21 TB
>>>> on volatile.
>>>>
>>>> My view is that this level of work disk usage is more or less as
>>>> expected, consistent with our previous estimates, and not
>>>> particularly abusive. That having been said, I am sure there is a
>>>> lot that can be cleaned up. But as Ole pointed out, disk usage
>>>> grows naturally and we were not aware that this was a problem. I
>>>> seem to recall that we agreed to respond to emails that would be
>>>> sent when we reached 90% of too much, no? Was the email sent out?
>>>>
>>>> One mystery: when I ask Lustre what we are using I get:
>>>>
>>>> ifarm1402:marki:marki> lfs quota -gh halld /lustre
>>>> Disk quotas for group halld (gid 267):
>>>>      Filesystem    used   quota   limit   grace     files   quota   limit   grace
>>>>         /lustre    290T    470T    500T       -  15106047       0       0       -
>>>>
>>>> which is less than cache + volatile, not to mention work. I thought
>>>> that, to a good approximation, this 290 TB should be the sum of all
>>>> three. What am I missing?
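>>>>
>>>> Just to put numbers on the puzzle (using the figures quoted above;
>>>> the 59 TB is the du number for work):
>>>>
>>>> cache, volatile, work = 287, 21, 59   # TB, as reported above
>>>> expected = cache + volatile + work    # ~367 TB if the three simply added up
>>>> reported = 290                        # TB, from lfs quota -gh halld
>>>> print(expected - reported)            # ~77 TB that I cannot account for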
>>>>
>>>> On 05/31/2017 10:35 AM, Chip Watson wrote:
>>>>> All,
>>>>>
>>>>> As I have started on the procurement of the new /work file server,
>>>>> I have discovered that Physics' use of /work has grown
>>>>> unrestrained over the last year or two.
>>>>>
>>>>> "Unrestrained" because there is no way under Lustre to restrain it
>>>>> except via a very unfriendly Lustre quota system. As we leave
>>>>> some quota headroom to accommodate large swings in usage for each
>>>>> hall for cache and volatile, then /work continues to grow.
>>>>>
>>>>> Total /work has now reached 260 TB, several times larger than I
>>>>> was anticipating. This constitutes more than 25% of Physics'
>>>>> share of Lustre, compared to LQCD, which uses less than 5% of its
>>>>> disk space on the un-managed /work.
>>>>>
>>>>> It would cost Physics an extra $25K (total $35K - $40K) to treat
>>>>> the 260 TB as a requirement.
>>>>>
>>>>> There are 3 paths forward:
>>>>>
>>>>> (1) Physics cuts its use of /work by a factor of 4-5.
>>>>> (2) Physics increases funding to $40K.
>>>>> (3) We pull a server out of Lustre, decreasing Physics' share of
>>>>> the system, and use it as half of the new active-active pair,
>>>>> beefing it up with SSDs and perhaps additional memory; this would
>>>>> actually shrink Physics' near-term costs, but it puts higher
>>>>> pressure on the file system for the farm.
>>>>>
>>>>> The decision is clearly Physics', but I do need a VERY FAST
>>>>> response to this question, as I need to move quickly now for
>>>>> LQCD's needs.
>>>>>
>>>>> Hall D + GlueX, 96 TB
>>>>> CLAS + CLAS12, 98 TB
>>>>> Hall C, 35 TB
>>>>> Hall A <unknown, still scanning>
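>>>>>
>>>>> (What the three known halls add up to; the Hall A remainder is only
>>>>> an inference from the 260 TB total, not a scanned number:)
>>>>>
>>>>> known = 96 + 98 + 35      # TB: Hall D/GlueX + CLAS/CLAS12 + Hall C = 229
>>>>> print(260 - known)        # ~31 TB left, roughly what Hall A could be using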
>>>>>
>>>>> Email, call (x7101), or drop by today 1:30-3:00 p.m. for discussion.
>>>>>
>>>>> thanks,
>>>>> Chip
>>>>>
>>>>
>>>> --
>>>> Mark Ito, marki at jlab.org, (757)269-5295
>>>
>>
>> --
>> Mark Ito, marki at jlab.org, (757)269-5295
>
--
Mark Ito, marki at jlab.org, (757)269-5295