[Halld-offline] Fwd: An update on ongoing issues

Sat Feb 13 14:11:24 EST 2021

News from Bryan Hess, FYI. He addresses interactive response on the 
ifarms and the situation with tape file recovery.

-------- Forwarded Message --------
Subject: 	An update on ongoing issues
Date: 	Fri, 12 Feb 2021 16:07:12 -0500
From: 	Bryan Hess <bhess at jlab.org>
To: 	Brad Sawatzky <brads at jlab.org>, Ole Hansen <ole at jlab.org>, Nathan 
Baltzell <baltzell at jlab.org>, Mark Ito <marki at jlab.org>
CC: 	Graham Heyes <heyes at jlab.org>, Andy Kowalski <kowalski at jlab.org>, 
David Lawrence <davidl at jlab.org>, Jie Chen <chen at jlab.org>

I owe you updates on several topics after our meeting last week.

Interactive Response:

We've been looking into reports of intermittent slow interactive 
response on the ifarm/farm networks. We are making three changes that 
address aspects of this:

 1. The scientific computing firewall will be replaced Tuesday as part
    of scheduled monthly maintenance. The switchover is quick and will
    increase bandwidth from 1 to 10 Gigabit/sec for firewalled
    connections. I will get a full maintenance reminder out soon.
 2. NIS and LDAP lookups have been intermittently slow. We are
    implementing lookup caching on ifarm systems as a mitigation while
    we look for the root cause.
 3. There is a procurement in progress for a new pair of work file
    servers dedicated to ENP (no more shared failover with LQCD). Once
    we take delivery, we will configure some more modern filesystem
    features including a better SSD caching layer.

Tape Library News:

We met with IBM about the corroded tape drive problem again today. There 
were no unexpected findings from the chemical analysis of the head 
corrosion, which rules out some causes. Electron microscopy shows severe 
corrosion on only one drive so far. Two more drives are being 
disassembled for the same analysis.

The working theory is that the tape errors stem from one bad drive 
and that its corrosion was exacerbated by higher-than-normal humidity 
(although it was in spec). As a result, we are installing additional 
humidity monitors in the library and test coupons to check for corrosive 
agents. We are working with facilities management to decrease peak 
humidity in the data center, aiming for 30% - 50% as the operating range.

It is possible that we are past this problem, but out of caution we will 
continue to write duplicate tapes of all new data for the interim. We 
will also continue to run a verify pass on all data written during the 
period when we believe the bad tape drive was in service. These two tasks are placing 
additional load on the tape library, which will increase service time 
for a while.

Our next steps will depend on what we find from IBM's next drive analysis.

Recovery of the initial 18 volumes continues. IBM needs ~96 hours per 
tape recovery because of the block-by-block process between a pair of 
drives.

That's the news for Friday. Have a good weekend.
