[Halld-offline] Fwd: An update on ongoing issues
marki at jlab.org
Sat Feb 13 14:11:24 EST 2021
News from Bryan Hess, FYI. He addresses interactive response on the
ifarms and the situation with tape file recovery.
-------- Forwarded Message --------
Subject: An update on ongoing issues
Date: Fri, 12 Feb 2021 16:07:12 -0500
From: Bryan Hess <bhess at jlab.org>
To: Brad Sawatzky <brads at jlab.org>, Ole Hansen <ole at jlab.org>, Nathan
Baltzell <baltzell at jlab.org>, Mark Ito <marki at jlab.org>
CC: Graham Heyes <heyes at jlab.org>, Andy Kowalski <kowalski at jlab.org>,
David Lawrence <davidl at jlab.org>, Jie Chen <chen at jlab.org>
I owe you updates on several topics after our meeting last week.
We've been looking into reports of intermittent slow interactive
response on the ifarm/farm networks. We are making three changes that
address aspects of this:
1. The scientific computing firewall will be replaced Tuesday as part
of scheduled monthly maintenance. The switchover is quick and will
increase bandwidth from 1 to 10 Gigabit/sec for firewalled
connections. I will get a full maintenance reminder out soon.
2. NIS and LDAP lookups have been intermittently slow. We are
implementing lookup caching on ifarm systems as a mitigation while
we look for the root cause.
3. There is a procurement in progress for a new pair of work file
servers dedicated to ENP (no more shared failover with LQCD). Once
we take delivery, we will configure some more modern filesystem
features including a better SSD caching layer.
Tape Library News:
We met with IBM about the corroded tape drive problem again today. There
were no unexpected findings from the chemical analysis of the head
corrosion, which rules out some causes. Electron microscopy shows severe
corrosion on only one drive so far. Two more drives are being
disassembled for the same analysis.
The working theory is that the tape errors stem from one bad drive
and its corrosion was exacerbated by higher-than-normal humidity
(although it was in spec). As a result, we are installing additional
humidity monitors in the library and test coupons to check for corrosive
agents. We are working with facilities management to decrease peak
humidity in the data center, aiming for 30% - 50% as the operating range.
It is possible that we are past this problem, but out of caution we will
continue to write duplicate tapes of all new data for the interim. We
will continue to do a verify pass on all data written during the period when we
believe the bad tape drive was in service. These two tasks are placing
additional load on the tape library, which will increase service time
for a while.
Our next steps will depend on what we find from IBM's next drive analysis.
Recovery of the initial 18 volumes continues. IBM needs ~96 hours per
tape recovery because of the block-by-block process between a pair of
That's the news for Friday. Have a good weekend.