[Jlab-scicomp-briefs] Update on Infiniband and Lustre availability
Bryan Hess
bhess at jlab.org
Wed Jul 28 09:49:02 EDT 2021
An update on recent events affecting farm availability:
* An InfiniBand (IB) switch failed yesterday morning around 7:30am because of a firmware bug that causes the switch's SSD to wear out suddenly. The vendor has furnished a candidate fix, which we will test and deploy as soon as possible.
* The IB switch failure and replacement caused stability issues with Lustre yesterday that were resolved during the day.
* Last night at 7:50pm, an unrelated hardware failure on a Lustre storage server caused the overnight Lustre outage. The system was recovered this morning, and analysis of the root cause continues.
The farm is now processing jobs again.