[Jlab-scicomp-briefs] Update on Infiniband and Lustre availability

Bryan Hess bhess at jlab.org
Wed Jul 28 09:49:02 EDT 2021


An update on recent events affecting farm availability:


  *   An InfiniBand (IB) switch failure yesterday morning around 7:30am was the result of a firmware bug that causes SSDs to wear out suddenly. The vendor has furnished a candidate fix, which we will test and deploy as soon as possible.


  *   The IB switch failure and replacement caused stability issues with Lustre yesterday that were resolved during the day.


  *   Last night at 7:50pm, an unrelated hardware failure on a Lustre storage server caused the overnight Lustre outage. The system was recovered this morning, and analysis of the root cause continues.

The farm is now processing jobs again.




