[Hps] [Jlab-scicomp-briefs] Update on Infiniband and Lustre availability

Bryan Hess bhess at jlab.org
Wed Jul 28 09:49:02 EDT 2021


An update on recent events affecting farm availability:


  *   An InfiniBand (IB) switch failed yesterday morning around 7:30am because of a firmware bug that causes SSDs to wear out suddenly. The vendor has furnished a candidate fix, which we will test and deploy as soon as possible.


  *   The IB switch failure and its replacement caused Lustre stability issues yesterday; these were resolved during the day.


  *   Last night at 7:50pm, an unrelated hardware failure on a Lustre storage server caused the overnight Lustre outage. The system was recovered this morning, and analysis of the root cause continues.

The farm is now processing jobs again.



_______________________________________________
Jlab-scicomp-briefs mailing list
Jlab-scicomp-briefs at jlab.org
https://mailman.jlab.org/mailman/listinfo/jlab-scicomp-briefs

