[Jlab-scicomp-briefs] Update on Infiniband and Lustre availability
    Bryan Hess 
    bhess at jlab.org
       
    Wed Jul 28 09:49:02 EDT 2021
    
    
  
An update on recent events affecting farm availability:
  *   An InfiniBand (IB) switch failed yesterday morning around 7:30am because of a firmware bug that causes the SSD to wear out suddenly. The vendor has furnished a candidate fix, which we will test and deploy as soon as possible.
  *   The IB switch failure and replacement caused Lustre stability issues yesterday; these were resolved during the day.
  *   Last night at 7:50pm, an unrelated hardware failure on a Lustre storage server caused the overnight Lustre outage. The system was recovered this morning, and analysis of the root cause continues.
The farm is now processing jobs again.