[Jlab-scicomp-briefs] June 10 afternoon update on the farm Lustre stability issue
    Bryan Hess 
    bhess at jlab.org
       
    Mon Jun 10 15:56:34 EDT 2024
    
    
  
June 10 afternoon update on the farm Lustre (/cache and /volatile) stability issue:
Today's unplanned maintenance work focused on the search for InfiniBand-related causes of the Lustre stability issue. Work included switch reloads, host adapter firmware updates, and examination of intra-switch links for errors or capacity problems. Evidence continues to point to a network-layer bug that causes the servers to reboot.
Data stored on Lustre remains safe. We have had no issues with the underlying disk storage subsystem. Jobs submitted to the farm with SWIF or Slurm will queue during periods of debugging. Please continue to submit work, and we will release it as we are able.
As of this writing, Lustre is running with a reduced number of servers and a reduced number of farm worker nodes. We will run in this state overnight. Tomorrow morning, we will evaluate next steps and provide another update.
--
This message is sent to an automated list of people with farm and ifarm access.