[Jlab-scicomp-briefs] Summary of recent filesystem/lustre issues
Wesley Moore
wmoore at jlab.org
Thu Dec 4 10:33:39 EST 2025
We wanted to provide a brief summary of the recent stability issues some users have experienced.
The primary cause has been failing 20 TB disks. When one of these disks begins to fail, it can hang in a way that causes ZFS to suspend I/O, which in turn leads to system-wide filesystem stalls and unresponsiveness.
In addition, we identified and replaced a bad memory module (DIMM) in one of the SCOSS systems, which likely caused, or at least contributed to, several of the crashes we observed.
These hardware failures together were the major sources of our recent problems. We will continue monitoring closely.
Best regards,
Wesley, on behalf of the Scientific Computing Operations Team