[Moller] [Jlab-scicomp-briefs] Summary of recent filesystem/lustre issues

Wesley Moore via Jlab-scicomp-briefs jlab-scicomp-briefs at jlab.org
Thu Dec 4 10:33:39 EST 2025


We wanted to provide a brief summary of the recent stability issues some users have experienced.
The primary cause has been failing 20TB disks. When these disks begin to die, they can hang in a way that causes ZFS to suspend I/O, which in turn leads to system-wide filesystem stalls and unresponsiveness.
In addition, we identified and replaced a bad memory module (DIMM) in one of the SCOSS systems, which likely caused, or contributed to, several of the crashes we observed.
These hardware failures together were the major sources of our recent problems. We will continue monitoring closely.
Best regards,
Wesley, on behalf of the Scientific Computing Operations Team

--

This is an announcement-only list for Jefferson Lab Scientific Computing Updates.

Subscription and List Archive: https://mailman.jlab.org/mailman/listinfo/jlab-scicomp-briefs

For help: https://jlab.servicenowservices.com/scicomp

