[Jlab-scicomp-briefs] ifarm Node Maintenance and Upcoming Changes
Wesley Moore
wmoore at jlab.org
Mon Oct 9 10:05:04 EDT 2023
As you may have experienced in recent weeks, our ifarm nodes have been encountering multiple issues, primarily problems with the work disk caused by InfiniBand dropouts and memory fragmentation. These issues have required periodic reboots and other manual intervention to restore normal system operation.
Despite our best efforts, we have been unable to resolve these ongoing issues. Therefore, on maintenance day, October 17th, we plan to make additional hardware configuration changes to address them.
These changes involve upgrading the InfiniBand cards, which will require a brief outage of the systems. To minimize disruption, we will publish a schedule for this outage in advance so that you can plan around it.
In the interim, we will continue to perform reboots as necessary to maintain system operability. Wall warning messages will continue to be used for immediate notifications of any required reboots, but we will also notify this mailing list to keep you informed about the ongoing status and updates.
We appreciate your patience and cooperation as we work to resolve these issues and improve the reliability of our ifarm nodes. If you have any questions or concerns, please do not hesitate to reach out to our support team.
Thank you for your understanding.