[lqcd-users] JLab LQCD job failures and Tuesday Feb 20 Maintenance
Bryan Hess
bhess at jlab.org
Wed Feb 21 13:38:30 EST 2024
On Tuesday, Feb 20, as part of the scheduled monthly maintenance window, Slurm for LQCD was upgraded to version 22. This was done in preparation for the 24s systems to support AlmaLinux 9. As part of pre-maintenance testing, the existing controller was cloned to a test system and the Slurm upgrade was vetted in a staging environment to ensure that the maintenance day work was fully understood ahead of time. Unfortunately, the test system interacted with the production system and caused jobs to fail, draining the queues. Normally, controls prevent the test system from interacting with the production system, and this approach has been used before. We will examine the differences that led to the job failures. The Slurm upgrade itself proceeded without further issue, and systems are now operating normally.