[Jlab-scicomp-briefs] Completion of AlmaLinux 9 Farm Transition; Lustre stability update

Bryan Hess bhess at jlab.org
Wed Jul 3 08:21:03 EDT 2024


July 16 Maintenance Day for Alma9
The transition of the JLab compute farm from CentOS 7 to AlmaLinux 9 (EL9) has been completed. On the next maintenance day, two user-visible changes will be made:
First, the DNS alias ifarm.jlab.org will point to the new EL9 nodes, ifarm2401 and ifarm2402. The ifarm7 nodes will be retired.
Second, to complete the AlmaLinux migration in SLURM, we will remove the el7 feature. This may require changes to your SLURM job submissions. Here is how to plan now for the change:

  1. Review SLURM Job Submission Scripts:
     * Identify where the el7 constraint is specified in your SLURM job submission scripts. This is typically done with the -C or --constraint flag followed by the constraint specification. For example: --constraint=el7|el9
     * Remove the el7 constraint from your job submission commands. If you have other constraints, adjust accordingly.
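
As a rough aid for the review step above, the following Python sketch lists the lines of a job script whose constraint value mentions el7. It is only an illustration, not an official JLab tool: the function name find_el7_lines and the sample script are hypothetical, and the pattern only covers the -C and --constraint forms described above, so please double-check your scripts by hand as well.

```python
import re

# Hypothetical helper, not a JLab-provided tool.
# Matches "--constraint=VALUE", "--constraint VALUE", or "-C VALUE",
# where VALUE may be quoted or unquoted.
CONSTRAINT_RE = re.compile(
    r"""(?:--constraint(?:=|\s+)|(?<![\w-])-C\s*)(?P<value>"[^"]*"|'[^']*'|\S+)"""
)


def find_el7_lines(script_text):
    """Return (line_number, line) pairs whose constraint value mentions el7."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        for match in CONSTRAINT_RE.finditer(line):
            value = match.group("value").strip("\"'")
            if re.search(r"\bel7\b", value):
                hits.append((number, line))
                break  # one report per line is enough
    return hits


if __name__ == "__main__":
    # Made-up sample script used only to demonstrate the helper.
    sample = (
        "#!/bin/bash\n"
        "#SBATCH --constraint=el7|el9\n"
        "#SBATCH -C el9\n"
        "srun ./my_analysis\n"
    )
    for number, line in find_el7_lines(sample):
        print(f"line {number}: {line}")
```

The sketch only reports matches and leaves the edit to you, since the right replacement depends on which other constraints a job needs.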

Failure to remove the el7 constraint after maintenance day will result in failed job submissions. The resulting error will be similar to:

  Unable to allocate resources: Invalid feature specification



Farm Lustre Status

The cause of the June issue with Lustre on the farm has been identified, and the system has been stabilized. The cause was traced to a bug in the ConnectX-3 InfiniBand card driver on older farm nodes, which caused server crashes when both the old (lustre19) and new (lustre24) filesystems were mounted. Planning is underway for a transition from the existing 5 PB Lustre system to the new 11 PB Lustre system that works around this issue. A schedule for this change will be announced once planning is complete.

