[Jlab-scicomp-briefs] Compute Farm Updates
Bryan Hess
bhess at jlab.org
Fri Oct 22 11:46:59 EDT 2021
Upcoming Changes to the Scientific Computing Environment
On Tuesday morning, October 26th, the following changes will be made to the farm:
* ifarm1801 will be patched from CentOS 7.7 to CentOS 7.9 and rebooted.
* All farm13 nodes (but only farm13 nodes) will be patched from CentOS 7.7 to CentOS 7.9 and rebooted.
* OSG tools will be updated on 7.9 nodes.
Updates to 7.9 are prerequisites for CUDA and Open Science Grid tools. We are increasing the frequency of patching to improve cybersecurity and to be more responsive to grid computing software updates.
Recent Changes to the Scientific Computing Environment
* On the October 19th maintenance day, GPU-enabled machines (the sciml nodes) were updated to CentOS 7.9 to support CUDA 11.4.2.
* JupyterHub was similarly updated to include CUDA 11.4.2, TensorFlow 2.6.0, and Torch 1.9.1+cu111.
If you experience any software incompatibilities in the transition from 7.7 to 7.9, please submit a ServiceNow request. Barring unforeseen issues, we anticipate incrementally rolling the rest of the farm to CentOS 7.9 over the next month. As we do, Slurm will be updated to reflect the state of worker nodes as described in the next section.
Slurm-Specific Notes
Most batch jobs run in the production partition, which is the default. Information about other Slurm capabilities, including GPU and Jupyter Notebook use, can be found online here: https://scicomp.jlab.org/docs/farmsDocHome
During this transition from CentOS 7.7 to 7.9, all nodes will remain in the production partition in Slurm. In the unlikely case that you need to limit your jobs to a particular set of nodes, you can submit using the "--constraint" flag with one of the features listed below (such as centos77). The features will be updated as nodes are patched and can be examined using the "sinfo" command. For example, this shows the number of nodes in the production partition and their features:
ifarm1901 ~ $ sinfo --partition=production -o "%8D %12P %f"
NODES    PARTITION    AVAIL_FEATURES
107      production*  farm19,amd,centos77,general
82       production*  farm14,centos77,general,xeon
40       production*  farm16,centos77,general,xeon
16       production*  farm13,centos77,general,xeon
81       production*  farm18,centos77,general,xeon
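As the rollout proceeds, it may be useful to check how many production nodes still carry a given feature before constraining a job to it. The following is a minimal sketch, not an official farm tool: the function names nodes_with_feature and sample_sinfo are hypothetical, the sample rows are copied from the listing above, and myjob.sh stands in for your own batch script.

```shell
#!/bin/bash
# Sum the node counts of every sinfo row whose feature list contains FEATURE.
# Expects sinfo output in the "%8D %12P %f" layout on stdin:
#   NODES  PARTITION  AVAIL_FEATURES
nodes_with_feature() {
    local feature="$1"
    awk -v want="$feature" '
        $1 !~ /^[0-9]+$/ { next }          # skip the header (or any non-data row)
        {
            n = split($3, feats, ",")      # third column: comma-separated features
            for (i = 1; i <= n; i++)
                if (feats[i] == want) { total += $1; break }
        }
        END { print total + 0 }            # print 0 when nothing matched
    '
}

# Sample rows copied from the listing above, so the function can be
# tried without access to the cluster.
sample_sinfo() {
    cat <<'EOF'
NODES    PARTITION    AVAIL_FEATURES
107      production*  farm19,amd,centos77,general
82       production*  farm14,centos77,general,xeon
40       production*  farm16,centos77,general,xeon
16       production*  farm13,centos77,general,xeon
81       production*  farm18,centos77,general,xeon
EOF
}

sample_sinfo | nodes_with_feature centos77   # prints 326
sample_sinfo | nodes_with_feature xeon       # prints 219

# On the farm itself (not run here; myjob.sh is a hypothetical job script):
#   sinfo --partition=production -o "%8D %12P %f" | nodes_with_feature centos77
#   sbatch --constraint=centos77 myjob.sh
```

The same constraint can also be placed inside the job script as an "#SBATCH --constraint=centos77" line. The feature names available at any given time are whatever sinfo reports, so check them before relying on a particular one.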