[Moller] [Jlab-scicomp-briefs] JLab HPC, farm, storage outage - services are being restored
Sandy Philpott
philpott at jlab.org
Wed Feb 24 17:06:50 EST 2016
All,
We have begun restoring systems and services following the Lustre outage. In the end, we did not upgrade our Lustre 2.5.3 system to Intel Lustre during this outage as planned, because the Intel upgrade process was not successful in our Lustre testbed. In short, the required CentOS 6.5 to 6.7 upgrade and the Intel Lustre install need additional work before we can upgrade our production systems.
We will continue to work with Intel to finalize the process, and plan to reschedule the upgrade for a later date.
Regards,
Sandy
----- Original Message -----
From: "Sandy Philpott" <philpott at jlab.org>
To: jlab-scicomp-briefs at jlab.org
Sent: Wednesday, February 24, 2016 11:03:32 AM
Subject: Re: JLab HPC, farm, storage outage Tue Feb 23 (and Feb 24 if needed)
Hello Users,
An update on our Intel Lustre software upgrade...
Intel is still working in our testbed to get its Lustre software installed in our environment. Once the issues with the CentOS 6.5 to 6.7 update and the Lustre upgrade are resolved in our testbed, we can begin the upgrade of our 26 Lustre production servers.
During the quiescence of production Lustre services, we are performing a full backup of the Lustre metadata and extended attributes; that process is at 24 hours and still running.
We will know more later today about the timeline for restoring services, once the issues with the Intel Lustre install are resolved and we are able to upgrade our production systems. At this point, it looks probable that the upgrade will spill over into tomorrow.
Regards,
Sandy
----- Original Message -----
From: "Sandy Philpott" <philpott at jlab.org>
To: jlab-scicomp-briefs at jlab.org
Sent: Wednesday, February 17, 2016 9:40:51 AM
Subject: JLab HPC, farm, storage outage Tue Feb 23 (and Feb 24 if needed)
Dear JLab HPC, Farm, and MSS Users,
We are planning updates to our Lustre 2.5.3 filesystem on Tuesday, Feb 23, which will impact the HPC and batch farm clusters and their storage; the Lustre-based /mss, /cache, /work, and /volatile filesystems will be unavailable. File services may be restored at the end of the day Tuesday, or, if more time is needed, we may continue through Wednesday close-of-business. This Lustre update is part of our continuing effort to improve the stability of the Lustre and ZFS fileservers, which have been intermittently problematic since our upgrade from Lustre 1.8 started last May.
Upon restoring Lustre filesystem services, our plans are to
- add the 2 newest 2015 fileservers into production, adding 0.5 PB to our existing 1.2 PB /lustre filesystem, then later decommission the oldest 2011 servers
- implement a Lustre "work" pool for the farm, to better handle small random files, with each 12-disk set configured as 3 mirrors of 4 striped disks
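For a rough sense of the numbers in the two items above, here is a minimal sketch of the arithmetic. It assumes one particular reading of the "work" pool layout (three mirrored copies, each a 4-disk stripe, so a 12-disk set holds 4 disks' worth of data); the announcement does not confirm that reading, and the function names are hypothetical.

```python
# Figures taken from the announcement; the pool-layout interpretation
# below is an assumption, not a confirmed configuration.
EXISTING_PB = 1.2   # current /lustre capacity
ADDED_PB = 0.5      # capacity of the 2 newest 2015 fileservers


def total_capacity_pb(existing_pb: float, added_pb: float) -> float:
    """Capacity once the new fileservers join, before any decommissioning."""
    return existing_pb + added_pb


def usable_disks(disks_per_set: int = 12, mirror_copies: int = 3) -> int:
    """Data-bearing disks per set, ASSUMING each set is split into
    `mirror_copies` identical stripes that mirror one another."""
    if disks_per_set % mirror_copies != 0:
        raise ValueError("disks must divide evenly among mirror copies")
    return disks_per_set // mirror_copies


if __name__ == "__main__":
    print(f"Total /lustre capacity: {total_capacity_pb(EXISTING_PB, ADDED_PB):.1f} PB")
    print(f"Usable disks per 12-disk set (assumed layout): {usable_disks()}")
```

Under that assumed layout, two thirds of each set's raw capacity goes to redundancy. Because small random reads can be served from any copy, that trade may suit a pool intended for small random files.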
Please plan accordingly for this scheduled one or two day upgrade of the Lustre filesystem next week.
Regards,
Sandy
Sandra C. Philpott, Operations Manager
High Performance and Scientific Computing
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave. Ste 3 Newport News, VA 23606
757-269-7152 http://www.jlab.org/hpc
_______________________________________________
Jlab-scicomp-briefs mailing list
Jlab-scicomp-briefs at jlab.org
https://mailman.jlab.org/mailman/listinfo/jlab-scicomp-briefs