[Halld-offline] Fwd: farm16 nodes into production
Mark Ito
marki at jlab.org
Mon Nov 14 11:05:38 EST 2016
from Sandy Philpott...
-------- Forwarded Message --------
Subject: farm16 nodes into production
Date: Mon, 14 Nov 2016 10:57:34 -0500 (EST)
From: Sandy Philpott <philpott at jlab.org>
To: Jens-Ole Hansen <ole at jlab.org>, Harut Avakian <avakian at jlab.org>,
Brad Sawatzky <brads at jlab.org>, Mark Ito <marki at jlab.org>
CC: scicomp <scicomp at jlab.org>
Hi All,
To follow up on the CentOS 7 farm16 Broadwell batch nodes' configuration ...
They now have a new Lustre client, even newer than the Lustre servers, to address the kernel "BUG:" issue that was occurring at first. They've also had their local disk rebuilt with ext4 for /scratch rather than xfs, since the system and jobs now share this one disk and experienced performance issues with xfs. They are configured to run one job per node to avoid further local disk I/O contention.
Several users have reported successful running on these farm16 nodes, and the nodes are now ready for production services.
Regards,
Sandy
----- Original Message -----
From: "Sandy Philpott" <philpott at jlab.org>
To: "Jens-Ole Hansen" <ole at jlab.org>, "Harut Avakian" <avakian at jlab.org>, "Brad Sawatzky" <brads at jlab.org>, "Mark Ito" <marki at jlab.org>
Cc: "farm" <farm at jlab.org>
Sent: Tuesday, November 1, 2016 3:43:54 PM
Subject: CentOS7 bug on farm16
Hi Computing Coordinators,
I've mentioned it to a few of you, but not all ...
With the first set of test jobs on the CentOS 7.2 farm16 nodes, the nodes are rebooting with
[247133.600902] BUG: unable to handle kernel paging request at 00007ffc1ca80a20
We are looking for more troubleshooting info, so if you have any small sets of test jobs to submit that might help shed light on when/why this happens, please do. Meanwhile, only 1 or 2 users have tried running on these systems so far...
We hope to have an understanding and solution this week ...
Regards,
Sandy