[Halld] Fw: Failed drive in gluonraid3

Eugene Chudakov gen at jlab.org
Fri Apr 17 10:33:02 EDT 2020


If anyone has valuable data on gluonraid3 that has not been copied elsewhere, let Sergey Furletov know before Monday.

The device has to be reconfigured; see the mail from Paul Letta below. Sergey will give Paul the green light.

Eugene

________________________________
From: Sergey Furletov <furletov at jlab.org>
Sent: Friday, April 17, 2020 10:25 AM
To: David Lawrence <davidl at jlab.org>
Cc: Paul Letta <letta at jlab.org>; Eugene Chudakov <gen at jlab.org>
Subject: Re: Failed drive in gluonraid3

Hi All,

Since we are going to wipe /data1 anyway, we are not in a hurry to replace the disk in it.
And because the other disks are also at risk, I would suggest first asking colleagues to check whether there is any other data on these disks.

Sergey


On Apr 17, 2020, at 10:11, David Lawrence <davidl at jlab.org> wrote:

Hi All,

  All of the raw data has been copied to tape, so there should be no danger of permanent loss.
I’ll leave the decision on re-wiring to Sergey and Eugene.

Regards,
-David

-------------------------------------------------------------
David Lawrence Ph.D.
Staff Scientist, Thomas Jefferson National Accelerator Facility
Newport News, VA
davidl at jlab.org
(757) 269-5567 W
(757) 746-6697 C


On Apr 17, 2020, at 9:50 AM, Paul Letta <letta at jlab.org> wrote:

Hi All.
The disks used in gluonraid3 have been discontinued. I have a compatible replacement; however, the RAID card is not accepting it due to block size differences. I do believe I can get this drive into use, but I will have to destroy and recreate the /data1 partition on gluonraid3 to do it.

Also, gluonraid3 has a legacy issue that this might be a good time to correct. Recall that I had to change the SAS cabling in gluonraid4/5 so that each of the two RAID cards would see only its own drives (i.e., half the drives are on RAID card 1 and the other half on RAID card 2). That was not done on gluonraid3. It is needed because having two RAID cards in the same machine that both see all the drives can cause issues, and LSI does not recommend it. That's why, after I learned this, I rewired gluonraid4/5.

I think I can preserve the files in /data2,3,4 through this, but maybe not. So can we ensure everything on gluonraid3 is written to tape so that I can do this?
/data1 needs to be wiped regardless to get the replacement drive into it.
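
(Editorial aside, not part of the original mail: the "block size differences" above are usually a 512-byte vs. 4K sector mismatch, which can be checked from the OS before touching the controller. A minimal sketch, assuming the replacement drive shows up as the hypothetical /dev/sdX, smartmontools is installed, and MegaCli64 is on the PATH:

    # Show the new drive's logical/physical sector sizes (512n vs 512e vs 4Kn).
    smartctl -i /dev/sdX | grep -i 'sector size'

    # Compare with what the controller reports for the drives already in the
    # array; recent MegaCli builds print a "Sector Size" field per drive.
    MegaCli64 -PDList -aALL | grep -i 'sector size' | sort | uniq -c

If the replacement reports 4096-byte sectors while the existing members report 512, the controller will typically refuse to mix them in one drive group.)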

Oh, and gluonraid5 failed a drive yesterday as well. It's still under warranty, so a replacement drive will be on the way for it.

Paul


From: Eugene Chudakov <gen at jlab.org>
Sent: Friday, April 3, 2020 3:50 PM
To: Paul Letta <letta at jlab.org>; Sergey Furletov <furletov at jlab.org>; David Lawrence <davidl at jlab.org>
Cc: Andy Kowalski <kowalski at jlab.org>; Kelvin Edwards <kelvin at jlab.org>
Subject: Re: Failed drive in gluonraid3

Hi Paul,

Please use DOPS.PHALLD

Thanks,
Eugene

________________________________
From: Paul Letta <letta at jlab.org>
Sent: Friday, April 3, 2020 3:47 PM
To: Sergey Furletov <furletov at jlab.org>; David Lawrence <davidl at jlab.org>; Eugene Chudakov <gen at jlab.org>
Cc: Andy Kowalski <kowalski at jlab.org>; Kelvin Edwards <kelvin at jlab.org>
Subject: Failed drive in gluonraid3

Gluonraid3 does have a degraded array due to a failed drive. I do not think gluonraid3 is under warranty anymore.
I'll need to get the serial number off the chassis and call the vendor to be sure. I'll do that the next time I'm onsite, which may be 1-2 weeks from now.

Or I can just order a replacement through webstock; it's an 8 TB drive at $377. Give me the charge code and I'll order it.

It's the /data1 area on gluonraid3. All the files have been copied to tape, so there is no risk of data loss.
Just do not use /data1 until it's fixed.

I also found and fixed the error that prevented me from getting emails about this from the RAID card. I've checked gluonraid1-5 and made sure email alerts are working.
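
(Editorial aside: MegaRAID cards do not send mail themselves; alerts typically come either from MegaRAID Storage Manager or from a cron job that polls the controller. A minimal sketch of the polling approach, assuming MegaCli64 is on the PATH, a working local mailer, and a hypothetical admin address:

    #!/bin/sh
    # Mail the admin if any virtual drive reports a state other than Optimal.
    BAD=$(MegaCli64 -LDInfo -Lall -aALL | grep '^State' | grep -v 'Optimal')
    if [ -n "$BAD" ]; then
        echo "$BAD" | mail -s "RAID not Optimal on $(hostname)" admin@example.org
    fi

Run from cron every few minutes; it stays silent while all arrays are Optimal.)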

Paul


-----Original Message-----
From: Sergey Furletov <furletov at jlab.org>
Sent: Friday, April 3, 2020 12:50 PM
To: Paul Letta <letta at jlab.org>; David Lawrence <davidl at jlab.org>
Cc: Eugene Chudakov <gen at jlab.org>
Subject: Fwd: Re: Audible Alarm


Hi Paul and David,

gluonraid3 /data1 is in a degraded state:

----------
Virtual Drive: 2 (Target Id: 2)
Name                :data1
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 65.491 TB
Sector Size         : 512
Is VD emulated      : Yes
Parity Size         : 7.276 TB
State               : Degraded
Strip Size          : 256 KB
Number Of Drives    : 10
----------
PD: 3 Information
Enclosure Device ID: 44
Slot Number: 4
Drive's position: DiskGroup: 3, Span: 0, Arm: 3
Enclosure position: 2
Device Id: 12
WWN: 5000CCA23BB79157
Sequence Number: 5
Media Error Count: 1
Other Error Count: 11
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
-------------------
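
(Editorial note: output in this format is typically produced by LSI's MegaCli utility; the two excerpts above look like abbreviated output from commands along the lines of:

    MegaCli64 -LDInfo -Lall -aALL   # virtual-drive summary (the "Virtual Drive" block)
    MegaCli64 -PDList -aALL         # per-physical-drive details (the "PD" block)

Judging from the PD block, the suspect drive is the one at Enclosure Device ID 44, Slot Number 4.)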

David, how can we verify that all the data has been copied to tape?
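
(Editorial aside: the lab's tape system presumably has its own catalog tools, but a generic check is to compare checksums of the on-disk files against a manifest recorded when the files were written to tape. A minimal sketch, assuming such a manifest exists as "md5sum  path" lines in a hypothetical file tape_manifest.md5:

    # Re-hash the on-disk copies; prints only mismatches or unreadable files.
    cd /data1 && md5sum -c --quiet /path/to/tape_manifest.md5

An exit status of 0 with no output means every file listed in the manifest still matches its recorded checksum.)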

Sergey

-------- Forwarded Message --------
Subject:         Re: Audible Alarm
Date:    Fri, 3 Apr 2020 12:12:23 -0400
From:    Timothy Whitlatch <whitey at jlab.org>
To:      Hovanes Egiyan <hovanes at jlab.org>, Sergey Furletov <furletov at jlab.org>
CC:      Walt Akers <akers at jlab.org>, Eugene Chudakov <gen at jlab.org>



Thanks, Walt. I am forwarding this to people who may know.

Looks like gluonraid3.

Hovanes or Sergey, please see the link below. Is there anything that should be done?

Thanks,

Tim Whitlatch

Hall D Engineer

Jefferson Lab

757-269-5087


------------------------------------------------------------------------
From: Walt Akers <akers at jlab.org>
Sent: Friday, April 3, 2020 11:46 AM
To: Timothy Whitlatch <whitey at jlab.org>
Subject: Audible Alarm
Hey Tim:

We found what appears to be a disk array alarming in the Hall D Counting House computer enclosure.
Take a look at today's daily report and let me know if there is anything that we need to do.

https://physdiv.jlab.org/2020MedConOps/ConditionReports/20200403-ConditionReport.pdf

Walt

