[Halld-offline] [New Logentry] Follow-up Re: Follow-up Re: Corrupted data
davidl at jlab.org
Thu Jan 12 14:55:01 EST 2017
Logentry Text:
--
Upon further investigation, Sergey found that the issue was due to the software RAID0 configuration used for partitions 3 and 4. As a reminder, we had:
/gluonraid3/data1 zfs
/gluonraid3/data2 zfs
/gluonraid3/data3 software RAID0 + xfs
/gluonraid3/data4 software RAID0 + xfs
The configuration of the RAID partitions used disk names like "/dev/sdaa", "/dev/sdbc", ... It turns out that these names are not always assigned to the same physical disk when the system is rebooted. There is a directory, /dev/disk/by-vdev, that contains symbolic links with names like "slot1", "slot2", ... which point to the device file assigned to the disk in that physical slot at boot time. When we configured the RAID partitions, we used the "/dev/sdXX" names (recorded from a previous boot) instead of the stable "/dev/disk/by-vdev/slotX" names. This caused both zfs and xfs to try to use the same physical disk, leading to the corruption.
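To illustrate the mechanism, here is a minimal sketch of how such by-vdev symlinks resolve to the kernel's current device names. The directory layout, slot number, and device name below are mocked in a temporary directory purely for illustration; they are assumptions, not the actual gluonraid3 configuration.

```shell
# Mock the /dev/disk/by-vdev layout in a temp dir (illustrative only;
# the real links are created at boot from the controller's slot mapping).
tmp=$(mktemp -d)
mkdir -p "$tmp/dev/disk/by-vdev"
touch "$tmp/dev/sdaa"                       # stands in for a block device node
ln -s ../../sdaa "$tmp/dev/disk/by-vdev/slot1"

# readlink -f follows the symlink to whatever device currently backs slot1.
# After a reboot the link target may change, but the slot1 name does not,
# which is why RAID members should be specified via the by-vdev names.
resolved=$(readlink -f "$tmp/dev/disk/by-vdev/slot1")
echo "$resolved"

rm -rf "$tmp"
```

Had the arrays been assembled from the slotN links rather than the archived /dev/sdXX names, a reboot that reshuffled the kernel device names would not have pointed zfs and the RAID at the same physical disk.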
This was discovered after rebooting gluonraid3 today, which again shuffled the device names. As a consequence, the data on both partition 3 and partition 4 is now compromised (and probably on the first two zfs partitions as well). No data was lost, since everything important was copied to tape back in December. However, it means that the Fall 2016 data is no longer available in the counting house.
---
This is a plain text email for clients that cannot display HTML. The full logentry can be found online at https://logbooks.jlab.org/entry/3450335
More information about the Halld-offline mailing list