[Jlab-scicomp-briefs] JLab lost files from hardware failure

Fri Jan 22 20:23:57 EST 2016

Dear JLab Users,

On Monday, January 18, a hardware controller failure on one of our four problematic 2014 fileservers corrupted a storage target containing just under 30 TB of data and millions of files, and the data could not be recovered. On restart with a new RAID controller, the system reported corrupt ZFS pool metadata, making its data permanently inaccessible.

The disk server is one of 24 Lustre fileservers, so files from /cache (file copy on tape), /volatile (large scratch), and /work (user maintained, no auto backups) were affected. The data lost comprised about 3% of the total 1 PB of data on disk. (Note that no raw data from CEBAF experiments was lost.)

Users attempting to access lost files see "Cannot send after transport endpoint shutdown" errors or files whose information is shown as ???; such files can be deleted with the unlink command. We are systematically walking through the corrupt target to unlink all of its files and generating a complete list of the files unlinked by the sweep. Upon completion, the list of lost files will be placed at /site/scicomp/lostfiles.txt for user inspection.
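
Once the list is available, users may want to check it for files under their own directories. The following is a minimal sketch in Python, not an official tool. It assumes the list contains one absolute path per line (the final format has not been announced), and the function name find_lost_files is hypothetical.

```python
from typing import Iterable, Iterator


def find_lost_files(list_path: str, prefixes: Iterable[str]) -> Iterator[str]:
    """Yield paths from the lost-file list that fall under any given directory.

    Assumption: the list holds one absolute path per line.
    """
    # Add a trailing slash so that "/work/abc" does not also match "/work/abcd/...".
    normalized = tuple(p.rstrip("/") + "/" for p in prefixes)
    with open(list_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            path = line.rstrip("\n")
            # str.startswith accepts a tuple and matches any of its members.
            if path.startswith(normalized):
                yield path
```

For example, calling find_lost_files("/site/scicomp/lostfiles.txt", ["/work/mygroup"]) and printing each result would show the lost files under /work/mygroup, where /work/mygroup is a placeholder for your own directory. The file is read line by line, so a list with millions of entries does not need to fit in memory.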

We are in continued contact with the vendor to find and resolve the ongoing issues with these fileservers. The systems have already received BIOS and disk controller firmware updates, and we are cautiously optimistic that their stability will improve as a result.

Regards,
Sandy