[Halld-offline] Fwd: IBM Tape Library Issue

Tue Dec 15 13:31:34 EST 2020

from Bryan Hess...

-------- Forwarded Message --------
Subject: 	IBM Tape Library Issue
Date: 	Tue, 15 Dec 2020 12:20:20 -0500
From: 	Bryan Hess <bhess at jlab.org>
To: 	Mark Ito <marki at jlab.org>, Nathan Baltzell <baltzell at jlab.org>, 
Brad Sawatzky <brads at jlab.org>, Ole Hansen <ole at jlab.org>, Robert 
Edwards <edwards at jlab.org>
CC: 	Graham Heyes <heyes at jlab.org>, Andy Kowalski <kowalski at jlab.org>, 
Jie Chen <chen at jlab.org>, David Lawrence <davidl at jlab.org>

I wanted to update you about a tape library issue that we are working on. 
During the last run we switched to writing LTO8 media, which gave us an 
increase in both capacity and throughput. Now, reading back those tapes, 
we are seeing a higher-than-usual I/O error rate. We have an open case 
with IBM, and yesterday I escalated it for a second time, so that it is 
now a "Critical Situation" in IBM's parlance. This gets us support from 
the tape drive engineers at the Factory.

The problem has typically been observed in CLAS12 and GlueX workflows, 
where a batch of farm jobs stalls because the tape is unavailable. Tapes 
are automatically disabled by Jasmine when they have unrecoverable I/O 
errors, so a tape with an unreadable segment will cause files after that 
position on the tape to queue until the tape is put through a manual 
recovery/verify procedure.

Current evidence suggests that we may have received a batch of 
problematic tapes. The good news is that since we use IBM media, IBM is 
on the hook to fix the problem under our support agreement, whether it 
is a tape or a drive problem.

Until we have full confidence that the issue is resolved, we are taking 
some measures to curb the impact:

  * We are initiating a verify of files that may have been written to
    problematic media.
  * We are falling back to writing M8 media while IBM does analysis on
    the read-back errors.
  * We are investigating the possibility of writing temporary duplicates
    of all files as a hedge. This will require some coding to implement.
  * IBM is replacing a significant number of the LTO8 tape drives, based
    on error accounting, to rule out a bad drive as the cause.

I expect that the trajectory of this plan will change as we receive more 
information from IBM, but I wanted to give you a snapshot of where we 
are rather than merely updating ServiceNow tickets with "we have an 
open case with IBM".

Please let me know if you have any questions.

--Bryan