[Hps] [EXTERNAL] Fwd: major SLAC network and system service disruption June 11 9P - June 12 2A
Omar Moreno
email at omarmoreno.net
Fri Jun 12 11:44:53 EDT 2020
FYI.
---------- Forwarded message ---------
From: Adesanya, Adeyemi <
0000007cceebf6ba-dmarc-request at listserv.slac.stanford.edu>
Date: Fri, Jun 12, 2020 at 6:48 AM
Subject: major SLAC network and system service disruption June 11 9P - June
12 2A
To: opsproblist at glast2.stanford.edu <opsproblist at glast2.stanford.edu>,
hep_computing_tf <hep_computing_tf at slac.stanford.edu>, White, Greg <
greg at slac.stanford.edu>, Perazzo, Amedeo <perazzo at slac.stanford.edu>
Cc: Foster, Mark <fosterm at slac.stanford.edu>, Purcell, Kevin <
kpurcell at slac.stanford.edu>, Nakata, Lance <lnakata at slac.stanford.edu>
Summary: On Thursday, June 11, there was a major disruption of a portion of the
SLAC network and some system services. This disruption included limited or
sporadic access to Domain Name System (DNS), and timeout or lack of access
to the AFS file system servers. Many other services were also impacted, as
there is heavy reliance upon both DNS and AFS. Services began to recover
approximately 2:10AM Friday, June 12.
Main impact period: 9:00P Thu 6/11 – 2:10A Fri 6/12
Initial Assessment: During scheduled networking changes (CHG0033649), an
anomaly developed in a portion of key network equipment that caused
intermittent response to some DNS lookups. The change related to the
anomaly was reverted; however, problems with some services persisted,
including an impact to critical AFS servers. It took several hours to
diagnose the nature and scope of the impacted services and to establish a
configuration workaround that restored service.
Findings: The scheduled changes did not go as tested and planned. The
changes were intended to reduce dependencies on aging network hardware.
Part way through the changes, a part of the system stopped passing some
traffic, but continued to process other traffic as expected. Multiple
groups were required to diagnose and resolve the problem. The effort was
complicated by some diagnosis requiring personnel onsite. The requirement
for onsite presence was partly due to how the systems failed, making remote
access impossible for key support individuals.
Actions: Once the initial anomalies were observed, alerts were sent to
several key support folks (via Slack and Skype) to assist in determining
impact scope. After the planned changes were reverted and the service
impacts persisted, escalations were made to a broader set of support teams,
and the full network engineering team began extensive investigation and
troubleshooting. Diagnostic input from the teams was used to narrow the
focus of the problem. This focus led to the discovery of intermittent
behavior in one router component that was not properly handling specific
types of DNS services. A
configuration change was made to work around this faulty behavior.
Follow-on actions will include collecting lessons learned and identifying
steps that can be taken to reduce the chance of a similar incident, as well
as to reduce the response and recovery time. Special focus will be placed on
ensuring that support can be delivered remotely, even when some key services
are impacted.
While many people helped, I’d like to particularly recognize Lance Nakata
for traveling to SLAC to provide onsite help in diagnosis and recovery
steps.
Please let me know if you have any questions.
————
Yemi
650-224-1874