[HydraTeam] Fw: disk IO issues on epscidb-a
Thomas Britton
tbritton at jlab.org
Wed Jan 24 09:03:53 EST 2024
Thomas Britton
Staff Scientist in Scientific Computing
Jefferson Lab
x7624
________________________________
From: Marty Wise <wise at jlab.org>
Sent: Wednesday, January 24, 2024 8:36 AM
To: Thomas Britton <tbritton at jlab.org>
Cc: Kelvin Edwards <kelvin at jlab.org>; Paul Letta <letta at jlab.org>; Myung Bang <bangdm at jlab.org>
Subject: disk IO issues on epscidb-a
Thomas,
I may have found the source of the disk IO issues I’ve seen on epscidb-a. The ganglia graphs for it over the last couple of months show a large amount of IOWait (usually colored orange/salmon on the graphs). We generally see little or none of this, so it is definitely unusual and could produce a variety of symptoms on the system.
Over the past several months, we have begun deploying a cyber security tool (required by the feds) called CrowdStrike/Falcon. For most Linux systems, I believe it is deployed only in “monitoring” mode, i.e. it’s not configured to block anything or take any action, just to alert if something is wrong. I noticed a lot of CrowdStrike processes running on the system, so I turned it off and waited overnight to see how the disk IO/IOWait status changed.
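For anyone who wants to cross-check the IOWait figure directly on the host rather than from the graphs, here is a minimal sketch. It assumes the standard Linux /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal, ...); the function names are made up for illustration and are not part of ganglia or any other tool mentioned here.

```python
def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat-style text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two /proc/stat snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already counted inside user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # iowait is the fifth counter (index 4).
    return 100.0 * deltas[4] / total
```

To use it, read /proc/stat twice a few seconds apart and pass both snapshots to iowait_percent; the result is an average over that interval, so it will not match a long-term graph exactly.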
[Inline image: image001.png]
As you can see, Wait state stats dropped dramatically at about the time I disabled CrowdStrike. Something else may have happened around that time, so this isn’t conclusive yet, but it looks very suspicious. Are you aware of any significant change late yesterday afternoon that might account for the difference?
So, I am continuing to monitor the system. Please let me know if you experience any problems or notice anything unusual.
BTW – we are meeting with the CrowdStrike reps soon and will discuss this apparent problem with them. I should say that it’s entirely possible that this issue is related to some misconfiguration or system peculiarity and not a problem with CrowdStrike itself (I have not noticed similar issues elsewhere), so hopefully we can work out the problem and re-enable CrowdStrike on the system. But I will leave it disabled for now.
Marty Wise
JLab CST/CNI
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 119552 bytes
Desc: image001.png
URL: <https://mailman.jlab.org/pipermail/hydrateam/attachments/20240124/57b9374a/attachment-0001.png>