[Ace] Network timeouts being reported from various applications -- feels ominous.
Adam Carpenter
adamc at jlab.org
Wed Sep 15 10:03:01 EDT 2021
Anthony and I cleaned up some issues in opsbat3 around the time of the last email flurry. If these problems have since resolved, those opsbat3 issues may have been the source.
________________________________
From: Brad Cumbia <cumbia at jlab.org>
Sent: Tuesday, September 14, 2021 5:00:11 PM
To: Gary Croke <gcroke at jlab.org>; Theo Larrieu <theo at jlab.org>; Scott Higgins <higgins at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.
Has anyone noticed or stumbled upon any more of these timeouts, or have they resolved themselves? I have been periodically checking the Elog but have not seen any significant entries.
--
Brad Cumbia
Accelerator Network Engineer
Accelerator SysAdmin Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Avenue
Newport News, Virginia 23606
Phone (757)269-5839
From: Gary Croke <gcroke at jlab.org>
Date: Tuesday, September 7, 2021 at 3:56 PM
To: Theo Larrieu <theo at jlab.org>, Brad Cumbia <cumbia at jlab.org>, Scott Higgins <higgins at jlab.org>, Adam Carpenter <adamc at jlab.org>, Michele Joyce <erb at jlab.org>, Omar Garza <garza at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.
Not sure if this is related, but iocnl13diag went unresponsive around 13:47 today and had to be rebooted. Here are some messages from the console log:
<Sep/07 13:46:19>vxwrks at iocnl13diag -> dbCa:exceptionCallback stat "Virtual circuit unresponsive" channel "unknown" context "opsbat3.acc.jlab.org:34694"
<Sep/07 13:46:19> nativeType DBR_invalid requestType DBR_invalid nativeCount 0 requestCount 0 noReadAccess noWriteAccess
<Sep/07 13:46:31>DB CA Link Exception: "Virtual circuit unresponsive", context "iocin1b.acc.jlab.org:5064"
<Sep/07 13:46:33>DB CA Link Exception: "Virtual circuit unresponsive", context "opsbat3.acc.jlab.org:5064"
<Sep/07 13:47:46>CAS: TCP send to "129.57.236.210:43608" failed because "errno = 0x43"
<Sep/07 13:47:46>rsrv: socket shutdown error was errno = 0x43
<Sep/07 13:51:02>NTPTimeSync: NTP requests failing - errno = 0x710003
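
For anyone collecting these across several IOC console logs, here is a minimal Python sketch that pulls the timestamp and peer out of each "Virtual circuit unresponsive" line. It assumes the log format shown above; the function names and the regular expression are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Sample lines copied from the console log above.
CONSOLE_LOG = """\
<Sep/07 13:46:19>vxwrks at iocnl13diag -> dbCa:exceptionCallback stat "Virtual circuit unresponsive" channel "unknown" context "opsbat3.acc.jlab.org:34694"
<Sep/07 13:46:31>DB CA Link Exception: "Virtual circuit unresponsive", context "iocin1b.acc.jlab.org:5064"
<Sep/07 13:46:33>DB CA Link Exception: "Virtual circuit unresponsive", context "opsbat3.acc.jlab.org:5064"
<Sep/07 13:47:46>CAS: TCP send to "129.57.236.210:43608" failed because "errno = 0x43"
"""

# Matches the timestamp prefix and the peer named in the context field
# of a "Virtual circuit unresponsive" message (assumed format).
PATTERN = re.compile(
    r'^<(?P<stamp>[^>]+)>.*"Virtual circuit unresponsive".*'
    r'context "(?P<host>[^":]+):(?P<port>\d+)"'
)


def unresponsive_peers(log_text):
    """Return (timestamp, host, port) for each unresponsive-circuit line."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            events.append((match["stamp"], match["host"], int(match["port"])))
    return events


def count_by_host(events):
    """Count how many events name each peer host."""
    return Counter(host for _, host, _ in events)


if __name__ == "__main__":
    for host, count in count_by_host(unresponsive_peers(CONSOLE_LOG)).most_common():
        print(host, count)
```

Counting by host makes it easier to see whether one peer keeps showing up across different IOCs.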
________________________________
From: Theo Larrieu <theo at jlab.org>
Sent: Tuesday, September 7, 2021 1:47 PM
To: Brad Cumbia <cumbia at jlab.org>; Scott Higgins <higgins at jlab.org>; Gary Croke <gcroke at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: RE: Network timeouts being reported from various applications -- feels ominous.
The timeouts being reported via cavitybypass are wider than just iocnl4.
This morning they happened for iocnl2:
https://logbooks.jlab.org/entry/3901900
Over the weekend the timeouts happened for both iocnl4 and iocsl4:
https://logbooks.jlab.org/entry/3900887
https://logbooks.jlab.org/entry/3900743
From: Brad Cumbia <cumbia at jlab.org>
Sent: Tuesday, September 7, 2021 12:12 PM
To: Scott Higgins <higgins at jlab.org>; Gary Croke <gcroke at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: Network timeouts being reported from various applications -- feels ominous.
Just to start a list of the reported problems, here is what I could gather from the emails.
iochallmag - myrestore caput issue
iocnl4 - cavity bypass failure
Hall B line - unknown fopt problem
NL05 rf - myrestore issue - new zone, new rf type hardware, embedded IOCs with soft IOC.
I don't see any network-related errors, issues, or unusual utilization on any of the associated network ports or on the network as a whole. I will continue to keep an eye on things.
Please reply if there is anything I may have missed.
From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Scott Higgins <higgins at jlab.org>
Date: Tuesday, September 7, 2021 at 11:24 AM
To: Gary Croke <gcroke at jlab.org>, Adam Carpenter <adamc at jlab.org>, Michele Joyce <erb at jlab.org>, Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.
Had an issue on Friday, Sept. 3, where I rebooted iocsoftnl05rf for a change and myRestore restored from Aug 19th instead of from Sept 3, right before the reboot. This prevented OPS from restoring beam.
Worked through the issue with Theo and found that the archiver had stopped archiving some EPICS PVs that were part of the active Restore Set for iocsoftnl05rf. These were random signals from the restore set with no noticeable pattern. The archiver has an algorithm where, if it thinks a PV has gone away after some period of time, it stops archiving it. However, these PVs never went away. This was the first time I have seen this happen.
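
To make the failure mode concrete, here is a rough Python sketch of a time-based "PV has gone away" rule and of a check that lists restore-set PVs no longer being archived. The thread does not describe how the real archiver decides this, so the rule, the threshold, the function names, and the PV names are all hypothetical.

```python
from datetime import datetime, timedelta


def should_stop_archiving(last_update, now, give_up_after):
    """Hypothetical rule: treat a PV as gone once it has been silent too long."""
    return now - last_update > give_up_after


def unarchived_restore_pvs(restore_set, archived_pvs):
    """Return restore-set PVs that are not currently being archived, sorted."""
    return sorted(set(restore_set) - set(archived_pvs))


if __name__ == "__main__":
    # Placeholder PV names, not real signals.
    restore_set = ["PV:A", "PV:B", "PV:C"]
    archived = ["PV:A", "PV:C"]
    print(unarchived_restore_pvs(restore_set, archived))
```

Running a comparison like the second function before a planned reboot would flag restore-set signals that have quietly dropped out of the archive.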
________________________________
From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Sent: Tuesday, September 7, 2021 11:13 AM
To: Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.
Here's one from Friday that probably falls in this category. Failed on some caputs, although channels were available:
https://logbooks.jlab.org/entry/3900716
Another attempt shortly after succeeded:
https://logbooks.jlab.org/entry/3900725
________________________________
From: Adam Carpenter <adamc at jlab.org>
Sent: Tuesday, September 7, 2021 11:06 AM
To: Michele Joyce <erb at jlab.org>; Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.
Omar,
Do you remember which IOCs these were? PC104s or something else? Was it part of a mass reboot or just a single IOC?
Thanks,
Adam
Adam Carpenter
Accelerator Operations Software Department
Thomas Jefferson National Accelerator Facility
________________________________
From: Ace <ace-bounces at jlab.org> on behalf of Michele Joyce <erb at jlab.org>
Sent: Tuesday, September 7, 2021 10:25 AM
To: Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Ace] Network timeouts being reported from various applications -- feels ominous.
Is anything machine dependent?
From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Date: Sunday, September 5, 2021 at 11:09 AM
To: Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.
What's the suspicion? Has there been a big uptick in network traffic recently, or have we gradually been approaching some limit? Or just some failing hardware somewhere?
________________________________
From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Omar Garza <garza at jlab.org>
Sent: Saturday, September 4, 2021 7:58 PM
To: Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.
This may also be related.
In the past weeks, we have also performed hard reboots on IOCs, with multiple tries failing to complete network connections.
It took 2-3 hard reboots to succeed.
Omar
________________________________
From: Ace <ace-bounces at jlab.org> on behalf of Theo Larrieu <theo at jlab.org>
Sent: Friday, September 3, 2021 5:27:32 PM
To: ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: [Ace] Network timeouts being reported from various applications -- feels ominous.
Observing that in the past 24 hours there have been several seemingly unrelated applications having trouble making connections to EPICS signals.
Cavity History
https://logbooks.jlab.org/entry/3900743
FOPT
https://logbooks.jlab.org/entry/3900317
myRestore
https://logbooks.jlab.org/entry/3900705
It might be wise to start looking for signs of distress in name server, softIOC hosts, network switches, etc.
-Theo
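
As one simple starting point for checking the hosts mentioned above, here is a minimal Python sketch that tests whether a TCP connection to a given host and port opens within a timeout. The function name and the default timeout are assumptions for illustration. A successful connect only shows that the port accepts connections, not that Channel Access is healthy on that host.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the console log quoted earlier in this thread.
    print(can_connect("opsbat3.acc.jlab.org", 5064))
```

Running this periodically from a few different client machines and logging the failures with timestamps could help line network-level trouble up against the logbook entries.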