[Ace] Network timeouts being reported from various applications -- feels ominous.

Brad Cumbia cumbia at jlab.org
Tue Sep 7 16:05:27 EDT 2021


129.57.236.210:43608 is opsmyafb07

--
Brad Cumbia

Accelerator Network Engineer
Accelerator SysAdmin Group

Thomas Jefferson National Accelerator Facility
12000 Jefferson Avenue
Newport News, Virginia 23606
Phone (757)269-5839

From: Gary Croke <gcroke at jlab.org>
Date: Tuesday, September 7, 2021 at 3:56 PM
To: Theo Larrieu <theo at jlab.org>, Brad Cumbia <cumbia at jlab.org>, Scott Higgins <higgins at jlab.org>, Adam Carpenter <adamc at jlab.org>, Michele Joyce <erb at jlab.org>, Omar Garza <garza at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.
Not sure if this is related, but iocnl13diag went unresponsive around 13:47 today and had to be rebooted. Here are some messages from the console log:

<Sep/07 13:46:19>vxwrks at iocnl13diag -> dbCa:exceptionCallback stat "Virtual circuit unresponsive" channel "unknown" context "opsbat3.acc.jlab.org:34694"
<Sep/07 13:46:19> nativeType DBR_invalid requestType DBR_invalid nativeCount 0 requestCount 0 noReadAccess noWriteAccess
<Sep/07 13:46:31>DB CA Link Exception: "Virtual circuit unresponsive", context "iocin1b.acc.jlab.org:5064"
<Sep/07 13:46:33>DB CA Link Exception: "Virtual circuit unresponsive", context "opsbat3.acc.jlab.org:5064"
<Sep/07 13:47:46>CAS: TCP send to "129.57.236.210:43608" failed because "errno = 0x43"
<Sep/07 13:47:46>rsrv: socket shutdown error was errno = 0x43
<Sep/07 13:51:02>NTPTimeSync: NTP requests failing - errno = 0x710003


________________________________
From: Theo Larrieu <theo at jlab.org>
Sent: Tuesday, September 7, 2021 1:47 PM
To: Brad Cumbia <cumbia at jlab.org>; Scott Higgins <higgins at jlab.org>; Gary Croke <gcroke at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: RE: Network timeouts being reported from various applications -- feels ominous.


The timeouts being reported via cavitybypass are wider than just iocnl4.

This morning there were timeouts to iocnl2:



https://logbooks.jlab.org/entry/3901900



Over the weekend, the timeouts happened for both iocnl4 and iocsl4:



https://logbooks.jlab.org/entry/3900887

https://logbooks.jlab.org/entry/3900743





From: Brad Cumbia <cumbia at jlab.org>
Sent: Tuesday, September 7, 2021 12:12 PM
To: Scott Higgins <higgins at jlab.org>; Gary Croke <gcroke at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: Network timeouts being reported from various applications -- feels ominous.



Just to start a list of the reported problems, here is what I could gather from the emails.

iochallmag - myrestore caput issue

iocnl4 - cavity bypass failure

Hall B line - unknown fopt problem

NL05 RF - myrestore issue - new zone with a new type of RF hardware: embedded IOCs with a soft IOC.



I don't see any network-related errors, issues, or utilization problems on any of the associated network ports or on the network as a whole. I will continue to keep an eye on things.



Please reply if there is anything I may have missed.



From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Scott Higgins <higgins at jlab.org>
Date: Tuesday, September 7, 2021 at 11:24 AM
To: Gary Croke <gcroke at jlab.org>, Adam Carpenter <adamc at jlab.org>, Michele Joyce <erb at jlab.org>, Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.

Had an issue on Friday, Sept. 3, where I rebooted iocsoftnl05rf for a change and myRestore restored from Aug 19th instead of from Sept 3, right before the reboot. This prevented OPS from restoring beam.

Worked through the issue with Theo and found that the archiver had stopped archiving some EPICS PVs that were part of the active Restore Set for iocsoftnl05rf. These were random signals from the restore set with no noticeable pattern. The archiver has an algorithm where, if it thinks a PV has gone away after some period of time, it stops archiving it. However, these PVs never went away. This was the first time I have seen this happen.





________________________________

From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Sent: Tuesday, September 7, 2021 11:13 AM
To: Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.



Here's one from Friday that probably falls in this category.  Failed on some caputs, although channels were available:

https://logbooks.jlab.org/entry/3900716



Another attempt shortly after succeeded:

https://logbooks.jlab.org/entry/3900725



________________________________

From: Adam Carpenter <adamc at jlab.org>
Sent: Tuesday, September 7, 2021 11:06 AM
To: Michele Joyce <erb at jlab.org>; Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.



Omar,



Do you remember which IOCs these were?  PC104s or something else?  Was it part of a mass reboot or just a single IOC?



Thanks,

Adam



Adam Carpenter
Accelerator Operations Software Department
Thomas Jefferson National Accelerator Facility

________________________________

From: Ace <ace-bounces at jlab.org> on behalf of Michele Joyce <erb at jlab.org>
Sent: Tuesday, September 7, 2021 10:25 AM
To: Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Ace] Network timeouts being reported from various applications -- feels ominous.



Is anything machine dependent?





From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Date: Sunday, September 5, 2021 at 11:09 AM
To: Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.

What's the suspicion?  Has there been a big uptick in network traffic recently, or have we gradually been approaching some limit?  Or just some failing hardware somewhere?



________________________________

From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Omar Garza <garza at jlab.org>
Sent: Saturday, September 4, 2021 7:58 PM
To: Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.



This may also be related.

In the past weeks, we have also performed hard reboots on IOCs, with multiple tries failing to complete network connections.

It took 2-3 hard reboots to succeed.

Omar




________________________________

From: Ace <ace-bounces at jlab.org> on behalf of Theo Larrieu <theo at jlab.org>
Sent: Friday, September 3, 2021 5:27:32 PM
To: ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: [Ace] Network timeouts being reported from various applications -- feels ominous.





In the past 24 hours, several seemingly unrelated applications have had trouble making connections to EPICS signals.



Cavity History

https://logbooks.jlab.org/entry/3900743



FOPT

https://logbooks.jlab.org/entry/3900317



myRestore

https://logbooks.jlab.org/entry/3900705



It might be wise to start looking for signs of distress in the name server, softIOC hosts, network switches, etc.



-Theo

