[Ace] Network timeouts being reported from various applications -- feels ominous.

Adam Carpenter adamc at jlab.org
Wed Sep 15 10:03:01 EDT 2021


Anthony and I cleaned up some issues on opsbat3 around the time of the last email flurry.  If these problems have since resolved, those issues may have been the source.

________________________________
From: Brad Cumbia <cumbia at jlab.org>
Sent: Tuesday, September 14, 2021 5:00:11 PM
To: Gary Croke <gcroke at jlab.org>; Theo Larrieu <theo at jlab.org>; Scott Higgins <higgins at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.


Has anyone noticed or stumbled upon any more of these timeouts, or have they resolved themselves?  I have been periodically checking the Elog but have not seen any significant entries.



--

Brad Cumbia



Accelerator Network Engineer

Accelerator SysAdmin Group



Thomas Jefferson National Accelerator Facility

12000 Jefferson Avenue

Newport News, Virginia 23606

Phone (757)269-5839



From: Gary Croke <gcroke at jlab.org>
Date: Tuesday, September 7, 2021 at 3:56 PM
To: Theo Larrieu <theo at jlab.org>, Brad Cumbia <cumbia at jlab.org>, Scott Higgins <higgins at jlab.org>, Adam Carpenter <adamc at jlab.org>, Michele Joyce <erb at jlab.org>, Omar Garza <garza at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.

Not sure if this is related, but iocnl13diag went unresponsive around 13:47 today and had to be rebooted. Here are some messages from the console log:



<Sep/07 13:46:19>vxwrks at iocnl13diag -> dbCa:exceptionCallback stat "Virtual circuit unresponsive" channel "unknown" context "opsbat3.acc.jlab.org:34694"

<Sep/07 13:46:19> nativeType DBR_invalid requestType DBR_invalid nativeCount 0 requestCount 0 noReadAccess noWriteAccess

<Sep/07 13:46:31>DB CA Link Exception: "Virtual circuit unresponsive", context "iocin1b.acc.jlab.org:5064"

<Sep/07 13:46:33>DB CA Link Exception: "Virtual circuit unresponsive", context "opsbat3.acc.jlab.org:5064"

<Sep/07 13:47:46>CAS: TCP send to "129.57.236.210:43608" failed because "errno = 0x43"

<Sep/07 13:47:46>rsrv: socket shutdown error was errno = 0x43

<Sep/07 13:51:02>NTPTimeSync: NTP requests failing - errno = 0x710003





________________________________

From: Theo Larrieu <theo at jlab.org>
Sent: Tuesday, September 7, 2021 1:47 PM
To: Brad Cumbia <cumbia at jlab.org>; Scott Higgins <higgins at jlab.org>; Gary Croke <gcroke at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: RE: Network timeouts being reported from various applications -- feels ominous.



The timeouts being reported via cavitybypass are wider than just iocnl4.



This morning, to iocnl2:



https://logbooks.jlab.org/entry/3901900



Over the weekend, the timeouts happened for both iocnl4 and iocsl4:



https://logbooks.jlab.org/entry/3900887

https://logbooks.jlab.org/entry/3900743





From: Brad Cumbia <cumbia at jlab.org>
Sent: Tuesday, September 7, 2021 12:12 PM
To: Scott Higgins <higgins at jlab.org>; Gary Croke <gcroke at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: Network timeouts being reported from various applications -- feels ominous.



Just to start a list of the reported problems, here is what I could gather from the emails.

iochallmag - myrestore caput issue

iocnl4 - cavity bypass failure

Hall B line - unknown FOPT problem

NL05 rf - myrestore issue - new zone, new RF-type hardware, embedded IOCs with soft IOC.



I don't see any network-related errors, issues, or utilization problems on any of the associated network ports or on the network as a whole.  I will continue to keep an eye on things.



Please reply if there is anything I may have missed.




From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Scott Higgins <higgins at jlab.org>
Date: Tuesday, September 7, 2021 at 11:24 AM
To: Gary Croke <gcroke at jlab.org>, Adam Carpenter <adamc at jlab.org>, Michele Joyce <erb at jlab.org>, Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.

I had an issue on Friday, Sept. 3, where I rebooted iocsoftnl05rf for a change and myRestore restored from Aug 19th instead of from Sept 3, right before the reboot. This prevented OPS from restoring beam.

I worked through the issue with Theo and found that the archiver had stopped archiving some EPICS PVs that were part of the active Restore Set for iocsoftnl05rf. These were random signals from the restore set with no noticeable pattern. The archiver has an algorithm where, if it thinks a PV has gone away, it stops archiving that PV after some period of time. However, these PVs never went away. This was the first time I have seen this happen.





________________________________

From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Sent: Tuesday, September 7, 2021 11:13 AM
To: Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.



Here's one from Friday that probably falls in this category.  It failed on some caputs, although the channels were available:

https://logbooks.jlab.org/entry/3900716



Another attempt shortly after succeeded:

https://logbooks.jlab.org/entry/3900725



________________________________

From: Adam Carpenter <adamc at jlab.org>
Sent: Tuesday, September 7, 2021 11:06 AM
To: Michele Joyce <erb at jlab.org>; Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: Network timeouts being reported from various applications -- feels ominous.



Omar,



Do you remember which IOCs these were?  PC104s or something else?  Was it part of a mass reboot or just a single IOC?



Thanks,

Adam



Adam Carpenter
Accelerator Operations Software Department
Thomas Jefferson National Accelerator Facility

________________________________

From: Ace <ace-bounces at jlab.org> on behalf of Michele Joyce <erb at jlab.org>
Sent: Tuesday, September 7, 2021 10:25 AM
To: Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Ace] Network timeouts being reported from various applications -- feels ominous.



Is anything machine dependent?





From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Date: Sunday, September 5, 2021 at 11:09 AM
To: Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org <ace at jlab.org>, controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.

What's the suspicion?  Has there been a big uptick in network traffic recently, or have we gradually been approaching some limit?  Or just some failing hardware somewhere?



________________________________

From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Omar Garza <garza at jlab.org>
Sent: Saturday, September 4, 2021 7:58 PM
To: Theo Larrieu <theo at jlab.org>; ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.



This may also be related.

In the past weeks, we have also performed hard reboots on IOCs, with multiple tries failing to complete network connections.

It took 2-3 hard reboots to succeed.

Omar




________________________________

From: Ace <ace-bounces at jlab.org> on behalf of Theo Larrieu <theo at jlab.org>
Sent: Friday, September 3, 2021 5:27:32 PM
To: ace at jlab.org <ace at jlab.org>; controls_dept at jlab.org <controls_dept at jlab.org>
Subject: [Ace] Network timeouts being reported from various applications -- feels ominous.





I have observed that in the past 24 hours several seemingly unrelated applications have had trouble making connections to EPICS signals.



Cavity History

https://logbooks.jlab.org/entry/3900743



FOPT

https://logbooks.jlab.org/entry/3900317



myRestore

https://logbooks.jlab.org/entry/3900705



It might be wise to start looking for signs of distress in the name server, softIOC hosts, network switches, etc.



-Theo

