[Ace] Network timeouts being reported from various applications -- feels ominous.

Theo Larrieu theo at jlab.org
Tue Sep 7 13:47:47 EDT 2021


The timeouts being reported via cavitybypass extend beyond iocnl4.

This morning there were timeouts to iocnl2:

https://logbooks.jlab.org/entry/3901900

Over the weekend, timeouts occurred for both iocnl4 and iocsl4:

https://logbooks.jlab.org/entry/3900887
https://logbooks.jlab.org/entry/3900743


From: Brad Cumbia <cumbia at jlab.org>
Sent: Tuesday, September 7, 2021 12:12 PM
To: Scott Higgins <higgins at jlab.org>; Gary Croke <gcroke at jlab.org>; Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: Network timeouts being reported from various applications -- feels ominous.

Just to start a list of the reported problems, here is what I could gather from the emails.
iochallmag - myrestore caput issue
iocnl4 - cavity bypass failure
Hall B line - unknown FOPT problem
NL05 rf - myrestore issue - new zone, new rf type hardware, embedded IOCs with soft IOC.

I don't see any network-related errors, issues, or utilization problems on any of the associated network ports or on the network as a whole.  I will continue to keep an eye on things.

Please reply if there is anything I may have missed.
--
Brad Cumbia

Accelerator Network Engineer
Accelerator SysAdmin Group

Thomas Jefferson National Accelerator Facility
12000 Jefferson Avenue
Newport News, Virginia 23606
Phone (757)269-5839

From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Scott Higgins <higgins at jlab.org>
Date: Tuesday, September 7, 2021 at 11:24 AM
To: Gary Croke <gcroke at jlab.org>, Adam Carpenter <adamc at jlab.org>, Michele Joyce <erb at jlab.org>, Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org, controls_dept at jlab.org
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.

I had an issue on Friday, Sept. 3, when I rebooted iocsoftnl05rf for a change and myRestore restored from Aug 19 instead
of from Sept 3, right before the reboot. This prevented OPS from restoring beam.
I worked through the issue with Theo and found that the archiver had stopped archiving some EPICS PVs that were part
of the active Restore Set for iocsoftnl05rf. These were random signals from the restore set with no noticeable pattern. The archiver
has an algorithm where, if it thinks a PV has gone away after some period of time, it stops archiving it. However, these PVs
never went away. This was the first time I have seen this happen.
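
To make the failure mode described above concrete, here is a minimal sketch of that kind of staleness check. It is not the archiver's actual code: the names ChannelState, deactivate_stale, and missing_from_archive, the PV names, and the threshold are all hypothetical, chosen only to show how a PV that misses updates for long enough could be dropped and then be absent when a restore set is assembled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ChannelState:
    """Hypothetical bookkeeping for one archived PV."""
    last_update: float  # time of the most recent archived sample, in seconds
    active: bool = True  # False once the archiver has given up on the PV


def deactivate_stale(channels: Dict[str, ChannelState],
                     now: float, stale_after: float) -> List[str]:
    """Mark PVs with no update for longer than stale_after as inactive.

    Returns the sorted names of the PVs dropped on this pass.
    """
    dropped = []
    for name, state in channels.items():
        if state.active and now - state.last_update > stale_after:
            state.active = False
            dropped.append(name)
    return sorted(dropped)


def missing_from_archive(restore_set: Iterable[str],
                         channels: Dict[str, ChannelState]) -> List[str]:
    """List restore-set PVs that are unknown to, or dropped by, the archiver."""
    return sorted(pv for pv in restore_set
                  if pv not in channels or not channels[pv].active)


if __name__ == "__main__":
    # Hypothetical PV names and times, for illustration only.
    channels = {
        "PV:ONE": ChannelState(last_update=0.0),
        "PV:TWO": ChannelState(last_update=90.0),
    }
    print(deactivate_stale(channels, now=100.0, stale_after=60.0))
    print(missing_from_archive(["PV:ONE", "PV:TWO"], channels))
```

A check like missing_from_archive, run against the active restore set before a planned reboot, is one way such a gap might be caught early. In a sketch like this, a PV whose updates fail to arrive (for example during connection timeouts) looks the same as one that has really gone away.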


________________________________
From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Sent: Tuesday, September 7, 2021 11:13 AM
To: Adam Carpenter <adamc at jlab.org>; Michele Joyce <erb at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.

Here's one from Friday that probably falls in this category.  Failed on some caputs, although channels were available:
https://logbooks.jlab.org/entry/3900716

Another attempt shortly after succeeded:
https://logbooks.jlab.org/entry/3900725

________________________________
From: Adam Carpenter <adamc at jlab.org>
Sent: Tuesday, September 7, 2021 11:06 AM
To: Michele Joyce <erb at jlab.org>; Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: Network timeouts being reported from various applications -- feels ominous.

Omar,

Do you remember which IOCs these were?  PC104s or something else?  Was it part of a mass reboot or just a single IOC?

Thanks,
Adam


Adam Carpenter
Accelerator Operations Software Department
Thomas Jefferson National Accelerator Facility
________________________________
From: Ace <ace-bounces at jlab.org> on behalf of Michele Joyce <erb at jlab.org>
Sent: Tuesday, September 7, 2021 10:25 AM
To: Gary Croke <gcroke at jlab.org>; Omar Garza <garza at jlab.org>; Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: [Ace] Network timeouts being reported from various applications -- feels ominous.


Is anything machine dependent?





From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Gary Croke <gcroke at jlab.org>
Date: Sunday, September 5, 2021 at 11:09 AM
To: Omar Garza <garza at jlab.org>, Theo Larrieu <theo at jlab.org>, ace at jlab.org, controls_dept at jlab.org
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.

What's the suspicion?  Has there been a big uptick in network traffic recently, or have we gradually been approaching some limit?  Or just some failing hardware somewhere?



________________________________

From: Controls_dept <controls_dept-bounces at jlab.org> on behalf of Omar Garza <garza at jlab.org>
Sent: Saturday, September 4, 2021 7:58 PM
To: Theo Larrieu <theo at jlab.org>; ace at jlab.org; controls_dept at jlab.org
Subject: Re: [Controls_dept] Network timeouts being reported from various applications -- feels ominous.



This may also be related.

In the past weeks, we have also performed hard reboots on IOCs, with multiple tries failing to complete network connections.

It took 2-3 hard reboots to succeed.

Omar




________________________________

From: Ace <ace-bounces at jlab.org> on behalf of Theo Larrieu <theo at jlab.org>
Sent: Friday, September 3, 2021 5:27:32 PM
To: ace at jlab.org; controls_dept at jlab.org
Subject: [Ace] Network timeouts being reported from various applications -- feels ominous.





I have observed that in the past 24 hours several seemingly unrelated applications have had trouble making connections to EPICS signals.



Cavity History

https://logbooks.jlab.org/entry/3900743



FOPT

https://logbooks.jlab.org/entry/3900317



myRestore

https://logbooks.jlab.org/entry/3900705



It might be wise to start looking for signs of distress in the name server, softIOC hosts, network switches, etc.



-Theo

