[Dsg-rich] [Rich] FPGA temp alarm

Wed Oct 17 13:41:35 EDT 2018

Hi All - I guess I dropped off the dsg-rich mailing list at some point, but Valery figured this out and got me sorted out.

Known causes of FPGA temperature fluctuations:
  1) Server in rich4 that reports values to EPICS is recompiled while the server is running (like changing a tire while driving on the highway) - we believe this caused the ~120C reported temperature that triggered the software interlock threshold on Monday
  2) RICH LV is cycled - can result in 0C temperature or large negative temperature reported
  3) CODA/DAQ is restarted - same symptom as (2)
  4) Fiber link to SSP goes down - same symptom as (2)
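
Symptoms (2) through (4) above all show up as a 0C or large negative value, so a reading can be screened for them before it is treated as a temperature. The following is a minimal sketch of such a screen, not the actual EPICS or server logic; the function name and the `min_plausible_c` bound are hypothetical. A bogus high value such as cause (1) cannot be told apart from real overheating by its value alone, so it passes through this check.

```python
import math


def is_valid_reading(temp_c: float, min_plausible_c: float = 1.0) -> bool:
    """Return False for readings that look like 'no data' rather than a temperature.

    min_plausible_c is an assumed lower bound for a powered FPGA die, chosen
    only so that 0 C and negative values are rejected.
    """
    if math.isnan(temp_c):
        return False
    return temp_c >= min_plausible_c


# 0 C and large negative values are screened out; high values pass through
# and must be judged by the interlock logic itself.
readings = [45.0, 0.0, -1.0e5, 120.0]
flags = [is_valid_reading(t) for t in readings]
```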

Known causes of the RICH FPGA fiber link to SSP going down:
  1) CODA/DAQ is restarted (the link will automatically come up if it was up before CODA/DAQ was restarted)
  2) Radiation-induced errors in RICH FPGA (requires a RICH LV power cycle to recover)

From the RICH DAQ operations point of view, the main issue is losing the fiber link to tiles due to radiation-induced FPGA upsets, which is expected to happen from time to time. It's not responsible for a big portion of CLAS12 DAQ downtime at all, so there are currently no plans to improve the firmware tolerance to this (though I would like to add SEU counters for us to monitor and get a better idea of how often this is happening). Aside from this, I'm only aware of this recent event where the FPGA temperature gave a bogus positive reading, tripping the HV/LV. We understand the cause, but I'm not sure if there are any plans to fix it or prevent it from happening again (it's tied to the DAQ code base, which gets compiled and updated sometimes).

So do we want to or need to change something?

Ben

~~~~~~~~~~~ 

From: "Tyler Lemon" <tlemon at jlab.org> 
To: "Chris Cuevas" <cuevas at jlab.org>, "Valery Kubarovsky" <vpk at jlab.org>, "Mirazita Marco" <Marco.Mirazita at lnf.infn.it>, rich at jlab.org, "dsg-rich" <dsg-rich at jlab.org> 
Sent: Wednesday, October 17, 2018 10:47:00 AM 
Subject: Re: [Dsg-rich] [Rich] FPGA temp alarm 

Hello, 

RICH's software interlocks are EPICS-based and were developed by Baltzell and Kubarovsky. The soft interlocks directly monitor the FPGA temperatures using the chip’s data readout. The hardware interlock system cannot monitor the FPGA internal temperature, even with some sort of probe. 

I presume the five-second rate for the FPGA temperature readout was set by Raydo during development. I also presume that the 10-second delay between temperature readings to trip soft interlocks was determined during the development of the soft interlock system by Baltzell and Kubarovsky. 
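
One way to read the readout rate and trip delay described above is that a trip requires more than one consecutive over-limit reading, so a spike lasting a single five-second measurement period would not trip the interlock. The sketch below illustrates that idea only; it is not the actual EPICS soft-interlock logic. The class name and `required_consecutive` are hypothetical, and the 75 deg threshold is the soft-interlock limit quoted elsewhere in this thread.

```python
class ConsecutiveOverLimit:
    """Report a trip only after several consecutive over-threshold readings."""

    def __init__(self, threshold_c: float = 75.0, required_consecutive: int = 2) -> None:
        self.threshold_c = threshold_c
        # Assumed value: two readings in a row, taken 5 s apart.
        self.required_consecutive = required_consecutive
        self._count = 0

    def update(self, temp_c: float) -> bool:
        """Feed one reading; return True if the interlock should trip."""
        if temp_c > self.threshold_c:
            self._count += 1
        else:
            # Any in-limit reading resets the streak.
            self._count = 0
        return self._count >= self.required_consecutive


detector = ConsecutiveOverLimit()
# A one-period spike to 120 is ignored; two over-limit readings in a row trip.
trips = [detector.update(t) for t in [50.0, 120.0, 50.0, 80.0, 81.0]]
```

The trade-off is that a real overtemperature is acted on one readout period later than it would be with a single-reading trip.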

The cRIO-based hardware interlock system developed by DSG monitors temperatures, humidity measurements, flows, and pressures using independent sensors. 

The CAEN power supply mainframe has its own internal interlocks that monitor HV for the MAPMTs and LV for the FPGAs. The CAEN monitors voltage and current outputs and can internally handle faults. 

We have seen recently that FPGA temperatures falsely jump to high values for five seconds before returning to a normal value. Why do the temperatures suddenly jump 50 deg C for one measurement period? I do not know, and nobody else seems to know why the temperature readings are spiking. 

We have also seen throughout the entirety of RICH operations that fibers suddenly disconnect. Sometimes the DAQ needs to be restarted to get them to reconnect, but sometimes they reconnect on their own after a few seconds. Why do the fibers disconnect? Again, I do not know, and nobody else seems to know why. 

Are there any concrete answers as to why the temperature readouts are suddenly jumping above the software interlock limit or as to why fibers keep disconnecting? 

Best regards, 

Tyler 

From: "Chris Cuevas" <cuevas at jlab.org> 
To: "Tyler Lemon" <tlemon at jlab.org> 
Cc: "Valery Kubarovsky" <vpk at jlab.org>, "Mirazita Marco" <Marco.Mirazita at lnf.infn.it>, rich at jlab.org, "dsg-rich" <dsg-rich at jlab.org> 
Sent: Tuesday, October 16, 2018 3:56:57 PM 
Subject: Re: [Rich] [Dsg-rich] FPGA temp alarm 

Hi, 

Software can stop for a variety of reasons, but good to know that there is a default watchdog timer on the cRIO. How much time elapses before software stoppage is detected and the relay opens? 

I am not concerned about a power off sequence because I know the CAEN mainframe and modules will latch OFF until new commands are issued. 

Thanks, 

-Chris 

~~~~~~~~~~~~~~~~~ 

On 10/16/2018 3:11 PM, Tyler Lemon wrote: 

If the software stops running, the relay goes into its default open state, disabling the CAEN. 
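
The fail-safe behavior described here can be modeled as a relay that stays closed only while the software keeps refreshing it. This is a minimal sketch under stated assumptions, not the cRIO implementation: the class name, the `kick` method, and the 2-second timeout are hypothetical, and the thread does not state the real timeout.

```python
from typing import Optional


class WatchdogRelay:
    """Relay that defaults to open and closes only while heartbeats keep arriving."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_kick_s: Optional[float] = None  # no heartbeat yet: relay is open

    def kick(self, now_s: float) -> None:
        """Record a heartbeat from the interlock program at time now_s."""
        self._last_kick_s = now_s

    def is_closed(self, now_s: float) -> bool:
        """Closed (CAEN enabled) only if the last heartbeat is recent enough."""
        if self._last_kick_s is None:
            return False
        return (now_s - self._last_kick_s) <= self.timeout_s


relay = WatchdogRelay(timeout_s=2.0)  # assumed timeout, for illustration only
before_start = relay.is_closed(0.0)   # software not running yet: open
relay.kick(0.0)
while_running = relay.is_closed(1.5)  # heartbeat is fresh: closed
after_stall = relay.is_closed(2.5)    # heartbeat is stale: open again
```

Times are passed in explicitly so the behavior can be checked without waiting on a real clock.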

The most likely cause of the cRIO's software stopping is a power outage. In the event of a power outage, there will be a sequence of events (compressor losing power, airflow/tank pressure dropping, interlock tripping) that will disable the CAEN. All components of the hardware interlock system are on UPS, so the hardware interlock components will stay alive long enough after the power outage to disable the CAEN. When the power is restored, the cRIO is set up to automatically run the interlock program on boot. On power restore, the CAEN will remain disabled until all normal operating conditions (airflow, buffer tank pressure, temperatures) are restored and the CAEN's interlocks are cleared. 

--Tyler 

From: "Chris Cuevas" <cuevas at jlab.org> 
To: "Tyler Lemon" <tlemon at jlab.org> 
Cc: "Valery Kubarovsky" <vpk at jlab.org> , "Mirazita Marco" <Marco.Mirazita at lnf.infn.it> , rich at jlab.org , "dsg-rich" <dsg-rich at jlab.org> 
Sent: Tuesday, October 16, 2018 2:53:02 PM 
Subject: Re: [Rich] [Dsg-rich] FPGA temp alarm 

Hi Tyler, 

So the cRIO software must always be running to detect out-of-limit signals from the front-end sensors? What happens if the software stops running? Is there a default timeout (watchdog) setting that will open the final relay contact? 

Thanks for the feedback, 

-Chris 

~~~~~~~~~~~ 

On 10/16/2018 1:52 PM, Tyler Lemon wrote: 

BQ_BEGIN

Chris, 

For the hardware interlocks, if any of the cRIO-based system's sensors (RTDs, humidity, airflow, air pressure, N2 flow; sensors not dependent on FPGA DAQ readout) go out of limit, the cRIO disables HV and LV by opening a relay contact to the Interlock port on the CAEN mainframe. 

This disables HV and LV and will only allow HV and LV to be re-enabled if the interlock is clear and a user has reset the interlock. 
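
The trip-and-reset behavior described above amounts to a latch: an out-of-limit sensor trips it, and it clears only when the sensors are back in limit and a user resets it. The sketch below models that behavior under stated assumptions; it is not the cRIO program, and the class, method, and property names are hypothetical.

```python
class LatchingInterlock:
    """Latch that trips on any out-of-limit sensor and needs a user reset to clear."""

    def __init__(self) -> None:
        self._in_limit = True
        self._latched = False

    def update(self, all_sensors_in_limit: bool) -> None:
        """Feed the latest combined sensor status."""
        self._in_limit = all_sensors_in_limit
        if not all_sensors_in_limit:
            self._latched = True  # stays set even if sensors later recover

    def user_reset(self) -> bool:
        """Attempt a reset; it succeeds only if the sensors are in limit."""
        if self._in_limit:
            self._latched = False
        return not self._latched

    @property
    def permit(self) -> bool:
        """True when HV and LV may be enabled (relay contact closed)."""
        return not self._latched


interlock = LatchingInterlock()
interlock.update(False)                       # a sensor goes out of limit
tripped = not interlock.permit                # HV and LV disabled
reset_while_bad = interlock.user_reset()      # refused: condition still present
interlock.update(True)                        # sensors recover
still_latched = not interlock.permit          # no automatic re-enable
reset_when_clear = interlock.user_reset()     # accepted
```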

Regards, 
Tyler 

From: "Chris Cuevas" <cuevas at jlab.org> 
To: "Valery Kubarovsky" <vpk at jlab.org> , "Mirazita Marco" <Marco.Mirazita at lnf.infn.it> 
Cc: rich at jlab.org , "dsg-rich" <dsg-rich at jlab.org> 
Sent: Tuesday, October 16, 2018 1:38:14 PM 
Subject: Re: [Rich] [Dsg-rich] FPGA temp alarm 

Hi All, 

The FPGAs have a firmware set point for the die temperature. Ben can verify this set point, but it sure sounds like the 'interlock' systems rely on software code. Are there watchdog timers set up in case the code stops running? 

For a true hardware interlock, one that does not depend on software, there should be temperature probes coupled directly to a comparator circuit which latches off the Low_Voltage outputs to the FPGA boards. From these email threads, it sounds like all the temperature devices are monitored with software, and if a software set point threshold is violated, then the Low_Voltage is turned OFF. If the monitored temperature devices go below the software set point threshold, is the Low_Voltage power reapplied automatically? 

Thanks, 

-Chris 

~~~~~~~~~~~~~~~~~ 
On 10/16/2018 9:40 AM, Valery Kubarovsky wrote: 

BQ_BEGIN

The detector (actually LV in this particular case, because the HV was off at that moment) 
was shut down by the software interlock that monitors the FPGA temperature. 

BQ_BEGIN
From: "Marco Mirazita" <Marco.Mirazita at lnf.infn.it> 
To: "Tyler Lemon" <tlemon at jlab.org> 
Cc: "Valery Kubarovsky" <vpk at jlab.org> , rich at jlab.org , "dsg-rich" <dsg-rich at jlab.org> 
Sent: Tuesday, October 16, 2018 9:37:41 AM 
Subject: Re: [Rich] [Dsg-rich] FPGA temp alarm 

BQ_BEGIN
But we have a soft interlock shutting down the system if fpga 
temperature goes above 75 deg. 

Il 2018-10-16 14:24 Tyler Lemon ha scritto: 
> Hello Marco, 
> 
> The hardware interlock system does not monitor FPGA temperatures. It 
> monitors the RTDs installed in the EP. 
> 
> The hardware interlock system did not trip off the electronics because 
> the strange temperature readings were only seen on the FPGA 
> temperature readouts. 
> 
> Best regards, 
> Tyler 
> 
> ------------------------- 
> 
> FROM: "Mirazita Marco" <Marco.Mirazita at lnf.infn.it> 
> TO: "Valery Kubarovsky" <vpk at jlab.org> 
> CC: rich at jlab.org , "dsg-rich" <dsg-rich at jlab.org> 
> SENT: Tuesday, October 16, 2018 8:16:06 AM 
> SUBJECT: Re: [Rich] [Dsg-rich] FPGA temp alarm 
> 
> Hi Valery, 
> thank you, in fact my main worry was that the interlock didn't shut down 
> the electronics. 
> It would be anyway good to understand why we had this event. 
> I saw from the strip charts that also the LV currents and voltages have 
> frequent random spikes. 
> Marco 
> 
> Il 2018-10-16 13:55 Valery Kubarovsky ha scritto: 
>> Marco, 
>> If you take a more careful look you will find out that the LV was shut 
>> down. 
>> The HV was off. We are investigating the reason. It is the first time we had 
>> such an event. 
>> Probably it was connected with the software update that was done 
>> approximately at the same time. 
>> Regards, 
>> Valery 
>> 
>> ------------------------- 
>> 
>>> FROM: "Marco Mirazita" <Marco.Mirazita at lnf.infn.it> 
>>> TO: dsg-rich at jlab.org , rich at jlab.org 
>>> SENT: Tuesday, October 16, 2018 4:03:24 AM 
>>> SUBJECT: [Dsg-rich] FPGA temp alarm 
>> 
>>> Hi all, 
>>> yesterday at about 4:20 pm jlab time I received several alarm messages 
>>> saying that the FPGA temperatures reached values around 115 deg. 
>>> Since it looked like a readout error, I checked the strip chart, where 
>>> there is actually a spike in the readout at that time, but to values of 
>>> the order of -10^5. See the attached plot. 
>>> So, it is clear that it was a readout error, but it is strange that the 
>>> alarm message and epics report different values. 
>>> Also, according to the logbook, the electronics was not shut down by the 
>>> interlocks, as should have happened if the temperature really went above 
>>> 100 deg. 
>>> So, perhaps the values reported in the alarm messages are not the 
>>> correct ones? 
>>> Marco 
>>> 
>>> _______________________________________________ 
>>> Dsg-rich mailing list 
>>> Dsg-rich at jlab.org 
>>> https://mailman.jlab.org/mailman/listinfo/dsg-rich 
>> 
> _______________________________________________ 
> Rich mailing list 
> Rich at jlab.org 
> https://mailman.jlab.org/mailman/listinfo/rich 

BQ_END

BQ_END

BQ_END

BQ_END
