[Dsg-halla_ecal] [Revised Logentry] ECal Controls Heartbeat Alarm added

Tue May 6 10:25:02 EDT 2025

Logentry Text:
--
I added a new ECal controls heartbeat alarm to the Alarm Handler. A week ago, a full 16-channel card of thermocouple readout channels stopped responding and all of its channels read out 0. In this case it was harmless, since the card reads crystal temperatures and is not part of the controls system. However, it made us wonder whether we are susceptible to a failure in which a number of control temperature channels stop reading back and stay stuck at some unknown temperature. In a worst-case scenario, the stuck temperatures all read well below the setpoints, so the heaters turn on full and do not turn off. Eventually the ambient temperature alarms would alert us, but by then a section of the detector could have overheated, damaging, for example, the glue holding the light guide in place.
I wrote a script that reads each of the 48 controls temperatures in EPICS twice, with a 1 second delay between the two readings. If a channel gives the same value both times, the ECAL_Controls_Heartbeat PV is incremented by 1. One or two false positives on each pass are to be expected, since the delay between samples is short and the temperatures are fairly stable. To reduce sensitivity to false positives, the script decrements ECAL_Controls_Heartbeat by 0.1 for every channel whose two readings differ, but ECAL_Controls_Heartbeat is not allowed to go below 0.
The script runs every 5 minutes, so if a full 16-channel card went unresponsive again, ECAL_Controls_Heartbeat would increase by 12.8 on each run (16 x 1 for the stuck channels minus 32 x 0.1 for the other channels). I have set ECAL_Controls_Heartbeat.HIGH = 35 and ECAL_Controls_Heartbeat.HIHI = 50, so the first yellow warning would sound about 15 minutes after a full card goes unresponsive, and a full HIHI alarm after 20-25 minutes. Attachment 1 shows the value of ECAL_Controls_Heartbeat over the past three days. The spike happened when Marc restarted the cRIO, so for a short period all channels were unresponsive.
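For reference, the update rule described above can be sketched in Python roughly as follows. This is not the actual script: the function name update_heartbeat is hypothetical, the EPICS reads are replaced by two plain lists of readings, and applying the floor at 0 once per pass (rather than after every channel) is an assumption.

```python
from typing import Sequence

STUCK_INCREMENT = 1.0  # added for each channel whose two readings are identical
LIVE_DECREMENT = 0.1   # subtracted for each channel whose two readings differ


def update_heartbeat(heartbeat: float,
                     first: Sequence[float],
                     second: Sequence[float]) -> float:
    """Return the heartbeat value after one pass over all channels.

    `first` and `second` hold the two readings of every channel, taken
    about 1 second apart. Hypothetical helper, not the deployed script.
    """
    if len(first) != len(second):
        raise ValueError("both passes must cover the same channels")
    stuck = sum(1 for a, b in zip(first, second) if a == b)
    live = len(first) - stuck
    new_value = heartbeat + stuck * STUCK_INCREMENT - live * LIVE_DECREMENT
    # Assumption: the floor at 0 is applied once, at the end of the pass.
    return max(new_value, 0.0)


if __name__ == "__main__":
    # 16 channels stuck at 0 plus 32 channels that changed between readings.
    first = [0.0] * 16 + [200.0 + i for i in range(32)]
    second = [0.0] * 16 + [200.5 + i for i in range(32)]
    print(round(update_heartbeat(0.0, first, second), 1))
```

With 16 stuck channels out of 48 and a starting value of 0, one pass raises the heartbeat by about 12.8, matching the figure quoted above.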
Note that if a single channel were to become unresponsive, this alarm would likely not be raised. In fact, fewer than 5 unresponsive channels would likely not bring the heartbeat to an alarm level. For 5 unresponsive channels, the heartbeat will go up by 0.7 each time the script runs, so it will take 1.5-2 hours to reach an alarm. We can tune the sensitivity to make it more or less responsive as necessary, but I don't want this alarm going off every time an EPICS IOC is busy.
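The sensitivity figures can be checked with a short Python estimate. It is only a sketch under stated assumptions: the function names and the false_positives parameter are hypothetical, the heartbeat is assumed to start at 0, and an alarm is assumed to trigger once the value reaches the limit. Integer arithmetic in units of 0.1 is used to avoid floating-point rounding.

```python
from typing import Optional

TOTAL_CHANNELS = 48   # controls temperature channels checked on each run
RUN_PERIOD_MIN = 5    # the script runs every 5 minutes
HIGH, HIHI = 35, 50   # alarm limits quoted in the entry


def net_change_tenths(stuck: int, false_positives: int = 0) -> int:
    """Net heartbeat change per run, in units of 0.1.

    `false_positives` (hypothetical parameter) counts live channels that
    happen to repeat a value and are therefore scored like stuck ones.
    """
    counted = stuck + false_positives
    return 10 * counted - (TOTAL_CHANNELS - counted)


def minutes_to_reach(limit: int, stuck: int,
                     false_positives: int = 0) -> Optional[int]:
    """Minutes until the heartbeat reaches `limit`, starting from 0.

    Returns None when the net change per run is not positive, i.e. the
    heartbeat never climbs for this number of stuck channels.
    """
    step = net_change_tenths(stuck, false_positives)
    if step <= 0:
        return None
    runs = -(-10 * limit // step)  # integer ceiling of (10 * limit) / step
    return runs * RUN_PERIOD_MIN


if __name__ == "__main__":
    for n in (4, 5, 16):
        print(n, minutes_to_reach(HIGH, n), minutes_to_reach(HIHI, n))
```

Under these assumptions a full 16-channel card reaches HIGH after 3 runs (15 minutes) and HIHI after 4 runs (20 minutes), while 4 or fewer stuck channels never climb without false positives. With no false positives, 5 stuck channels need 50 runs to reach HIGH. Each false positive adds 1.1 to the per-run change, so real times for a small number of stuck channels would be shorter than this idealized figure.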
I have set the manual mode control voltages to typical values for 210 degC crystal temperatures so that we can switch from PID to manual control if some channels become unresponsive. If a group of controls temperature channels becomes unresponsive, the proper response is:
1. Try rebooting the expansion chassis (see instructions https://sbs.jlab.org/cgi-bin/DocDB/private/ShowDocument?docid=690)
2. If that fails, switch to manual mode temporarily to keep things close to the desired temperatures until the issue can be resolved. The voltage outputs should be close to those shown in Attachment 2.

---

This is a plain text email for clients that cannot display HTML.  The full logentry can be found online at https://logbooks.jlab.org/entry/4370281