[Hallb-engineering] (no subject)

A.M. Sandorfi sandorfi at jlab.org
Tue Dec 20 11:28:23 EST 2011


Thanks Dave,
We'll see how it goes on Thursday when IT goes down.
Regarding the IBC, for the past two weeks it has been running very stably,
so much so that we are inclined to leave it where it is for the winter
break. That means, unless something dramatic happens in the next couple of
days, we won't be pulling it out of CLAS and rotating vertical on Thursday.
The 1 K pot has also been quite stable, so we'll delay changing the window
until the new year.
Thanks,...Andy



On 12/20/11 11:07 AM, "David Kashy" <kashy at jlab.org> wrote:

> Hi All,
> 
> To follow up on Krister's detailed explanation and give a little more info:
> 
> Typically when an IOC fails with the type of valves we use in the buffer
> dewar, the valves stay in the same position as they were right before
> the failure.
> 
> We set the PID values so that the minimum setting will keep the flow
> going for important things like liquid levels. We have also set the
> gains such that the valves move slowly and in small amounts, so even
> with a dead IOC the system can often run stably for many hours before
> alarming. Many of the on-call folks might remember the dewar level
> slowly falling for many hours, to say 50%, before the liquid-level
> alarm goes off. This is due to the things I describe above. The PID is
> not updating because the IOC is dead, but the valves are in such
> positions that the dewar is still getting liquid helium and staying at
> constant pressure, so the targets can continue to operate while the
> level slowly falls (or rises...).
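A minimal Python sketch of the behaviour described above, under stated assumptions: a slow PI loop whose output is clamped between a minimum and a maximum opening and is allowed to move only a small amount per update. The class name SlowPI and every numeric value are hypothetical and are not the real valve configuration. If the loop simply stops being called (a dead IOC), the output stays at its last value.

```python
from dataclasses import dataclass


@dataclass
class SlowPI:
    """Hypothetical clamped, rate-limited PI loop (illustration only)."""
    kp: float
    ki: float
    out_min: float   # minimum opening that still keeps flow going
    out_max: float
    max_step: float  # largest allowed change per update
    output: float    # current valve opening in percent
    integral: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        target = self.kp * error + self.ki * self.integral
        # Never ask for less than the minimum or more than the maximum.
        target = max(self.out_min, min(self.out_max, target))
        # Move toward the target in small steps only.
        step = max(-self.max_step, min(self.max_step, target - self.output))
        self.output += step
        return self.output


# Assumed numbers: level reads 90% against a 70% setpoint, valve starts at 48%.
valve = SlowPI(kp=2.0, ki=0.01, out_min=25.0, out_max=50.0,
               max_step=1.0, output=48.0)
first = valve.update(setpoint=70.0, measured=90.0, dt=1.0)
for _ in range(100):
    valve.update(setpoint=70.0, measured=90.0, dt=1.0)
frozen = valve.output  # no further update() calls: the opening stays here
```

The valve closes one percent per update and then rests on the minimum opening rather than shutting completely, which is the property that keeps liquid flowing when nothing is updating the loop.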
> 
> Dave
> 
> 
> 
> 
> On 12/19/2011 12:15 PM, Krister Bruhwel wrote:
>> Actually it's VxWorks we load. VME is the crate  ... sorry about that.
>> 
>> Krister Bruhwel wrote:
>>> Hello,
>>> 
>>> I would like to describe the network issue we had this weekend and the
>>> reason for not rebooting IOCs without doing a little investigation.
>>> 
>>> First the short version:
>>> 
>>>      * I got a dead iocptarg and ioctorus1 page.
>>>      * I didn't reboot the IOCs but checked on them.  The network was
>>>        down but not the IOCs.  They were functioning in the hall.
>>>      * I called the Counting House and told them not to reboot any
>>>        IOCs.  If the IOCs had been rebooted they would not have come
>>>        back up, HDice would not have filled, and it's very possible
>>>        that the HDice target would have failed, resulting in a long
>>>        recovery to make a new target (months as I understand it).  Not
>>>        good!  This event could have been very bad for the experiment.
>>>      * I called Brad at Accelerator support and he found a burned-up
>>>        Media Switch.  He had to replace the whole chassis and it took
>>>        about two or three hours.
>>>      * The network came back and business returned to normal.
>>> 
>>> Now the long version:
>>> 
>>> On Sunday morning I received the following pages:
>>> IOC iocptarg is dead - Val = 0.000  [Sun Dec 18 07:48:05 2011]
>>> IOC ioctorus1 is dead - Val = 0.000  [Sun Dec 18 07:48:05 2011]
>>> 
>>> So the Buffer Dewar and Torus IOCs are reported dead.  Now, what has
>>> sometimes been done in the past is rebooting the IOC straight away.  I have
>>> explained in our meetings not to go off and reboot IOCs as a first step in
>>> one's investigation of a dead IOC, and here I'll go into the details.
>>> 
>>> Let me give a basic explanation of how an IOC works, by analogy with a
>>> PC.  We reboot an IOC (just like one's desktop PC) and it finds its BIOS
>>> (Basic Input/Output System; you'll see this as the PC comes up before it
>>> loads Windows).  This BIOS lives on the electronics.  It's fundamental
>>> software, and its job is to find its hardware and become cognizant of its
>>> basic mission in life, which in this case is to find the network, go to an
>>> IP address and load a kernel (on a PC this is on your hard drive; for an
>>> IOC it's *downloaded* from another computer).  The kernel is the central
>>> part of the operating system and one could look at it as the Windows of the
>>> IOC, except this time we're loading VME.  Once we have VME loaded up, we
>>> load various EPICS database files that actually control the inputs and
>>> outputs and readouts for the Buffer Dewar or the Torus.  Now the IOC is up
>>> and running and we can look at the EPICS screens.
>>> 
>>> Sometimes an IOC will report itself dead and what that actually means is its
>>> network is down.
>>> 
>>> 
>>> 
>>>      clasrun@clon01:clasrun>ping ioctorus1
>>>      ioctorus1 is dead
>>> 
>>> This means that clon01 cannot verify the connection to ioctorus1.  It is
>>> dead from clon01's perspective because it doesn't have a network connection
>>> to it.  The paging system sees this and issues a page.  The IOC may in fact
>>> still be functioning in the hall and will remain functioning until we reboot
>>> it.  Once we reboot the IOC, it loads its BIOS and discovers its main purpose
>>> in life, which is to get a network connection and download the kernel.  But
>>> wait!  It stops right there because there isn't a network connection, and now
>>> the IOC really is dead.
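The reasoning above can be sketched as a small decision function. This is only an illustration in Python under stated assumptions: the function name and the returned strings are hypothetical, and the two inputs stand for the results of checking the IOC locally (for example over its serial console) and checking the network. It covers the cautious path for a critical IOC.

```python
def triage_dead_ioc_page(ioc_running_locally: bool, network_up: bool) -> str:
    """Suggest a next step after an 'IOC is dead' page (hypothetical helper)."""
    if ioc_running_locally and not network_up:
        # The IOC is doing its job; a reboot could not reload the kernel.
        return "do not reboot; fix the network"
    if not network_up:
        # The IOC is down, but a reboot would stall at the kernel download.
        return "fix the network before rebooting"
    if ioc_running_locally:
        # Both look healthy, so the page needs a closer look first.
        return "no reboot needed; investigate the page"
    return "reboot"


# The situation described in this message: IOCs alive, network dead.
sunday = triage_dead_ioc_page(ioc_running_locally=True, network_up=False)
```

The point of the ordering is that the network state is checked before any reboot is considered, because the boot sequence depends on it.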
>>> 
>>> If this had happened this Sunday, HDice's target would more than likely have
>>> flown South for the winter.  This is very important and the whole reason for
>>> this email.  It may be best to double check with someone if you don't know
>>> how to check an IOC.  If the IOC is not critical, go ahead and reboot, but
>>> this one is very important to HDice.
>>> 
>>> So, we do not want to reboot the IOCs this time!  Nope.  They were down in
>>> the hall running and doing their jobs independent of the network.  How did I
>>> know this?  I tsconnected to the IOC.  This is the same as taking a laptop
>>> down to the hall and plugging into the serial port.
>>> 
>>> 
>>> 
>>>      [bruhwel@kris1 ~]$ ssh clasrun@clon10
>>>      Password:
>>> 
>>>      vxwrks at iocptarg ->  dbgf "LL8210"
>>>      DBR_DOUBLE: 70.9
>>>      value = 0 = 0x0
>>> 
>>> So we now know the level of the Buffer Dewar.
>>> 
>>>      vxwrks at iocptarg ->  dbgf "iocptarg:heartbeat"
>>>      DBR_DOUBLE:         4.74208e+06
>>>      vxwrks at iocptarg ->  dbgf "iocptarg:heartbeat"
>>>      DBR_DOUBLE:         4.74211e+06
>>> 
>>> And now we know that the IOC has a heartbeat because its value has
>>> increased.  On reboot this will go to zero and increment from there.
>>> That's why it's called a heartbeat.
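The heartbeat check above amounts to reading the same record twice and confirming that the number went up. A minimal Python sketch, assuming the console output has been captured as text lines in the form shown in the session above; the function names are hypothetical.

```python
import re
from typing import Tuple

# Matches a value line such as "DBR_DOUBLE:         4.74208e+06".
DBGF_LINE = re.compile(r"^(DBR_[A-Z]+):\s*(.+?)\s*$")


def parse_dbgf(line: str) -> Tuple[str, str]:
    """Split a dbgf value line into (type name, value text)."""
    match = DBGF_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dbgf value line: {line!r}")
    return match.group(1), match.group(2)


def heartbeat_is_advancing(first_line: str, second_line: str) -> bool:
    """True if the second heartbeat reading is larger than the first."""
    first = float(parse_dbgf(first_line)[1])
    second = float(parse_dbgf(second_line)[1])
    return second > first


# The two readings quoted in the session above.
alive = heartbeat_is_advancing(
    "DBR_DOUBLE:         4.74208e+06",
    "DBR_DOUBLE:         4.74211e+06",
)
```

Because the console prints only six significant digits, two reads taken very close together can print the same number, so it is worth waiting a little between them before concluding anything.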
>>> 
>>> Because the network was down, the Buffer Dewar IOC was not receiving a
>>> very important network signal from HDice, hdice:level_bath.  Its last
>>> value stayed stuck at 68.9%, and thus the EV8215 valve remained at
>>> about 48% open, which maintained the level in HDice.  We got lucky
>>> there, so we monitored it and waited to see if any action might be
>>> needed.  If an action were needed due to the HDice level, I would put
>>> the valve in manual mode and adjust it from the IOC.
>>> 
>>>      vxwrks at iocptarg ->  dbgf "EV8215.MMOD"
>>>      DBR_STRING:          NO
>>>      value = 0 = 0x0
>>>      vxwrks at iocptarg ->  dbgf "EV8215.MVAL"
>>>      DBR_FLOAT:          50
>>>      value = 0 = 0x0
>>>      vxwrks at iocptarg ->  dbgf "EV8215.PMOD"
>>>      DBR_STRING:          Normal
>>>      value = 0 = 0x0
>>> 
>>> 
>>> That tells me EV8215 is in PID mode with a stored manual valve position
>>> (MVAL) of 50%.
>>> 
>>>      vxwrks at iocptarg ->  dbpf "EV8215.MVAL", "48"
>>>      DBR_FLOAT:          48
>>>      value = 0 = 0x0
>>>      vxwrks at iocptarg ->  dbpf "EV8215.MMOD", "1"
>>>      DBR_STRING:          YES
>>>      value = 0 = 0x0
>>>      vxwrks at iocptarg ->  dbgf "EV8215.PMOD"
>>>      DBR_STRING:          Manual
>>>      value = 0 = 0x0
>>> 
>>> Now the valve is in Manual Mode with a value of 48%.
>>> 
>>>      vxwrks at iocptarg ->  dbpf "EV8215.MMOD", "0"
>>>      DBR_STRING:          NO
>>>      value = 0 = 0x0
>>> 
>>>      vxwrks at iocptarg ->  dbgf "EV8215.PMOD"
>>>      DBR_STRING:          Normal
>>>      value = 0 = 0x0
>>> 
>>> Back to PID mode.
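The manual-mode session above follows a fixed order: set the manual position, switch to manual, then confirm the mode. A small Python sketch that only builds the command strings to type at the IOC console. The helper names are hypothetical; the field names MVAL, MMOD and PMOD are the ones used in the session. Writing MVAL before MMOD mirrors the session, presumably so the valve does not jump to a stale manual value when the mode changes (that reason is an assumption).

```python
from typing import List


def manual_mode_commands(record: str, position_pct: float) -> List[str]:
    """Console commands to hold a valve at a fixed opening (hypothetical helper)."""
    return [
        f'dbpf "{record}.MVAL", "{position_pct:g}"',  # set the position first
        f'dbpf "{record}.MMOD", "1"',                 # then switch to manual
        f'dbgf "{record}.PMOD"',                      # confirm it reads Manual
    ]


def pid_mode_commands(record: str) -> List[str]:
    """Console commands to hand the valve back to the PID loop."""
    return [
        f'dbpf "{record}.MMOD", "0"',
        f'dbgf "{record}.PMOD"',                      # confirm it reads Normal
    ]


to_manual = manual_mode_commands("EV8215", 48)
to_pid = pid_mode_commands("EV8215")
```

Generating the strings rather than sending them keeps a person in the loop, which seems appropriate for a valve this important.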
>>> 
>>> How do I know these field names?  Even with the white screens on the
>>> EPICS GUIs one can right click and do a PV Info, and this will tell
>>> you the names, or one could look up the database definitions via the
>>> EPICS documents
>>> <http://www.aps.anl.gov/epics/EpicsDocumentation/AppDevManuals/RecordRef/Recordref-1.html>.
>>> Here are some useful commands:
>>> 
>>>      *dbl* lists all the records.
>>>      *dbgrep "pattern"* prints the names of the records that match a
>>>      pattern ("*" is a wildcard), like this: dbgrep "EV82*"
>>>      *dbgf "record"* prints the value of that record.  If no field is
>>>      given, VAL is assumed.
>>>      *dbpf "record", "value"* assigns the value.  If no field is
>>>      given, VAL is assumed.
>>>      *dbpr "record", "interest level"* prints info about the record.
>>>      Examples below:
>>> 
>>>      Some info:
>>> 
>>>          vxwrks at iocptarg ->  dbpr "EV8215"
>>>          ASG: DEFAULT        CVAL: 71.7          DESC: pid record    DISA: 0
>>>          DISP: 0             DISV: 1             DM: 0               LOVL: 70
>>>          MVAL: 48            NAME: EV8215        ORBV: 49.51172      OUTV: 0
>>>          OVAL: 50.51172      SEVR: NO_ALARM      STAT: NO_ALARM      SVAL: 0
>>>          TPRO: 0             VAL: 70
>>>          value = 0 = 0x0
>>> 
>>> 
>>>      Lots of info:
>>> 
>>>          vxwrks at iocptarg ->  dbpr "EV8215", "1"
>>>          ACKS: NO_ALARM      ACKT: YES           ADEL: 0             ALST: 50.86572
>>>          ASG: DEFAULT        ASP: 0x00000000     BKPT: 0x00          CT: 284617364
>>>          CVAL: 72.1          CVL:DB_LINK EV8215W.VAL NPP NMS         D: 0
>>>          DERR: 1.199997      DESC: pid record    DISA: 0             DISP: 0
>>>          DISS: NO_ALARM      DISV: 1             DM: 0               DMAX: 1
>>>          DMIN: 0.1           DPVT: 0x00000000    DSET: 0x00000000    DT: 100
>>>          DTYP: (null)        EGU: PCT            ERR: -1.300003      EVNT: 0
>>>          FLNK:DB_LINK EV8215O.VAL                HHSV: NO_ALARM      HIGH: 0
>>>          HIHI: 0             HOPI: 110           HOPR: 0             HSV: NO_ALARM
>>>          HYST: 0             I: -0.780002        KD: 0               KI: 0.006
>>>          KP: 2               LALM: 70            LCNT: 0             LLSV: NO_ALARM
>>>          LMOD: Normal        LOC:CONSTANT 1      LOLO: 0             LOPI: -10
>>>          LOPR: 0             LOVL: 70            LOW: 0              LSET: 0x0f57aa80
>>>          LSV: NO_ALARM       MAX: 52             MDEL: 0             MDT: 100
>>>          MIN: 25             MLIS: 0x0f46e6940f46dc4400000013
>>>          MLOK: 0x0f58ccc000000000                MLST: 50.86572      MMOD: NO
>>>          MVAL: 48            NAME: EV8215        NSEV: NO_ALARM      NSTA: NO_ALARM
>>>          ODEL: 0             ODM: 1              OMOD: Change
>>>          ORBL:DB_LINK EV8215R NPP NMS            ORBV: 50.19531
>>>          OUT:CONSTANT 0.000000000000000e+00      OUTV: 0             OVAL: 51.19531
>>>          P: 2.399994         PACT: 0             PHAS: 0             PINI: NO
>>>          PMOD: Normal        PPN: 0x00000000     PPNN: 0x0f3cd828    PREC: 3
>>>          PRIO: LOW           PROC: 0             PUTF: 0             RDES: 0x0f5eb100
>>>          RPRO: 0             RSET: 0x0f5eae20    SCAN: Passive
>>>          SDIS:CONSTANT 0.000000000000000e+00     SEVR: NO_ALARM      SMOD: NO
>>>          SPVT: 0x00000000    STAT: NO_ALARM      SVAL: 0
>>>          TIME: 0x2950c4ba029df2e6                TPRO: 0             TSE: 0
>>>          TSEL:CONSTANT       UDF: 0              VAL: 70
>>>          value = 0 = 0x0
>>> 
>>>      The way a record name is composed is as follows:
>>> 
>>>          "myRecord" is the name of the database record.
>>>          "myRecord.VAL" is the present value of the record.  If no
>>>          extension is given to a record, VAL is assumed.
>>>          "myRecord.AnythingYouSeeFromDBPR": use the extension you want.
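The naming rule above (record name, optional ".FIELD", with VAL assumed when the field is left off) can be written as a few lines of Python. The function name split_pv is hypothetical; the record names used below are the ones that appear earlier in this message.

```python
from typing import Tuple


def split_pv(name: str, default_field: str = "VAL") -> Tuple[str, str]:
    """Split "record.FIELD" into (record, field), assuming VAL if no field."""
    record, dot, field = name.partition(".")
    if dot and field:
        return record, field
    return record, default_field


with_field = split_pv("EV8215.MMOD")
without_field = split_pv("LL8210")
```

Note that a colon, as in iocptarg:heartbeat, is part of the record name and is not a field separator; only the period is.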
>>> 
>>> Not everyone will want or know how to do this, but I hope the
>>> explanation was useful.  If this were to happen to you while on call
>>> and you don't feel comfortable with this, many people can check into
>>> this for you.  Stepan, Sergey, Nerses, myself and others that work
>>> with EPICS can double check an IOC before rebooting.
>>> 
>>> Regards,
>>> Krister
>>> 
>>> ------------------------------------------------------------------------
>>> 
>>> _______________________________________________
>>> Hallb-engineering mailing list
>>> Hallb-engineering at jlab.org
>>> https://mailman.jlab.org/mailman/listinfo/hallb-engineering
> 
> 




