<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
Hello,<br>
<br>
I would like to describe this weekend's network issue and explain why
IOCs should not be rebooted without a little investigation first.<br>
<br>
First the short version:<br>
<ul>
  <li>I received pages reporting iocptarg and ioctorus1 as dead.</li>
  <li>I didn't reboot the IOCs but checked on them instead. The network
was down, but the IOCs were not. They were functioning in the hall.<br>
  </li>
  <li>I called the Counting House and told them not to reboot any
IOCs. If the IOCs had been rebooted, they would not have come back up,
HDice would not have filled, and it is very possible that the HDice
target would have failed, resulting in a long recovery to make a new
target (months, as I understand it). Not good! This event could have
been very bad for the experiment.<br>
  </li>
  <li>I called Brad at Accelerator support and he found a burned-up
media switch. He had to replace the whole chassis, which took about
two or three hours.</li>
  <li>The network came back and business returned to normal.</li>
</ul>
Now the long version:<br>
<br>
On Sunday morning I received the following pages:<br>
<pre wrap="">IOC iocptarg is dead - Val = 0.000 [Sun Dec 18 07:48:05 2011]
IOC ioctorus1 is dead - Val = 0.000 [Sun Dec 18 07:48:05 2011]
</pre>
So the Buffer Dewar and Torus IOCs were reported dead. In the past, the
response has sometimes been to reboot the IOC straight away. I have
explained in our meetings that rebooting should not be the first step
in investigating a dead IOC, and here I'll go into the details.<br>
<br>
Let me give a basic explanation of how an IOC works, by way of an
analogy to a PC. When we reboot an IOC (just like one's desktop PC), it
starts its BIOS (Basic Input/Output System; you'll see this as a PC
comes up, before it loads Windows). The BIOS lives on the electronics.
It is fundamental software whose job is to find its hardware and become
cognizant of its basic mission in life, which in this case is to find
the network, go to an IP address, and load a kernel. (On a PC the kernel
is on your hard drive; for an IOC it is <i><b>downloaded</b></i> from
another computer.) The kernel is the central part of the operating
system. One could look at it as the Windows of the IOC, except that here
we are loading VxWorks onto the VME board. Once VxWorks is loaded, we
load the various EPICS database files that actually control the inputs,
outputs, and readouts for the Buffer Dewar or the Torus. Now the IOC is
up and running and we can look at the EPICS screens.<br>
<br>
Sometimes an IOC will be reported dead when what that actually means is
that its network is down.<br>
<blockquote>clasrun@clon01:clasrun>ping ioctorus1<br>
ioctorus1 is dead<br>
</blockquote>
This means that clon01 cannot verify its connection to ioctorus1. From
clon01's perspective the IOC is dead because there is no network
connection to it. The paging system sees this and issues a page. The IOC
may in fact still be functioning in the hall, and it will keep
functioning until we reboot it. Once we reboot the IOC, it loads its
BIOS and discovers its main purpose in life, which is to get a network
connection and download the kernel. But wait! It stops right there
because there is no network connection, and now the IOC really is dead.<br>
<br>
If this had happened this Sunday, HDice's target would more than likely
have flown south for the winter. This is very important and is the whole
reason for this email. If you don't know how to check an IOC, it is best
to double-check with someone first. If the IOC is not critical, go ahead
and reboot, but this one is very important to HDice.<br>
<br>
So we did not want to reboot the IOCs this time! Nope. They were down in
the hall, running and doing their jobs independent of the network. How
did I know this? I used tsconnect to connect to the IOC, which is the
same as taking a laptop down to the hall and plugging into the serial
port.<br>
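As a rough illustration of the boot sequence described above, here is a
small Python sketch. It is only a model: the stage names and the
needs_network flags are my own assumptions for illustration and are not
taken from any real IOC software.<br>
<pre wrap="">
```python
# Hypothetical model of the boot stages described in this email.
# Stage names and needs_network flags are illustrative assumptions.
BOOT_STAGES = [
    ("boot ROM finds the hardware", False),
    ("bring up the network at the configured IP address", True),
    ("download the kernel from another computer", True),
    # How the database files are fetched is not stated here; the flag does
    # not change the outcome because boot already stalls earlier.
    ("load the EPICS database files", False),
]


def simulate_boot(network_up):
    """Return (completed stage names, name of the stalled stage or None)."""
    completed = []
    for name, needs_network in BOOT_STAGES:
        if needs_network and not network_up:
            return completed, name  # boot stops here and never reaches EPICS
        completed.append(name)
    return completed, None
```
</pre>
Calling simulate_boot(False) returns only the first stage as completed
and reports the network stage as the stall point, which is the
situation a reboot would have put us in on Sunday.<br>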
<blockquote>[bruhwel@kris1 ~]$ ssh clasrun@clon10<br>
Password: <br>
<br>
vxwrks@iocptarg -> dbgf "LL8210"<br>
DBR_DOUBLE: 70.9 <br>
value = 0 = 0x0<br>
<br>
</blockquote>
So we now know the level of the Buffer Dewar.<br>
<blockquote>vxwrks@iocptarg -> dbgf "iocptarg:heartbeat"<br>
DBR_DOUBLE: 4.74208e+06 <br>
vxwrks@iocptarg -> dbgf "iocptarg:heartbeat"<br>
DBR_DOUBLE: 4.74211e+06 <br>
<br>
</blockquote>
And now we know that the IOC has a heartbeat, because the value
increased between the two reads. On reboot it goes to zero and
increments from there. That's why it's called a heartbeat.<br>
<br>
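The heartbeat check above can be sketched in Python. This is a minimal
sketch that assumes the dbgf output has been captured as text in the
form shown in this email; the function names are hypothetical and
nothing here talks to an IOC.<br>
<pre wrap="">
```python
import re

# Matches lines such as "DBR_DOUBLE:         4.74208e+06" from dbgf output.
DBGF_VALUE = re.compile(r"^DBR_[A-Z]+:\s+(\S+)")


def parse_dbgf_number(output):
    """Return the numeric value found in dbgf output text, or None."""
    for line in output.splitlines():
        match = DBGF_VALUE.match(line.strip())
        if match:
            try:
                return float(match.group(1))
            except ValueError:
                return None  # e.g. a DBR_STRING value such as "NO"
    return None


def heartbeat_is_alive(first_output, second_output):
    """True when the second heartbeat reading is larger than the first."""
    first = parse_dbgf_number(first_output)
    second = parse_dbgf_number(second_output)
    if first is None or second is None:
        return False
    return second > first
```
</pre>
With the two readings shown above (4.74208e+06, then 4.74211e+06) the
check returns True. Two identical readings return False, which would be
a reason to look more closely before deciding anything.<br>
<br>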
Because the network was down, the Buffer Dewar IOC was not receiving
hdice:level_bath, a very important signal that HDice sends over the
network. Its last value stayed stuck at 68.9%, so the EV8215 valve
remained at about 48% open, which maintained the level in HDice. We got
lucky there, so we monitored it and waited to see whether any action
would be needed. If action had been needed because of the HDice level,
I would have put the valve in manual mode and adjusted it from the
IOC.<br>
<br>
<blockquote>vxwrks@iocptarg -> dbgf "EV8215.MMOD"<br>
DBR_STRING: NO <br>
value = 0 = 0x0<br>
vxwrks@iocptarg -> dbgf "EV8215.MVAL"<br>
DBR_FLOAT: 50 <br>
value = 0 = 0x0<br>
vxwrks@iocptarg ->dbgf "EV8215.PMOD"<br>
DBR_STRING: Normal <br>
value = 0 = 0x0<br>
<br>
</blockquote>
<br>
That tells me EV8215 is in PID mode (MMOD is NO and PMOD is Normal) with
a stored manual valve position (MVAL) of 50%.
<blockquote>vxwrks@iocptarg -> dbpf "EV8215.MVAL", "48"<br>
DBR_FLOAT: 48 <br>
value = 0 = 0x0<br>
vxwrks@iocptarg -> dbpf "EV8215.MMOD", "1"<br>
DBR_STRING: YES <br>
value = 0 = 0x0<br>
vxwrks@iocptarg -> dbgf "EV8215.PMOD"<br>
DBR_STRING: Manual <br>
value = 0 = 0x0<br>
<br>
</blockquote>
Now the valve is in manual mode with a value of 48%.<br>
<blockquote>vxwrks@iocptarg -> dbpf "EV8215.MMOD", "0"<br>
DBR_STRING: NO <br>
value = 0 = 0x0<br>
<br>
vxwrks@iocptarg -> dbgf "EV8215.PMOD"<br>
DBR_STRING: Normal <br>
value = 0 = 0x0<br>
<br>
</blockquote>
Back to PID mode.<br>
<br>
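The order of operations above (set the manual position first, then
switch to manual mode, then confirm the mode) can be written down as a
small Python sketch. It only builds the command text in the form typed
in this email and does not send anything to an IOC. The helper names are
hypothetical; the field names MVAL, MMOD and PMOD are the ones shown
above.<br>
<pre wrap="">
```python
def dbpf(pv, value):
    """Format a dbpf command line in the form used in the session above."""
    return 'dbpf "%s", "%s"' % (pv, value)


def dbgf(pv):
    """Format a dbgf command line in the form used in the session above."""
    return 'dbgf "%s"' % pv


def manual_override_commands(record, position_pct):
    """Commands to store a manual position, then switch to manual mode."""
    return [
        dbpf(record + ".MVAL", position_pct),  # set the position first
        dbpf(record + ".MMOD", 1),             # then enable manual mode
        dbgf(record + ".PMOD"),                # confirm: expect "Manual"
    ]


def return_to_pid_commands(record):
    """Commands to leave manual mode and confirm PID mode is back."""
    return [
        dbpf(record + ".MMOD", 0),
        dbgf(record + ".PMOD"),                # confirm: expect "Normal"
    ]
```
</pre>
Setting MVAL before MMOD means the valve does not jump to a stale manual
position at the moment manual mode is enabled.<br>
<br>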
How do I know these field names? Even when the EPICS GUI screens are
white, one can right-click and choose PV Info, which will tell you
the names. One could also look up the database definitions in the <a
 href="http://www.aps.anl.gov/epics/EpicsDocumentation/AppDevManuals/RecordRef/Recordref-1.html">EPICS
documents</a>. Here are some useful commands:<br>
<br>
<blockquote><b>dbl</b> lists all the records.<br>
  <b>dbgrep "pattern"</b> prints the record names that match a pattern,
where * is a wildcard, like this: dbgrep "EV82*"<br>
  <b>dbgf "record"</b> prints the value of that record. If no field is
given, VAL is assumed.<br>
  <b>dbpf "record", "value"</b> assigns the value. If no field is
given, VAL is assumed.<br>
  <b>dbpr "record", "interest level"</b> prints info about the record;
a higher interest level prints more fields. Examples below:<br>
<br>
Some info:<br>
<blockquote>vxwrks@iocptarg -> dbpr "EV8215"<br>
ASG: DEFAULT CVAL: 71.7 DESC: pid record DISA:
0 <br>
DISP: 0 DISV: 1 DM: 0 LOVL:
70 <br>
MVAL: 48 NAME: EV8215 ORBV: 49.51172 OUTV:
0 <br>
OVAL: 50.51172 SEVR: NO_ALARM STAT: NO_ALARM SVAL:
0 <br>
TPRO: 0 VAL: 70 <br>
value = 0 = 0x0<br>
</blockquote>
<br>
Lots of info:<br>
<blockquote>vxwrks@iocptarg -> dbpr "EV8215", "1"<br>
ACKS: NO_ALARM ACKT: YES ADEL: 0 ALST:
50.86572 <br>
ASG: DEFAULT ASP: 0x00000000 BKPT: 0x00 CT:
284617364 <br>
CVAL: 72.1 CVL:DB_LINK EV8215W.VAL NPP NMS D:
0 <br>
DERR: 1.199997 DESC: pid record DISA: 0 DISP:
0 <br>
DISS: NO_ALARM DISV: 1 DM: 0 DMAX:
1 <br>
DMIN: 0.1 DPVT: 0x00000000 DSET: 0x00000000 DT:
100 <br>
DTYP: (null) EGU: PCT ERR: -1.300003 EVNT:
0 <br>
FLNK:DB_LINK EV8215O.VAL HHSV: NO_ALARM HIGH:
0 <br>
HIHI: 0 HOPI: 110 HOPR: 0 HSV:
NO_ALARM <br>
HYST: 0 I: -0.780002 KD: 0 KI:
0.006 <br>
KP: 2 LALM: 70 LCNT: 0 LLSV:
NO_ALARM <br>
LMOD: Normal LOC:CONSTANT 1 LOLO: 0 LOPI:
-10 <br>
LOPR: 0 LOVL: 70 LOW: 0 LSET:
0x0f57aa80 <br>
LSV: NO_ALARM MAX: 52 MDEL: 0 MDT:
100 <br>
MIN: 25 MLIS: 0x0f46e6940f46dc4400000013 <br>
MLOK: 0x0f58ccc000000000 MLST: 50.86572 MMOD:
NO <br>
MVAL: 48 NAME: EV8215 NSEV: NO_ALARM NSTA:
NO_ALARM <br>
ODEL: 0 ODM: 1 OMOD: Change <br>
ORBL:DB_LINK EV8215R NPP NMS ORBV: 50.19531 <br>
OUT:CONSTANT 0.000000000000000e+00 OUTV: 0 OVAL:
51.19531 <br>
P: 2.399994 PACT: 0 PHAS: 0 PINI:
NO <br>
PMOD: Normal PPN: 0x00000000 PPNN: 0x0f3cd828 PREC:
3 <br>
PRIO: LOW PROC: 0 PUTF: 0 RDES:
0x0f5eb100 <br>
RPRO: 0 RSET: 0x0f5eae20 SCAN: Passive <br>
SDIS:CONSTANT 0.000000000000000e+00 SEVR: NO_ALARM SMOD:
NO <br>
SPVT: 0x00000000 STAT: NO_ALARM SVAL: 0 <br>
TIME: 0x2950c4ba029df2e6 TPRO: 0 TSE:
0 <br>
TSEL:CONSTANT UDF: 0 VAL: 70 <br>
value = 0 = 0x0<br>
<br>
</blockquote>
The way a record is composed is as follows:<br>
  <blockquote>"myRecord" is the name of the database record.<br>
"myRecord.VAL" is the present value of the record. If no extension is
given to a record, VAL is assumed.<br>
"myRecord.AnythingYouSeeFromDBPR" uses whichever extension you want.<br>
</blockquote>
</blockquote>
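The naming rule above can be expressed as a short Python sketch. The
function name is hypothetical, and it assumes the record name itself
contains no dot, so the first dot separates the record from the field.<br>
<pre wrap="">
```python
def split_pv_name(pv):
    """Split "record.FIELD" into (record, field); field defaults to VAL."""
    record, dot, field = pv.partition(".")
    if dot and field:
        return record, field
    return record, "VAL"  # no extension given, so VAL is assumed
```
</pre>
For example, "EV8215.MMOD" splits into the record EV8215 and the field
MMOD, while "iocptarg:heartbeat" has no extension and so refers to the
VAL field.<br>
<br>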
Not everyone will want to do this or know how, but I hope the
explanation was useful. If this happens to you while you are on call
and you don't feel comfortable with it, many people can check into
it for you. Stepan, Sergey, Nerses, I, and others who work with
EPICS can double-check an IOC before it is rebooted.<br>
<br>
Regards,<br>
Krister<br>
</body>
</html>