[Lerftest-ctrls] RF CPU reboot & iocConsole problems
Sonya Hoobler
sonya at slac.stanford.edu
Fri Sep 21 18:41:09 EDT 2018
Thank you so much, Wesley!
Sonya
On Fri, 21 Sep 2018, Wesley Moore wrote:
> Resolved.
>
>
> I talked with Sonya, and she tested successfully once the tftp server and its config were restored. Like the sshd issue, the cause was another site admin being careless with a change they were making. Our puppet config for LCLS-II now watches for the ssh issue and can also reconfigure and restart the tftp server, should it ever be modified again.
>
>
> Wesley
>
> ________________________________
> From: Sonya Hoobler <sonya at slac.stanford.edu>
> Sent: Friday, September 21, 2018 1:41:59 PM
> To: Wesley Moore
> Cc: lerftest-ctrls at jlab.org; Curt Hovater; se_c at cox.net
> Subject: Re: [Lerftest-ctrls] RF CPU reboot & iocConsole problems
>
> Hi Wesley,
>
> (Re-sending. Rejected by mailing list due to size.)
>
> Well, the previous times this happened seemed to be caused by (according to
> my emails):
>
> -multiple dhcp servers on the network
>
> -tftp issue due to a puppet error
>
> Do you have any diagnostics on the tftp server side? Can you see if it's
> receiving a request?
>
> Boot failure messages:
>
> CLIENT MAC ADDR: 74 FE 48 28 BC 12 GUID: E0C325E5 39CB E511 A2F4 0108A6DAC621
> CLIENT IP: 129.57.244.52 MASK: 255.255.255.0 DHCP IP: 129.57.34.93
> GATEWAY IP: 129.57.244.1
> PXE-E32: TFTP open timeout
> PXE-M0F: Exiting Intel Boot Agent.
>
> Thanks,
> Sonya
>
>>
>>
>> On Fri, 21 Sep 2018, Wesley Moore wrote:
>>
>>> We had a minor ssh issue. Just troublesome to sort, since I couldn't ssh
>>> in to look around.
>>>
>>>
>>> Sounds like something is still quite wrong with lcls-llrfcpu02. What do we
>>> need to do to help troubleshoot it?
>>>
>>>
>>> Wesley
>>>
>>> ________________________________
>>> From: Sonya Hoobler <sonya at slac.stanford.edu>
>>> Sent: Friday, September 21, 2018 11:54:31 AM
>>> To: Wesley Moore
>>> Cc: lerftest-ctrls at jlab.org; Curt Hovater; se_c at cox.net
>>> Subject: Re: [Lerftest-ctrls] RF CPU reboot & iocConsole problems
>>>
>>> Hi again,
>>>
>>> I see that lcls-llrfcpu02 still does not boot--tftp timeout.
>>>
>>> I'll go back to hands-off.
>>>
>>> Sonya
>>>
>>>
>>>
>>> On Fri, 21 Sep 2018, Sonya Hoobler wrote:
>>>
>>>> Hi Wesley,
>>>>
>>>> I just logged in to look around and things seem improved.
>>>>
>>>> Was something done to address the problems?
>>>>
>>>> Thanks,
>>>> Sonya
>>>>
>>>>
>>>> On Thu, 20 Sep 2018, Sonya Hoobler wrote:
>>>>
>>>>> Hi Wesley,
>>>>>
>>>>> Thank you for the update and for following up.
>>>>>
>>>>> I won't do anything until hearing back from you.
>>>>>
>>>>> Sonya
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 20 Sep 2018, Wesley Moore wrote:
>>>>>
>>>>>> Sonya,
>>>>>>
>>>>>> Larry rebooted lcls-llrfcpu02. Said it powered back up, but isn't showing
>>>>>> any connectivity. Looks the same from my end.
>>>>>>
>>>>>> lclsfs - can't ssh, but pingable
>>>>>> lclsapp1 - seems fine
>>>>>> lclsapp2 - can't ssh, but pingable
>>>>>> lcls-llrfcpu01 - seems fine
>>>>>> lcls-llrfcpu02 - can't ssh, can't ping
>>>>>> Control room hosts (lclsl01-03): can't ssh, but pingable
>>>>>>
>>>>>> Let me follow up with the guy that set up the fileserver and see if we can
>>>>>> get that checked out first. We may need to reboot stuff after that's
>>>>>> sorted out.
>>>>>>
>>>>>> Wesley
>>>>>>
>>>>>> On 9/20/18, 9:39 AM, "Lerftest-ctrls on behalf of Wesley Moore"
>>>>>> <lerftest-ctrls-bounces at jlab.org on behalf of wmoore at jlab.org> wrote:
>>>>>>
>>>>>> Looks like at least lcls-llrfcpu02 needs to be rebooted. Others seem
>>>>>> likely as well. The control room hosts aren't connecting either. Have
>>>>>> you heard anything from Larry?
>>>>>>
>>>>>> Wesley
>>>>>>
>>>>>> On 9/19/18, 7:20 PM, "Lerftest-ctrls on behalf of Sonya Hoobler"
>>>>>> <lerftest-ctrls-bounces at jlab.org on behalf of sonya at slac.stanford.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Wesley, all,
>>>>>>
>>>>>> I just tried a reboot of RF CPU lcls-llrfcpu02 and it never successfully
>>>>>> booted back up.
>>>>>>
>>>>>> I can't view the boot-up sequence because iocConsole is also no longer
>>>>>> working for either CPU:
>>>>>>
>>>>>> [softegr at lclsapp1 iocCommon]$ iocConsole lcls-llrfcpu01
>>>>>> : ssh -x -t -l laci lclsapp2.acc.jlab.org bash -l -c "
>>>>>> pyiocscreen.py -t HIOC lcls-llrfcpu01 lclsts1 2001 "
>>>>>> Read from socket failed: Connection reset by peer
>>>>>> [softegr at lclsapp1 iocCommon]$ iocConsole lcls-llrfcpu02
>>>>>> : ssh -x -t -l laci lclsapp2.acc.jlab.org bash -l -c "
>>>>>> pyiocscreen.py -t HIOC lcls-llrfcpu02 lclsts1 2002 "
>>>>>> Read from socket failed: Connection reset by peer
>>>>>>
>>>>>> I tried a remote reboot of the terminal server.
>>>>>>
>>>>>> I also tried ipmitool (and EPICS ipmi) to remotely restart the CPU.
>>>>>>
>>>>>> But still no signs of life.
>>>>>>
>>>>>> Perhaps you could take a look at the network and/or locally? We may need a
>>>>>> local power-cycle of the CPU and the terminal server. I'm cc'ing Larry
>>>>>> Farrish, who may also be able to help with that.
>>>>>>
>>>>>> This is not super urgent. When you have a chance during your normal
>>>>>> working hours, I'd appreciate any help.
>>>>>>
>>>>>> Thanks,
>>>>>> Sonya
>>>>>>
>>>>>> _______________________________________________
>>>>>> Mailing List: Lerftest-ctrls at jlab.org
>>>>>> https://mailman.jlab.org/mailman/listinfo/lerftest-ctrls
>>>>>> Wiki: https://wiki.jlab.org/lerf/index.php/Network
>>>>>>
>>>>
>>
>
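The question raised in the thread ("Can you see if it's receiving a request?") can also be approached from the client side. The following is a minimal sketch, not the procedure actually used here, of a probe that sends a single TFTP read request (RRQ, as defined in RFC 1350) and reports whether anything answers. Getting no reply within the timeout is roughly what the boot ROM reports as "PXE-E32: TFTP open timeout". The function names, the host name and the boot file name are all hypothetical; the thread does not state which TFTP server address or file the CPUs request.

```python
import socket
import struct

TFTP_PORT = 69                      # standard TFTP port
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5 # opcodes from RFC 1350


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_RRQ)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def classify_reply(packet: bytes) -> str:
    """Map a reply packet to 'data', 'error', 'unexpected' or 'malformed'."""
    if len(packet) < 4:
        return "malformed"
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode == OP_DATA:
        return "data"        # server is up and serving the file
    if opcode == OP_ERROR:
        return "error"       # server is up but refused (e.g. file not found)
    return "unexpected"


def probe_tftp(host: str, filename: str, timeout: float = 5.0,
               port: int = TFTP_PORT) -> str:
    """Send one RRQ and report what, if anything, comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, port))
        try:
            # The reply comes from an ephemeral server port, so use recvfrom.
            packet, _addr = sock.recvfrom(4 + 512)
        except socket.timeout:
            return "timeout"      # nothing answered
        except OSError:
            return "unreachable"  # e.g. an ICMP error surfaced by the OS
    return classify_reply(packet)
```

For example, `probe_tftp("tftp-server.example", "pxelinux.0")` (both arguments are placeholders) returns "data" or "error" if a TFTP daemon is answering, and "timeout" if the request is being dropped or the daemon is not running. The probe does not acknowledge the first data block, so the server will simply retry and give up; it is meant only as a liveness check.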