[ace] Rough Plan for Monday Power Outage

Anthony Cuffe cuffe at jlab.org
Fri Jun 9 14:51:07 EDT 2023


=================
Shutdown Summary
=================

Note that I will be on-site and will coordinate activities that overlap to ensure we are not stepping on each other.

7:15:

  *   Swap power supply in devmc02sw1 and power it from Bertha, turn off 2nd PS (Brad)

7:30:

  *   Move power for 1 power supply in opsmc02sw1 to Bertha, turn off 2nd PS (Brad)
  *   Move 1 power supply for opsfs, devfs and csmfs to Bertha (Anthony)
  *   Shut down MYA Nodes (Chris)

7:45:

  *   Make a log entry and send out a notification email (Anthony)

8:00:

  *   Shut down Non-Critical and Bare Metal Servers (Anthony)

8:20:

  *   Shut down all Virtual Machines and then Hypervisors (Erik)
  *   Shut down Database Systems (Theo/Anthony)

8:30:

  *   Shut down srffs, itffs, felfs and csml00 (Anthony)

8:40:

  *   Force router switchover from VSS1 to VSS2 (Brad)
  *   Force switchover from firewall1 to firewall2 (Brad)
  *   Shut down remaining network items (Brad)

8:??

  *   Turn off rack UPSs to ensure recovery order and avoid surges
  *   Notify Facilities they can proceed with power work.
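
To help with coordinating overlapping work, the timed steps above could be kept as data and regrouped per person. The following is only a rough sketch: the names SHUTDOWN_STEPS and steps_by_owner are made up for illustration, the task text is abbreviated from the list above, and the unassigned 8:?? steps are left out.

```python
from collections import defaultdict

# (time, task, owners) copied from the shutdown list; wording is abbreviated.
SHUTDOWN_STEPS = [
    ("7:15", "Swap power supply in devmc02sw1, power from Bertha, turn off 2nd PS", ("Brad",)),
    ("7:30", "Move 1 power supply in opsmc02sw1 to Bertha, turn off 2nd PS", ("Brad",)),
    ("7:30", "Move 1 power supply for opsfs, devfs and csmfs to Bertha", ("Anthony",)),
    ("7:30", "Shut down MYA nodes", ("Chris",)),
    ("7:45", "Log entry and notification email", ("Anthony",)),
    ("8:00", "Shut down non-critical and bare metal servers", ("Anthony",)),
    ("8:20", "Shut down all virtual machines, then hypervisors", ("Erik",)),
    ("8:20", "Shut down database systems", ("Theo", "Anthony")),
    ("8:30", "Shut down srffs, itffs, felfs and csml00", ("Anthony",)),
    ("8:40", "Force router switchover from VSS1 to VSS2", ("Brad",)),
    ("8:40", "Force switchover from firewall1 to firewall2", ("Brad",)),
    ("8:40", "Shut down remaining network items", ("Brad",)),
]


def steps_by_owner(steps):
    """Group (time, task) pairs under each owner, preserving list order."""
    by_owner = defaultdict(list)
    for time, task, owners in steps:
        for owner in owners:
            # A shared step (e.g. Theo/Anthony) shows up under both names.
            by_owner[owner].append((time, task))
    return dict(by_owner)
```

Looping over steps_by_owner(SHUTDOWN_STEPS).items() and printing each entry would give every person their own checklist in time order.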

========
Recovery
========

Recover Network (Brad)

  *   Verify VSS1 and firewall1 are up
  *   Force switchover from VSS2 to VSS1 and verify; force switchover from firewall2 to firewall1 and verify
  *   opsmc02sw1 - turn on 2nd PS, move power for 1st supply back to UPS power
  *   devmc02sw1 - turn on 2nd PS, swap 1st supply back to original and connect to UPS
  *   Verify all network switches in MCC are up (script)
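
The script used for the switch check is not included in this message. As a minimal sketch of what such a check might look like, assuming a Linux host with the iputils ping command: the SWITCHES list below is hypothetical and holds only the two switch names mentioned in this plan, and ping_once and find_down are illustrative names, not the real script.

```python
import subprocess
from typing import Callable, Iterable, List

# Hypothetical list; the real MCC switch inventory is not given in this plan.
SWITCHES = ["opsmc02sw1", "devmc02sw1"]


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to host gets a reply.

    Assumes Linux iputils ping, where -c is the packet count and
    -W is the reply timeout in seconds.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # ping binary missing or not executable: treat as unreachable.
        return False
    return result.returncode == 0


def find_down(hosts: Iterable[str],
              probe: Callable[[str], bool] = ping_once) -> List[str]:
    """Return the hosts for which the probe fails, in input order."""
    return [host for host in hosts if not probe(host)]
```

Calling find_down(SWITCHES) returns the names that did not answer; an empty list means every listed switch replied. The probe argument is there so a different check (SNMP, SSH, etc.) can be swapped in.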

Recover Systems: (Anthony, Erik and Theo)

  *   Move PS for opsfs, devfs and csmfs back to UPS (Anthony)
  *   Recover srffs, felfs, itffs and database systems (Theo/Anthony)
  *   Recover VMware nodes and VMs (Erik)
  *   Recover MYA nodes (Chris/Anthony)
  *   Recover remaining UPSs/systems (Theo/Erik/Anthony)

Post Recovery: (Team)

  *   Verify database and web services. (Theo L. and Ryan)
  *   Reboot angry remote systems.
  *   Restart/Verify other services using nagios, pingnode, etc ...
  *   Make logbook entry and email users.
  *   Assist ACS with IOC recovery operations.
  *   Coolie at Magic Mushroom



