RESOLVED - Power Distribution Unit issue affecting HPC

UPDATE - 4pm - All of the affected nodes (see the racks and partitions listed below) are back online and operational.  Unfortunately, due to the nature of the problem, all jobs running on the affected nodes were killed.
We apologize for the inconvenience. If there is anything we can do to help, please let us know.

We are experiencing an issue with a power distribution unit for several racks in the HPC.  Running jobs are affected on the following racks:
  1. M32
  2. I29
  3. I30
  4. I31
  5. I32
  6. I35
  7. I36
Jobs in the following partitions are affected:
  • backfill
  • backfill2
  • changlani_q
  • coaps18_q
  • eoas19_q
  • fraser_q
  • genacc_q
  • hongli_q
  • ktaylor_q
  • mecfd18_q
  • medicine_q
  • quicktest
  • rcc_internal
  • sec4m_q
  • stagg_q
  • stata_q
  • stroupe_q
  • yin19_q
In addition, the InfiniBand switch is affected, so jobs in other partitions may be affected as well.
The Systems Team has been deployed, and we hope to have this issue resolved soon.  In the meantime, we'll post updates to this page as soon as we have them.