August Maintenance Draft Schedule

Due to the situation around COVID-19, we had to reschedule our maintenance, originally planned for May 2020, to August 2020.  The situation is changing every day, but as of today, the maintenance is still scheduled for the week of August 3 - 7.

Affected services

The affected services include:

  • all HPC and Spear services, including login nodes, parallel storage, and compute nodes,
  • all Research Archival volumes,
  • all VMs, including those that are hosted for customers

Services not affected include:

  • Most data center hosting customers will remain online; we've already reached out and have been working with customers affected by the maintenance.

Scope of work

During this maintenance, we will upgrade all major software on the HPC and Spear. Among other things, we will:

  1. upgrade the software that powers our parallel storage system (GPFS)
  2. perform hardware maintenance on the Research Archival System
  3. improve our power infrastructure
  4. upgrade our scheduler software, Slurm, to the latest version (v20.02 as of the time of this article)
  5. reorganize part of our network configuration and update firmware on all of our switches
  6. update the software on our database server
  7. optimize our HPC InfiniBand network

We originally reported that most services wouldn't be down for the entire week, but as we move closer to the scheduled maintenance date, we realize that is improbable in practice.  We will, however, notify you if any services can resume earlier than we anticipate.

Draft schedule

We plan to send out daily notices throughout the week.  This schedule is subject to change, and we will notify you if and when it does.

  • Friday, July 31 at 9am
    • We will begin draining HPC compute nodes and disable new job submissions.  Draining means that we will configure each node to shut off once all the jobs running on it complete.
  • Monday, August 3 at 7am
    • We will disable access to the following systems and services:
      • HPC Login nodes
      • Spear nodes
      • Export nodes (GPFS and Archival storage) and Globus
    • Lenovo consultants will begin maintenance on the storage system software (GPFS and Archival) promptly at 7am.  All users who wish to retrieve data from the system should do so by this time.
  • Tuesday, August 4 at 9am
    • Conditioned Air and Power will arrive to perform work on Power Distribution Unit "D".
    • Affected colocation customers have already been notified, and we are working with individual campus units to minimize impact. Nevertheless, send us a message if you have any concerns or questions.
  • Wednesday & Thursday, August 5 and 6
    • The above work will continue.
  • Friday, August 7 at 5pm
    • We expect all systems to be back online by this time, but we will let you know if any issues remain.

Questions or issues?

If we are able to provide access to any service earlier than expected, we will do so and notify you.

If you have any questions, issues, or requests, please let us know: